What does the Facebook experiment teach us?

July 1, 2014

I’m intrigued by the reaction that has unfolded around the Facebook “emotion contagion” study. (If you aren’t familiar with this, read this primer.) As others have pointed out, the practice of A/B testing content is quite common. And Facebook has a long history of experimenting on how it can influence people’s attitudes and practices, even in the realm of research. An earlier study showed that Facebook decisions could shape voters’ practices. But why is it that *this* study has sparked a firestorm?

In asking people about this, I’ve been given two dominant reasons:

  1. People’s emotional well-being is sacred.
  2. Research is different than marketing practices.

I don’t find either of these responses satisfying.

The Consequences of Facebook’s Experiment

Facebook’s research team is not truly independent of product. They have a license to do research and publish it, provided that it contributes to the positive development of the company. If Facebook knew that this research would spark the negative PR backlash, they never would’ve allowed it to go forward or be published. I can only imagine the ugliness of the fight inside the company now, but I’m confident that PR is demanding silence from researchers.

I do believe that the research was intended to be helpful to Facebook. So what was the intended positive contribution of this study? I get the sense from Adam Kramer’s comments that the goal was to determine if content sentiment could affect people’s emotional response after being on Facebook. In other words, given that Facebook wants to keep people on Facebook, if people came away from Facebook feeling sadder, presumably they’d not want to come back to Facebook again. Thus, it’s in Facebook’s best interest to leave people feeling happier. And this study suggests that the sentiment of the content influences this. This suggests that one applied take-away for product is to downplay negative content. Presumably this is better for users and better for Facebook.

We can debate all day long as to whether or not this is what that study actually shows, but let’s work with this for a second. Let’s say that pre-study Facebook showed 1 negative post for every 3 positive and now, because of this study, Facebook shows 1 negative post for every 10 positive ones. If that’s the case, was the one week treatment worth the outcome for longer term content exposure? Who gets to make that decision?
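To make that hypothetical concrete, here is a quick back-of-the-envelope calculation. The 1:3 and 1:10 ratios are the illustrative numbers from the paragraph above, not figures Facebook has published, and the function name is invented for this sketch.

```python
from fractions import Fraction


def negative_share(negative: int, positive: int) -> Fraction:
    """Fraction of all shown posts that are negative, given a negative:positive ratio."""
    return Fraction(negative, negative + positive)


# Hypothetical ratios taken from the thought experiment, not real Facebook data.
before = negative_share(1, 3)   # 1 negative post for every 3 positive
after = negative_share(1, 10)   # 1 negative post for every 10 positive

print(f"before: {float(before):.1%}")  # 25.0%
print(f"after:  {float(after):.1%}")   # 9.1%
```

Under these assumed numbers, the share of negative posts in a feed would drop from a quarter to roughly a tenth, which is the kind of long-term shift in exposure the question is weighing against a one-week treatment.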

Folks keep talking about all of the potential harm that could’ve happened by the study – the possibility of suicides, the mental health consequences. But what about the potential harm of negative content on Facebook more generally? Even if we believe that there were subtle negative costs to those who received the treatment, the ongoing costs of negative content on Facebook every week other than that 1 week experiment must be more costly. How then do we account for positive benefits to users if Facebook increased positive treatments en masse as a result of this study? Of course, the problem is that Facebook is a black box. We don’t know what they did with this study. The only thing we know is what is published in PNAS and that ain’t much.

Of course, if Facebook did make the content that users see more positive, should we simply be happy? What would it mean that you’re more likely to see announcements from your friends when they are celebrating a new child or a fun night on the town, but less likely to see their posts when they’re offering depressive missives or angsting over a relationship in shambles? If Alice is happier when she is oblivious to Bob’s pain because Facebook chooses to keep that from her, are we willing to sacrifice Bob’s need for support and validation? This is a hard ethical choice at the crux of any decision of what content to show when you’re making choices. And the reality is that Facebook is making these choices every day without oversight, transparency, or informed consent.

Algorithmic Manipulation of Attention and Emotions

Facebook actively alters the content you see. Most people focus on the practice of marketing, but most of what Facebook’s algorithms do involves curating content to provide you with what they think you want to see. Facebook algorithmically determines which of your friends’ posts you see. They don’t do this for marketing reasons. They do this because they want you to want to come back to the site day after day. They want you to be happy. They don’t want you to be overwhelmed. Their everyday algorithms are meant to manipulate your emotions. What factors go into this? We don’t know.

Facebook is not alone in algorithmically predicting what content you wish to see. Any recommendation system or curatorial system is prioritizing some content over others. But let’s compare what we glean from this study with standard practice. Most sites, from major news media to social media, have some algorithm that shows you the content that people click on the most. This is what drives media entities to produce listicles, flashy headlines, and car crash news stories. What do you think garners more traffic – a detailed analysis of what’s happening in Syria or 29 pictures of the cutest members of the animal kingdom? Part of what media learned long ago is that fear and salacious gossip sell papers. 4chan taught us that grotesque imagery and cute kittens work too. What this means online is that stories about child abductions, dangerous islands filled with snakes, and celebrity sex tape scandals are often the most clicked on, retweeted, favorited, etc. So an entire industry has emerged to produce crappy click bait content under the banner of “news.”
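The “show what people click on the most” logic can be sketched in a few lines. This is a deliberately naive illustration, not any real site’s ranking code: the `Story` type, the `rank_by_clicks` function, and the click counts are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Story:
    headline: str
    clicks: int  # invented numbers, for illustration only


def rank_by_clicks(stories: list[Story]) -> list[Story]:
    """Order stories purely by how often they were clicked, most-clicked first."""
    return sorted(stories, key=lambda s: s.clicks, reverse=True)


stories = [
    Story("Detailed analysis of what is happening in Syria", clicks=1_200),
    Story("29 pictures of the cutest members of the animal kingdom", clicks=48_000),
    Story("Dangerous island filled with snakes", clicks=31_000),
]

ranked = rank_by_clicks(stories)
for story in ranked:
    print(f"{story.clicks:>6}  {story.headline}")
```

Nothing in a ranking like this asks whether a story is accurate, important, or good for the reader. Popularity is the only signal, which is exactly why it rewards fear and cuteness.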

Guess what? When people are surrounded by fear-mongering news media, they get anxious. They fear the wrong things. Moral panics emerge. And yet, we as a society believe that it’s totally acceptable for news media – and its click bait brethren – to manipulate people’s emotions through the headlines they produce and the content they cover. And we generally accept that algorithmic curators are perfectly well within their right to prioritize that heavily clicked content over others, regardless of the psychological toll on individuals or the society. What makes their practice different? (Other than the fact that the media wouldn’t hold itself accountable for its own manipulative practices…)

Somehow, shrugging our shoulders and saying that we promoted content because it was popular is acceptable because those actors don’t voice that their intention is to manipulate your emotions so that you keep viewing their reporting and advertisements. And it’s also acceptable to manipulate people for advertising because that’s just business. But when researchers admit that they’re trying to learn if they can manipulate people’s emotions, they’re shunned. What this suggests is that the practice is acceptable, but admitting the intention and being transparent about the process is not.

But Research is Different!!

As this debate has unfolded, whenever people point out that these business practices are commonplace, folks respond by highlighting that research or science is different. What unfolds is a high-browed notion about the purity of research and its exclusive claims on ethical standards.

Do I think that we need to have a serious conversation about informed consent? Absolutely. Do I think that we need to have a serious conversation about the ethical decisions companies make with user data? Absolutely. But I do not believe that this conversation should ever apply just to that which is categorized under “research.” Nor do I believe that academe is necessarily providing a golden standard.

Academe has many problems that need to be accounted for. Researchers are incentivized to figure out how to get through IRBs rather than to think critically and collectively about the ethics of their research protocols. IRBs are incentivized to protect the university rather than truly work out an ethical framework for these issues. Journals relish corporate datasets even when replicability is impossible. And for that matter, even in a post-paper era, journals have ridiculous word count limits that demotivate researchers from spelling out all of the gory details of their methods. But there are also broader structural issues. Academe is so stupidly competitive and peer review is so much of a game that researchers have little incentive to share their studies-in-progress with their peers for true feedback and critique. And the status games of academe reward those who get access to private coffers of data while prompting those who don’t to chastise those who do. And there’s generally no incentive for corporates to play nice with researchers unless it helps their prestige, hiring opportunities, or product.

IRBs are an abysmal mechanism for actually accounting for ethics in research. By and large, they’re structured to make certain that the university will not be liable. Ethics aren’t a checklist. Nor are they a universal. Navigating ethics involves a process of working through the benefits and costs of a research act and making a conscientious decision about how to move forward. Reasonable people differ on what they think is ethical. And disciplines have different standards for how to navigate ethics. But we’ve trained an entire generation of scholars that ethics equals “that which gets past the IRB” which is a travesty. We need researchers to systematically think about how their practices alter the world in ways that benefit and harm people. We need ethics to not just be tacked on, but to be an integral part of how *everyone* thinks about what they study, build, and do.

There’s a lot of research that has serious consequences on the people who are part of the study. I think about the work that some of my colleagues do with child victims of sexual abuse. Getting children to talk about these awful experiences can be quite psychologically tolling. Yet, better understanding what they experienced has huge benefits for society. So we make our trade-offs and we do research that can have consequences. But what warms my heart is how my colleagues work hard to help those children by providing counseling immediately following the interview (and, in some cases, follow-up counseling). They think long and hard about each question they ask, and how they go about asking it. And yet most IRBs wouldn’t let them do this work because no university wants to touch anything that involves kids and sexual abuse. Doing research involves trade-offs and finding an ethical path forward requires effort and risk.

It’s far too easy to say “informed consent” and then not take responsibility for the costs of the research process, just as it’s far too easy to point to an IRB as proof of ethical thought. For any study that involves manipulation – common in economics, psychology, and other social science disciplines – people are only so informed about what they’re getting themselves into. You may think that you know what you’re consenting to, but do you? And then there are studies like discrimination audit studies in which we purposefully don’t inform people that they’re part of a study. So what are the right trade-offs? When is it OK to eschew consent altogether? What does it mean to truly be informed? When is being informed not enough? These aren’t easy questions and there aren’t easy answers.

I’m not necessarily saying that Facebook made the right trade-offs with this study, but I think that the scholarly reaction that research is only acceptable with an IRB plus informed consent is disingenuous. Of course, a huge part of what’s at stake has to do with the fact that what counts as a contract legally is not the same as consent. Most people haven’t consented to all of Facebook’s terms of service. They’ve agreed to a contract because they feel as though they have no other choice. And this really upsets people.

A Different Theory

The more I read people’s reactions to this study, the more that I’ve started to think that the outrage has nothing to do with the study at all. There is a growing amount of negative sentiment towards Facebook and other companies that collect and use data about people. In short, there’s anger at the practice of big data. This paper provided ammunition for people’s anger because it’s so hard to talk about harm in the abstract.

For better or worse, people imagine that Facebook is offered by a benevolent dictator, that the site is there to enable people to better connect with others. In some senses, this is true. But Facebook is also a company. And a public company for that matter. It has to find ways to become more profitable with each passing quarter. This means that it designs its algorithms not just to market to you directly but to convince you to keep coming back over and over again. People have an abstract notion of how that operates, but they don’t really know, or even want to know. They just want the hot dog to taste good. Whether it’s couched as research or operations, people don’t want to think that they’re being manipulated. So when they find out what soylent green is made of, they’re outraged. This study isn’t really what’s at stake. What’s at stake is the underlying dynamic of how Facebook runs its business, operates its system, and makes decisions that have nothing to do with how its users want Facebook to operate. It’s not about research. It’s a question of power.

I get the anger. I personally loathe Facebook and I have for a long time, even as I appreciate and study its importance in people’s lives. But on a personal level, I hate the fact that Facebook thinks it’s better than me at deciding which of my friends’ posts I should see. I hate that I have no meaningful mechanism of control on the site. And I am painfully aware of how my sporadic use of the site has confused their algorithms so much that what I see in my newsfeed is complete garbage. And I resent the fact that because I barely use the site, the only way that I could actually get a message out to friends is to pay to have it posted. My minimal use has made me an algorithmic pariah and if I weren’t technologically savvy enough to know better, I would feel as though I’ve been shunned by my friends rather than simply deemed unworthy by an algorithm. I also refuse to play the game to make myself look good before the altar of the algorithm. And every time I’m forced to deal with Facebook, I can’t help but resent its manipulations.

There’s also a lot that I dislike about the company and its practices. At the same time, I’m glad that they’ve started working with researchers and started publishing their findings. I think that we need more transparency in the algorithmic work done by these kinds of systems and their willingness to publish has been one of the few ways that we’ve gleaned insight into what’s going on. Of course, I also suspect that the angry reaction from this study will prompt them to clamp down on allowing researchers to be remotely public. My gut says that they will naively respond to this situation as though the practice of research is what makes them vulnerable rather than their practices as a company as a whole. Beyond what this means for researchers, I’m concerned about what increased silence will mean for a public who has no clue of what’s being done with their data, who will think that no new report of terrible misdeeds means that Facebook has stopped manipulating data.

Information companies aren’t the same as pharmaceuticals. They don’t need to do clinical trials before they put a product on the market. They can psychologically manipulate their users all they want without being remotely public about exactly what they’re doing. And as the public, we can only guess what the black box is doing.

There’s a lot that needs to be reformed here. We need to figure out how to have a meaningful conversation about corporate ethics, regardless of whether it’s couched as research or not. But it’s not so simple as saying that a lack of a corporate IRB or a lack of a golden standard “informed consent” means that a practice is unethical. Almost all manipulations that take place by these companies occur without either one of these. And they go unchecked because they aren’t published or public.

Ethical oversight isn’t easy and I don’t have a quick and dirty solution to how it should be implemented. But I do have a few ideas. For starters, I’d like to see any company that manipulates user data create an ethics board. Not an IRB that approves research studies, but an ethics board that has visibility into all proprietary algorithms that could affect users. For public companies, this could be done through the ethics committee of the Board of Directors. But rather than simply consisting of board members, I think that it should consist of scholars and users. I also think that there needs to be a mechanism for whistleblowing regarding ethics from within companies because I’ve found that many employees of companies like Facebook are quite concerned by certain algorithmic decisions, but feel as though there’s no path to responsibly report concerns without going fully public. This wouldn’t solve all of the problems, nor am I convinced that most companies would do so voluntarily, but it is certainly something to consider. More than anything, I want to see users have the ability to meaningfully influence what’s being done with their data and I’d love to see a way for their voices to be represented in these processes.

I’m glad that this study has prompted an intense debate among scholars and the public, but I fear that it’s turned into a simplistic attack on Facebook over this particular study rather than a nuanced debate over how we create meaningful ethical oversight in research and practice. The lines between research and practice are always blurred and information companies like Facebook make this increasingly salient. No one benefits by drawing lines in the sand. We need to address the problem more holistically. And, in the meantime, we need to hold companies accountable for how they manipulate people across the board, regardless of whether or not it’s couched as research. If we focus too much on this study, we’ll lose track of the broader issues at stake.

Corrupt Personalization

June 26, 2014

(“And also Bud Light.”)

In my last two posts I’ve been writing about my attempt to convince a group of sophomores with no background in my field that there has been a shift to the algorithmic allocation of attention, and that this is important. In this post I’ll respond to a student question. My favorite: “Sandvig says that algorithms are dangerous, but what are the most serious repercussions that he envisions?” What is the coming social media apocalypse we should be worried about?

This is an important question because people who study this stuff are NOT as interested in this student question as they should be. Frankly, we are specialists who study media and computers and things — therefore we care about how algorithms allocate attention among cultural products almost for its own sake. Because this is the central thing that we study, we don’t spend a lot of time justifying it.

And our field’s most common response to the query “what are the dangers?” often lacks the required sense of danger. The most frequent response is: “extensive personalization is bad for democracy.” (a.k.a. Pariser’s “filter bubble,” Sunstein’s “egocentric” Internet, and so on). This framing lacks a certain house-on-fire urgency, doesn’t it?

(sarcastic tone:) “Oh, no! I’m getting to watch, hear, and read exactly what I want. Help me! Somebody do something!”

Sometimes (as Hindman points out) the contention is the opposite, that Internet-based concentration is bad for democracy.  But remember that I’m not speaking to political science majors here. The average person may not be as moved by an abstract, long-term peril to democracy as the average political science professor. As David Weinberger once said after I warned about the increasing reliance on recommendation algorithms, “So what?” Personalization sounds like a good thing.

As a side note, the second most frequent response I see is that algorithms are now everywhere. And they work differently than what came before. This also lacks a required sense of danger! Yes, they’re everywhere, but if they are a good thing…

So I really like this question “what are the most serious repercussions?” because I think there are some elements of the shift to attention-sorting algorithms that are genuinely “dangerous.” I can think of at least two, probably more, and they don’t get enough attention. In the rest of this post I’ll spell out the first one which I’ll call “corrupt personalization.”

Here we go.

Common-sense reasoning about algorithms and culture tells us that the purveyors of personalized content have the same interests we do. That is, if Netflix started recommending only movies we hate or Google started returning only useless search results we would stop using them. However: Common sense is wrong in this case. Our interests are often not the same as those of the providers of these selection algorithms. As in my last post, let’s work through a few concrete examples to make the case.

In this post I’ll use Facebook examples, but the general problem of corrupt personalization is present on all of our media platforms in wide use that employ the algorithmic selection of content.

(1) Facebook “Like” Recycling

(Image from ReadWriteWeb.)

On Facebook, in addition to advertisements along the side of the interface, perhaps you’ve noticed “featured,” “sponsored,” or “suggested” stories that appear inside your news feed, intermingled with status updates from your friends. It could be argued that this is not in your interest as a user (did you ever say, “gee, I’d like ads to look just like messages from my friends”?), but I have bigger fish to fry.

Many ads on Facebook resemble status updates in that there can be messages endorsing the ads with “likes.” For instance, here is an older screenshot from ReadWriteWeb:

[Image: “Pages you may like” on Facebook]

Another example: a “suggested” post was mixed into my news feed just this morning recommending World Cup coverage on Facebook itself. It’s a Facebook ad for Facebook, in other words.  It had this intriguing addendum:

[Image: “CENSORED likes Facebook”]

So, wait… I have hundreds of friends and eleven of them “like” Facebook?  Did they go to http://www.facebook.com and click on a button like this:

[Image: Facebook “Like” button, magnified]

But facebook.com doesn’t even have a “Like” button!  Did they go to Facebook’s own Facebook page (yes, there is one) and click “Like”? I know these people and that seems unlikely. And does Nicolala really like Walmart? Hmmm…

What does this “like” statement mean? Welcome to the strange world of “like” recycling. Facebook has defined “like” in ways that depart from English usage.  For instance, in the past Facebook has determined that:

  1. Anyone who clicks on a “like” button is considered to have “liked” all future content from that source. So if you clicked a “like” button because someone shared a “Fashion Don’t” from Vice magazine, you may be surprised when your dad logs into Facebook three years later and is shown a current sponsored story from Vice.com like “Happy Masturbation Month!” or “How to Make it in Porn” with the endorsement that you like it. (Vice.com example is from Craig Condon [NSFW].)
  2. Anyone who “likes” a comment on a shared link is considered to “like” wherever that link points to, a.k.a. “liking a share.” So if you see a (real) FB status update from a (real) friend that says “Yuck! The McLobster is a disgusting product idea!” and your (real) friend includes a (real) link like this one, then if you click “like” your friends may see McDonald’s ads in the future that include the phrase “(Your Name) likes McDonalds.” (This example is from ReadWriteWeb.)

This has led to some interesting results, like dead people “liking” current news stories on Facebook.

There is already controversy about advertiser “like” inflation, “like” spam, and fake “likes,” and these things may be a problem too, but that’s not what we are talking about here. In the examples above the system is working as Facebook designed it to. A further caveat: note that the definition of “like” in Facebook’s software changes periodically and when they are sued. Facebook now has an opt-out setting for the above two “features.”
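The two “like” rules described above can be sketched as a toy model. To be clear, this is not Facebook’s code, which is not public. The `LikeLedger` class, its method names, and the `.example` URLs are all hypothetical, and the sketch assumes a “source” is simply the domain a link points to.

```python
from urllib.parse import urlparse


class LikeLedger:
    """Toy model: a "like" is recorded against a whole source, not a single item."""

    def __init__(self) -> None:
        # Maps a source domain to the set of users counted as "liking" it.
        self._likes: dict[str, set[str]] = {}

    @staticmethod
    def _source(url: str) -> str:
        # Assumption for this sketch: the source is the link's domain.
        return urlparse(url).netloc.lower()

    def like_item(self, user: str, item_url: str) -> None:
        # Rule 1: liking one item counts as liking all content from its source.
        self._likes.setdefault(self._source(item_url), set()).add(user)

    def like_share(self, user: str, linked_url: str) -> None:
        # Rule 2: liking a friend's comment on a shared link counts as
        # liking wherever that link points.
        self.like_item(user, linked_url)

    def endorsers(self, ad_url: str) -> set[str]:
        # Users who can be shown as "liking" an ad from this source.
        return set(self._likes.get(self._source(ad_url), set()))


ledger = LikeLedger()
ledger.like_item("you", "https://vice.example/fashion-dont")
ledger.like_share("you", "https://mcdonalds.example/mclobster")

# Three years later, an unrelated sponsored story from the same source:
print(ledger.endorsers("https://vice.example/some-later-sponsored-story"))
```

The point of the sketch is the mismatch in granularity: the user acts on one item (or on a friend’s sarcastic comment), but the ledger only remembers the source, so the endorsement can be attached to anything that source publishes later.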

But these incendiary examples are exceptional fiascoes; on the whole the system probably works well. You likely didn’t know that your “like” clicks are merrily producing ads on your friends’ pages and in your name because you cannot see them. These “stories” do not appear on your news feed and cannot be individually deleted.

Unlike the examples from my last post you can’t quickly reproduce these results with certainty on your own account. Still, if you want to try, make a new Facebook account under a fake name (warning! dangerous!) and friend your real account. Then use the new account to watch your status updates.

Why would Facebook do this? Obviously it is a controversial practice that is not going to be popular with users. Yet Facebook’s business model is to produce attention for advertisers, not to help you — silly rabbit. So they must have felt that using your reputation to produce more ad traffic from your friends was worth the risk of irritating you. Or perhaps they thought that the practice could be successfully hidden from users — that strategy has mostly worked!

In sum this is a personalization scheme that does not serve your goals, it serves Facebook’s goals at your expense.

(2) “Organic” Content

This second group of examples concerns content that we consider to be “not advertising,” a.k.a. “organic” content. Funnily enough, algorithmic culture has produced this new use of the word “organic” — but has also made the boundary between “advertising” and “not advertising” very blurry.


The general problem is that there are many ways in which algorithms act as mixing valves between things that can be easily valued with money (like ads) and things that can’t. And this kind of mixing is a normative problem (what should we do) and not a technical problem (how do we do it).

For instance, for years Facebook has encouraged nonprofits, community-based organizations, student clubs, other groups, and really anyone to host content on facebook.com.  If an organization creates a Facebook page for itself, the managers can update the page as though it were a profile.

Most page managers expect that people who “like” that page get to see the updates… which was true until January of this year. At that time Facebook modified its algorithm so that text updates from organizations were not widely shared. This is interesting for our purposes because Facebook clearly states that it wants page operators to run Facebook ad campaigns, and not to count on getting traffic from “organic” status updates, as it will no longer distribute as many of them.

This change likely has a very differential effect on, say, Nike’s Facebook page, a small local business’s Facebook page, Greenpeace International’s Facebook page, and a small local church congregation’s Facebook page. If you start a Facebook page for a school club, you might be surprised that you are spending your labor writing status updates that are never shown to anyone. Maybe you should buy an ad. Here’s an analytic for a page I manage:

[Image: “This week” page-likes analytics on Facebook]

The impact isn’t just about size — at some level businesses might expect to have to insert themselves into conversations via persuasive advertising that they pay for, but it is not as clear that people expect Facebook to work this way for their local church or other domains of their lives. It’s as if on Facebook, people were using the yellow pages but they thought they were using the white pages.  And also there are no white pages.

(Oh, wait. No one knows what yellow pages and white pages are anymore. Scratch that reference, then.)

No need to stop here, in the future perhaps Facebook can monetize my family relationships. It could suggest that if I really want anyone to know about the birth of my child, or I really want my “insightful” status updates to reach anyone, I should turn to Facebook advertising.

Let me also emphasize that this mixing problem extends to the content of our personal social media conversations as well. A few months back, I posted a Facebook status update that I thought was humorous. I shared a link highlighting the hilarious product reviews for the Bic “Cristal For Her” ballpoint pen on Amazon. It’s a pen designed just for women.

The funny thing is that I happened to look at a friend of mine’s Facebook feed over their shoulder, and my status update didn’t go away. It remained, pegged at the top of my friend’s news feed, for as long as 14 days in one instance. What great exposure for my humor, right? But it did seem a little odd… I queried my other friends on Facebook and some confirmed that the post was also pegged at the top of their news feed.

I was unknowingly participating in another Facebook program that converts organic status updates into ads. It does this by changing their order in the news feed and adding the text “Sponsored” in light gray, which is very hard to see. Otherwise at least some updates are not changed. I suspect Facebook’s algorithm thought I was advertising Amazon (since that’s where the link pointed), but I am not sure.

This is similar to Twitter’s “Promoted Tweets” but there is one big difference.  In the Facebook case the advertiser promotes content — my content — that they did not write. In effect Facebook is re-ordering your conversations with your friends and family on the basis of whether or not someone mentioned Coke, Levi’s, and Anheuser Busch (confirmed advertisers in the program).

Sounds like a great personal social media strategy there–if you really want people to know about your forthcoming wedding, maybe just drop a few names? Luckily the algorithms aren’t too clever about this yet so you can mix up the word order for humorous effect.

(Facebook status update:) “I am so delighted to be engaged to this wonderful woman that I am sitting here in my Michelob drinking a Docker’s Khaki Collection. And also Coke.”

Be sure to use links. I find the interesting thing about this mixing of the commercial and non-commercial to be that it sounds to my ears like some sort of corny, unrealistic science fiction scenario and yet with the current Facebook platform I believe the above example would work. We are living in the future.
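The reordering described above can be sketched as a small toy feed. This is only an illustration of the idea, since the real selection logic is not public. It assumes naive keyword matching against the advertisers named earlier, and the `Post` type, `build_feed` function, and sample posts are invented for the example.

```python
from dataclasses import dataclass

# Advertisers named in the text; matching on these keywords is an assumption.
ADVERTISERS = {"coke", "levi's", "anheuser busch"}


@dataclass
class Post:
    author: str
    text: str
    sponsored: bool = False


def mentions_advertiser(post: Post, advertisers: set[str]) -> bool:
    """Naive check: does the post text contain any advertiser name?"""
    text = post.text.lower()
    return any(name in text for name in advertisers)


def build_feed(posts: list[Post], advertisers: set[str]) -> list[Post]:
    """Take posts in their normal order; pin advertiser mentions to the top."""
    labeled = [
        Post(p.author, p.text, sponsored=mentions_advertiser(p, advertisers))
        for p in posts
    ]
    # sorted() is stable, so the normal order survives within each group.
    return sorted(labeled, key=lambda p: not p.sponsored)


posts = [
    Post("Ana", "New job starts Monday!"),
    Post("Ben", "I am so delighted to be engaged to this wonderful woman that "
                "I am sitting here in my Michelob drinking a Docker’s Khaki "
                "Collection. And also Coke."),
    Post("Cy", "Anyone up for a hike this weekend?"),
]

feed = build_feed(posts, ADVERTISERS)
for post in feed:
    label = " [Sponsored]" if post.sponsored else ""
    print(f"{post.author}{label}: {post.text[:40]}")
```

Because the match is a bare substring test, the scrambled engagement announcement still gets pinned: word order does not matter, only the presence of a brand name.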

So to recap, if Nike makes a Facebook page and posts status updates to it, that’s “organic” content because they did not pay Facebook to distribute it. Although any rational human being would see it as an ad. If my school group does the same thing, that’s also organic content, but they are encouraged to buy distribution — which would make it inorganic. If I post a status update or click “like” in reaction to something that happens in my life and that happens to involve a commercial product, my action starts out as organic, but then it becomes inorganic (paid for) because a company can buy my words and likes and show them to other people without telling me. Got it? This paragraph feels like we are rethinking CHEM 402.

The upshot is that control of the content selection algorithm is used by Facebook to get people to pay for things they wouldn’t expect to pay for, and to show people personalized things that they don’t think are paid for. But these things were in fact paid for.  In sum this is again a scheme that does not serve your goals, it serves Facebook’s goals at your expense.

The Danger: Corrupt Personalization

With these concrete examples behind us, I can now more clearly answer this student question. What are the most serious repercussions of the algorithmic allocation of attention?

I’ll call this first repercussion “corrupt personalization” after C. Edwin Baker. (Baker, a distinguished legal philosopher, coined the phrase “corrupt segmentation” in 1998 as an extension of the theories of philosopher Jürgen Habermas.)

Here’s how it works: You have legitimate interests that we’ll call “authentic.” These interests arise from your values, your community, your work, your family, how you spend your time, and so on. A good example might be that as a person who is enrolled in college you might identify with the category “student,” among your many other affiliations. As a student, you might be authentically interested in an upcoming tuition increase or, more broadly, about the contention that “there are powerful forces at work in our society that are actively hostile to the college ideal.”

However, you might also be authentically interested in the fact that your cousin is getting married. Or in pictures of kittens.


Corrupt personalization is the process by which your attention is drawn to interests that are not your own. This is a little tricky because it is impossible to clearly define an “authentic” interest. However, let’s put that off for the moment.

In the prior examples we saw some (I hope) obvious places where my interests diverged from that of algorithmic social media systems. Highlights for me were:

  • When I express my opinion about something to my friends and family, I do not want that opinion re-sold without my knowledge or consent.
  • When I explicitly endorse something, I don’t want that endorsement applied to other things that I did not endorse.
  • If I want to read a list of personalized status updates about my friends and family, I do not want my friends and family sorted by how often they mention advertisers.
  • If a list of things is chosen for me, I want the results organized by some measure of goodness for me, not by how much money someone has paid.
  • I want paid content to be clearly identified.
  • I do not want my information technology to sort my life into commercial and non-commercial content and systematically de-emphasize the noncommercial things that I do, or turn these things toward commercial purposes.

More generally, I think the danger of corrupt personalization is manifest in three ways.

  1. Things that are not necessarily commercial become commercial because of the organization of the system. (Merton called this “pseudo-gemeinschaft,” Habermas called it “colonization of the lifeworld.”)
  2. Money is used as a proxy for “best” and it does not work. That is, those with the most money to spend can prevail over those with the most useful information. The creation of a salable audience takes priority over your authentic interests. (Smythe called this the “audience commodity,” it is Baker’s “market filter.”)
  3. Over time, if people are offered things that are not aligned with their interests often enough, they can be taught what to want. That is, they may come to wrongly believe that these are their authentic interests, and it may be difficult to see the world any other way. (Similar to Chomsky and Herman’s [not Lippman's] arguments about “manufacturing consent.”)

There is nothing inherent in the technologies of algorithmic allocation that is doing this to us; instead, the economic organization of the system is producing these pressures. In fact, we could design a system to support our authentic interests, but we would then need to fund it. (Thanks, late capitalism!)

To conclude, let’s get some historical perspective. What are the other options, anyway? If cultural selection is governed by computer algorithms now, you might answer, “who cares?” It’s always going to be governed somehow. If I said in a talk about “algorithmic culture” that I don’t like the Netflix recommender algorithm, what is supposed to replace it?

This all sounds pretty bad, so you might think I am asking for a return to “pre-algorithmic” culture: Let’s reanimate the corpse of Louis B. Mayer and he can decide what I watch. That doesn’t seem good either and I’m not recommending it. We’ve always had selection systems and we could even call some of the earlier ones “algorithms” if we want to.  However, we are constructing something new and largely unprecedented here and it isn’t ideal. It isn’t that I think algorithms are inherently dangerous, or bad — quite the contrary. To me this seems like a case of squandered potential.

With algorithmic culture, computers and algorithms are allowing a new level of real-time personalization and content selection on an individual basis that just wasn’t possible before. But rather than use these tools to serve our authentic interests, we have built a system that often serves a commercial interest that is often at odds with our interests — that’s corrupt personalization.

If I use the dominant forms of communication online today (Facebook, Google, Twitter, YouTube, etc.) I can expect content customized for others to use my name and my words without my consent, in ways I wouldn’t approve of. Content “personalized” for me includes material I don’t want, and obscures material that I do want. And it does so in a way that I may not be aware of.

This isn’t an abstract problem like a long-term threat to democracy, it’s more like a mugging — or at least a confidence game or a fraud. It’s violence being done to you right now, under your nose. Just click “like.”

In answer to your question, dear student, that’s my first danger.

* * *

ADDENDUM:

This blog post is already too long, but here is a TL;DR addendum for people who already know about all this stuff.

I’m calling this corrupt personalization because I can’t just apply Baker’s excellent ideas about corrupt segments — the world has changed since he wrote them. Although this post’s reasoning is an extension of Baker, it is not a straightforward extension.

Algorithmic attention is a big deal because we used to think about media and identity using categories, but the algorithms in wide use are not natively organized that way. Baker’s ideas were premised on the difference between authentic and inauthentic categories (“segments”), yet segments are just not that important anymore. Bermejo calls this the era of post-demographics.

Advertisers used to group demographics together to make audiences comprehensible, but it may no longer be necessary to buy and sell demographics or categories as they are a crude proxy for purchasing behavior. If I want to sell a Subaru, why buy access to “Brite Lights, Li’l City” (My PRIZM marketing demographic from the 1990s) when I can directly detect “intent to purchase a station wagon” or “shopping for a Subaru right now”? This complicates Baker’s idea of authentic segments quite a bit. See also Gillespie’s concept of calculated publics.

Also Baker was writing in an era where content was inextricably linked to advertising because it was not feasible to decouple them. But today algorithmic attention sorting has often completely decoupled advertising from content. Online we see ads from networks that are based on user behavior over time, rather than what content the user is looking at right now. The relationship between advertising support and content is therefore more subtle than in the previous era, and this bears more investigation.

Okay, okay I’ll stop now.

* * *

(This is a cross-post from Multicast.)

New Report Released: Few Legal Remedies for Victims of Online Harassment

June 10, 2014

For the last year, I’ve been working with Fordham’s Center on Law and Information Policy to research what legal remedies are available to victims of online harassment. We investigated cyberharassment law, cyberstalking law, defamation law, hate speech, and cyberbullying statutes. We found that although online harassment and hateful speech is a significant problem, there are few legal remedies for victims.

Report Highlights

  • Section 230 of the Communications Decency Act provides internet service providers (including social media sites, blog hosting companies, etc.) with broad immunity from liability for user-generated content.
  • Given limited resources, law enforcement personnel prioritize other cases over
    prosecuting internet-related issues.
  • Similarly, there are often state jurisdictional issues which make successful prosecution
    difficult, as victim and perpetrator are often in different states, if not different countries.
  • Internet speech is protected under the First Amendment. Thus, state laws regarding online
    speech are written to comply with First Amendment protections, requiring fighting
    words, true threats, or obscene speech (which are not protected). This generally means
    that most offensive or obnoxious online comments are protected speech.
  • For an online statement to be defamatory, it must be provably false rather than a matter of
    opinion. This means that the specifics of language used in the case are extremely
    important.
  • While there are state laws for harassment and defamation, few cases have resulted in
    successful prosecution. The most successful legal tactic from a practical standpoint has
    been using a defamation or harassment lawsuit to reveal the identities of anonymous
    perpetrators through a subpoena to ISPs then settling. During the course of our research,
    we were unable to find many published opinions in which perpetrators have faced
    criminal penalties, which suggests that the cases are not prosecuted, they are not appealed
    when they are prosecuted, or that the victim settles out of court with the perpetrator and
    stops pressing charges.
  • In offline contexts, hate speech laws seem to only be applied by courts as penalty
    enhancements; we could locate no online-specific hate speech laws.
  • Given this landscape, the problem of online harassment and hateful speech is unlikely to
    be solved solely by victims using existing laws; law should be utilized in combination
    with other practical solutions.

The objective of the project is to provide a resource that may be used by the general public, and in particular, researchers, legal practitioners, Internet community moderators, and victims of harassment and hateful speech online. If you’re working on online harassment, cyberbullying, revenge porn, or a host of related issues, we hope this will be of service to you.

Also, read it to find out the difference between calling someone a “bitch” and a “skank” online, what a “true threat” is, and why students are probably at the most risk of being prosecuted for online speech acts.

Download the report from SSRN

What Is Music Technology For?

May 13, 2014

(x-posted on Super Bon!)

In late March and early April, I attended three events that together signal some interesting shifts in thinking about music technology and sound.  The first, a day-long symposium on March 24th I co-organized with Nancy Baym, was entitled “What Is Music Technology For?”  It came after a weekend-long instalment of MusicTechFest, which brings together people from the arts, industry, education and academe to talk about music technology.  For our more academically-focused event, we brought together humanists, social scientists, engineers, experimentalists, artists and policy activists (among others) to discuss our mutual interests and investments in music technology.  Rather than editing a collection that would come out two years from now, Nancy and I decided to try assembling a manifesto, a project that gave direction to the day and also helped us think in terms of common problems and goals.

The result is now available online at musictechifesto.net, and I encourage you to visit, read and sign.

That event was followed by two others which I think show at least a possibility for a sea change in how we talk about music technology and with whom.

The following weekend found me at the University of Maryland, for their “Sound+” conference.  I presented (a still-early version of) my work on Dennis Gabor and time-stretched audio, and listened to a wide range of papers from (mostly) English and literature scholars on sonic problems.  But of course Maryland is home to the Maryland Institute for Technology in the Humanities, and that combined with a critical mass of people interested in theory and interdisciplinarity meant we also had some conversations that looked outward, especially a roundtable on mutual sonic interests across the humanities and sciences at the end of the second day.

The weekend after that (4-5 April) found me at the Machine Fantasies conference at Tufts University (across town), which brought together musicologists, anthropologists, composers, engineers, artists and computer scientists to have conversations about what it means for machines to make music, and how we might think about both the pasts and the futures of music technology.

Combined with other events, like the huge MusDig conference at Oxford last summer, there seems to be a growing interest in working across established interdisciplinary boundaries.  In other words, while humanists and social scientists are used to talking with one another, and while engineers and computer scientists are used to talking with one another, there now seems to be a growing (and one hopes, critical) mass of people who want to work across intellectual and institutional boundaries.

Speaking as someone coming out of the humanities and “soft” or “critical” social sciences, this is a major change brought on, I think, by several concurrent developments (and keep in mind this is musings in a blog post, not a careful intellectual history):

1.  A renewed interest in making, probably heavily lubricated by the turn to the “digital humanities” in some fields, but also by a re-assessment of the role of critique.  A generation ago, I came up learning that to be critical required one to be separate.  But increasingly, we are seeing integration of critique with other scholarly modes. Anne Balsamo’s mapping of the technological imagination in Designing Culture captures this beautifully.

2.  A new openness to humanistic and interpretive approaches in the world of music engineering and science.  I can’t say that I know them to have been “closed” in previous generations–that may well not have been the case.  But I have personally spent the last 10 years or so in dialogue with people in a variety of scienc-y and engineering-y spheres of music technology design, development and research.  I have found a great deal of openness to and interest in the kinds of ideas in which I usually traffic, and what began really as a “study of” a group of people has evolved into a series of “collaborations with.”  To that end, and to provide a little institutional leverage (or play space), I have joined McGill’s Centre for Interdisciplinary Research in Music, Media and Technology (CIRMMT, pronounced “Kermit,” like the frog).

3.  Some of this may also be the result of changing institutional configurations and easy familiarity with tools. Two generations ago, when places like Stanford’s Center for Computer Research in Music and Acoustics were getting off the ground, to do anything with computers and music (or music and technology more broadly), you needed a space and resources, you needed specialized equipment, and you needed specialized knowledge.  Today, those tools are cheaper and more available than ever.  There is something lost when people aren’t heading over to the mainframe or computer lab and running into each other that way–common spaces are so central to interdisciplinarity.  But there is something gained when we all have an easy sense of the available tools, and some of our questions are beginning to converge.

4. Some of the theoretical concerns of humanists, like what it means to make or listen to music, what it means to be a musician or fan, what technology is or should be, how the various music industries ought to be organized, and what the nature of an instrument or instrumentality is–these questions are suddenly on the table and pressing issues for everyone.  The answers we come up with now can have practical impact as we imagine the next generation of music technologies, or worry after the increasingly precarious status of people who make their living from music or sound work. In other words, we are in the enviable–and impossible–position of having a lot of thinking to do, and having a chance to act on those thoughts.

These are exciting, challenging, messy and incomplete developments.  They hold a great deal of promise.  It is up to us to pop our heads up from our silos, to think big, and try to work together in different kinds of spaces to move some of these shared agendas forward.

A Manifesto For Music Technologists

May 13, 2014

March 21-23, we held the first Music Tech Fest in North America at Microsoft Research New England. It was a three-day bonanza of ideas spanning a mind-bending spectrum of ways to connect music and technology.

The day after, 21 scholars met for a symposium we called What is Music Technology For? Our goal was to write a manifesto. Today we are proud to announce the launch of the Manifesto. As we say on the about page:

Those at the symposium were motivated by a passion for music, a fascination with technology and culture, and a concern for how music technology is now developing. Recognizing the fertility of music technology as a subject that bridges computational, scientific, social scientific and humanistic approaches, we looked for common ground across those fields. We debated and developed a set of shared principles about the future of music technology.

Built from the notes of that day’s event, and revised together in the weeks that followed, this manifesto is the collaboratively-authored product of this meeting.

Read more about the manifesto and who was involved on the about page. We hope those of you with overlapping interests in music and in technology will sign on.

Adding the bling: The role of social media data intermediaries

May 7, 2014

Last month, Twitter announced the acquisition of Gnip, one of the main sources for social media data—including Twitter data. In my research I am interested in the politics of platforms and data flows in the social web and in this blog post I would like to explore the role of data intermediaries—Gnip in particular—in regulating access to social media data. I will focus on how Gnip regulates the data flows for social media APIs and how it capitalizes on these data flows. By turning the licensing of API access into a profitable business model, these data intermediaries have specific implications for social media research.

The history of Gnip

Gnip launched on July 1st, 2008 as a platform offering access to data from various social media sources. It was founded by Jud Valeski and MyBlogLog founder Eric Marcoullier as “a free centralized callback server that notifies data consumers (such as Plaxo) in real-time when there is new data about their users on various data producing sites (such as Flickr and Digg)” (Feld 2008). Eric Marcoullier’s background in blog service MyBlogLog is of particular interest as Gnip has taken core ideas behind the technical infrastructure of the blogosphere and has repurposed them for the social web.

MyBlogLog

MyBlogLog was a distributed social network for bloggers which allowed them to connect to their blog readers. From 2006-2008 I actively used MyBlogLog. I had a MyBlogLog widget in the sidebar of my blog displaying the names and faces of my blog’s latest visitors. As part of my daily blogging routine I checked out my MyBlogLog readers in the sidebar, visited unknown readers’ profile pages and looked at which other blogs they were reading. It was not only a way to establish a community around your blog, but you could also find out more about your readers and use it as a discovery tool to find new and interesting blogs. In 2007, MyBlogLog was acquired by Yahoo! and six months later founder Eric Marcoullier left Yahoo! while his technical co-founder Todd Sampson stayed on (Feld 2008). In February 2008, MyBlogLog added a new feature to their service which displayed “an activity stream of recent activities by all users on various social networks – blog posts, new photos, bookmarks on Delicious, Facebook updates, Twitter updates, etc.” (Arrington 2008). In doing so, they were no longer only focusing on the activities of other bloggers in the blogosphere but also including their activities on social media platforms and moving into the ‘lifestreaming’ space by aggregating social updates in a central space (Gray 2008). As a service originally focused on bloggers, they were expanding their scope to take the increasing symbiotic relationship between the blogosphere and social media platforms into account (Weltevrede & Helmond, 2012). But in 2010 MyBlogLog came to an end when Yahoo! shut down a number of services including del.icio.us and MyBlogLog (Gannes 2010).

Ping – Gnip

After leaving Yahoo! in 2007, MyBlogLog-founder Eric Marcoullier started working on a new idea which would eventually become Gnip. In two blog posts by Brad Feld from Foundry Group–an early Gnip investor–Feld provides insights into the ideas behind Gnip and its name. Gnip is ‘ping’ spelled backwards and Feld recounts how Marcoullier was “originally calling the idea Pingery but somewhere along the way Gnip popped out and it stuck (“meta-ping server” was a little awkward)” (Feld 2008). Ping is a central technique in the blogosphere that allows (blog) search engines and other aggregators to know when a blog has been updated. This notification system is built into blog software so that when you publish a new blog post, it automatically sends out a ping (an XML-RPC signal) that notifies a number of ping services that your blog has been updated. Search engines then poll these services to detect blog updates so that they can index these new blog posts. This means that search engines don’t have to poll the millions or billions of blogs out there for updates but that they only have to poll these central ping services. Ping solved a scalability issue of update notifications in the blogosphere because polling a very large number of blogs on a very frequent basis is impossible. Ping servers established themselves as “the backbone of the blogosphere infrastructure and are a crucially important piece of the real-time web” (Arrington 2005). In my MA thesis on the symbiotic relationship between blog software and search engines I describe how ping servers form an essential part of the blogosphere’s infrastructure because they act as centralizing forces in the distributed network of blogs that notify subscribers, aggregators and search engines of new content (Helmond 2008, 70). Blog aggregators and blog search engines could get fresh content from updated blogs by polling central ping servers instead of individual blogs.
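The ping mechanism described above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the code of any real ping service: the `PingServer` class, its method names and the example URLs are all hypothetical, and the method name `weblogUpdates.ping` is assumed here because it is the convention commonly used by blog software.

```python
import xmlrpc.client


# Hypothetical in-memory model of a central ping service; the class and its
# method names are illustrative, not taken from any real ping server.
class PingServer:
    def __init__(self):
        self._updates = []  # (sequence number, blog URL), oldest first
        self._seq = 0

    def receive(self, payload: str) -> int:
        """Accept an XML-RPC ping sent by blog software and record it."""
        (name, url), method = xmlrpc.client.loads(payload)
        if method != "weblogUpdates.ping":
            raise ValueError(f"unexpected method: {method}")
        self._seq += 1
        self._updates.append((self._seq, url))
        return self._seq

    def changes_since(self, seq: int) -> list[str]:
        """What a search engine polls: blogs updated after a given point."""
        return [url for s, url in self._updates if s > seq]


def build_ping(name: str, url: str) -> str:
    """Serialize the ping a blog sends out when a new post is published."""
    return xmlrpc.client.dumps((name, url), methodname="weblogUpdates.ping")


server = PingServer()
server.receive(build_ping("Blog A", "http://a.example/"))
server.receive(build_ping("Blog B", "http://b.example/"))

# One request to the ping server replaces one request per blog.
updated = server.changes_since(0)
```

The point of the sketch is the direction of traffic: each blog sends one small notification when it changes, and an aggregator asks a single server what changed instead of visiting every blog.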

APIs as the glue of the social web

Gnip sought to solve a scalability issue of the social web—third parties constantly polling social media platform APIs for new data— in a similar manner by becoming a central point for new content from social media platforms offering access to their data. Traditionally, social media platforms have offered (partial) access to their data to outsiders by using APIs, application programming interfaces. APIs can be seen as the industry-preferred method to gain access to platform data—in contrast to screen scraping as an early method to repurpose social media data (Helmond & Sandvig, 2010). Social media platforms can regulate data access through their APIs, for example by limiting which data is available and how much of it can be requested and by whom. APIs allow external developers to build new applications on top of social media platforms and they have enabled the development of an ecosystem of services and apps that make use of social media platform data and functionality (see also Bucher 2013). Think for example of Tinder, the dating app, which is built on top of the Facebook platform. When you install Tinder you have to log in with your Facebook account, after which the dating app finds matches based on proximity but also on shared Facebook friends and shared Facebook likes. Another example of how APIs are used is the practice of sharing content across various social media platforms using social buttons (Helmond 2013). APIs can be seen as the glue of the social web, connecting social media platforms and creating a social media ecosystem.

API overload

But the birth of this new “ecosystem of connective media” (van Dijck 2013) and its reliance on APIs (Langlois et al. 2009) came with technical growing pains:

Web services that became popular overnight had performance issues, especially when their APIs were getting hammered. The solution for some was to simply turn off specific services when the load got high, or throttle (limit) the number of API calls in a certain time period from each individual IP address (Feld 2008).

With the increasing number of third-party applications constantly requesting data, some platforms started to limit access or completely shut down API access. This had implications not only for developers building apps on top of platforms but also for the users of these platforms. Twitter implemented a limit of 70 requests per hour which also affected users. If you exceeded the 70 requests per hour—which also included tweeting, replying or retweeting—you were simply cut off. Actively live tweeting an event could easily exceed the imposed limit. In the words of Nate Tkacz, commenting on another user being barred from posting during a conference: “in this world, to be prolific, is to be a spammer.”
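To make the effect of such a limit concrete, here is a minimal sketch of a rate limiter in Python. It assumes a sliding one-hour window, which is only one possible way to count; how Twitter actually measured the hour is not documented in the sources cited here, and the class name is hypothetical.

```python
from collections import deque


class HourlyRateLimiter:
    """Sliding-window limiter: at most `limit` requests in any `window` seconds."""

    def __init__(self, limit: int = 70, window: float = 3600.0):
        self.limit = limit
        self.window = window
        self._stamps = deque()  # timestamps of accepted requests

    def allow(self, now: float) -> bool:
        # Forget requests that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False  # cut off until older requests expire
        self._stamps.append(now)
        return True


limiter = HourlyRateLimiter()
# Live tweeting: one request every 30 seconds for an hour (120 attempts).
results = [limiter.allow(now=30.0 * i) for i in range(120)]
accepted = sum(results)
```

Under these assumptions the first 70 requests go through and the remaining 50 are refused, so someone posting twice a minute is locked out after about 35 minutes.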


Collection of Twitter users commenting on Twitter’s rate limits. Slide from my 2012  API critiques lecture.

However, limiting the number of API calls, or shutting down API access did not fix the actual problem and affected users too. Gnip was created to address the issue of third parties constantly polling social media platform APIs for new data by bringing these different APIs together into one system (Feld 2008). Similar to central ping services in the blogosphere Gnip would become the central service to call social media APIs and to poll for new data: “Gnip plans to sit in the middle of this and transform all of these interactions back to many-to-one where there are many web services talking to one centralized service – Gnip” (Feld 2008). Instead of thousands of applications frequently calling individual social media platform APIs, they could now call a single API, the Gnip API, thereby reducing the API load for these platforms. Since its inception Gnip has acted as an intermediary of social data and it was specifically designed “to sit in between social networks and other web services that produce a lot of user content and data (like Digg, Delicious, Flickr, etc.) and data consumers (like Plaxo, SocialThing, MyBlogLog, etc.) with the express goal of reducing API load and making the services more efficient” (Arrington 2008). In a blog post on TechCrunch, covering the launch of Gnip, author Nik Cubrilovic explains in detail how Gnip functions as “a web services proxy to enable consuming services to easily access user data from a variety of sources:”

A publisher can either push data to Gnip using their API’s, or Gnip can poll the latest user data. For consumers, Gnip offers a standards-based API to access all the data across the different publishers. A key advantage of Gnip is that new events are pushed to the consumer, rather than relying on the consuming application to poll the publishers multiple times as a way of finding new events. For example, instead of polling Digg every few seconds for a new event for a particular user, Gnip can ping the consuming service – saving multiple round-trip API requests and resolving a large-scale problem that exists with current web services infrastructure. With a ping-based notification mechanism for new events via Gnip the publisher can be spared the load of multiple polling requests from multiple consuming applications (Cubrilovic 2008).
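The difference between polling and push that Cubrilovic describes can be sketched as follows. The `Hub` class is a hypothetical stand-in for an intermediary such as Gnip, not a model of its actual API, and the consumer count and polling interval are made-up numbers chosen only to show the arithmetic.

```python
from collections import defaultdict
from typing import Callable


class Hub:
    """Hypothetical stand-in for an intermediary such as Gnip."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # publisher name -> callbacks

    def subscribe(self, publisher: str, callback: Callable[[dict], None]) -> None:
        self._subscribers[publisher].append(callback)

    def publish(self, publisher: str, event: dict) -> int:
        """Called once by the publisher; the hub fans the event out."""
        for callback in self._subscribers[publisher]:
            callback(event)
        return len(self._subscribers[publisher])


hub = Hub()
inbox_a, inbox_b = [], []
hub.subscribe("digg", inbox_a.append)
hub.subscribe("digg", inbox_b.append)

delivered = hub.publish("digg", {"user": "alice", "action": "dugg a story"})

# Polling for comparison: every consumer asks the publisher on every tick,
# whether or not anything changed.
consumers, ticks = 2, 60
polling_requests = consumers * ticks  # requests that hit the publisher
push_requests = 1                     # a single publish call to the hub
```

With two consumers polling once per tick for 60 ticks, the publisher answers 120 requests to deliver one event; with push it makes a single call and the intermediary carries the fan-out.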

Gnip launched as a central service offering access to a great number of popular APIs from platforms including Digg, Flickr, del.icio.us, MyBlogLog, Six Apart and more. At launch, technology blog ReadWrite described the new service as “the grand central station and universal translation service for the new social web” (Kirkpatrick 2008).

Gnip’s business model as data proxy

Gnip regulates the data flows between various social media platforms and social media data consumers by licensing access to these data flows. In September 2008, a few months after the initial launch, Gnip launched its “2.0” version which no longer required data consumers to poll for new data with Gnip, but instead, new data would be pushed to them in real-time (Arrington 2008). While Gnip initially launched as a free service, the new version also came with a freemium business model:

Gnip’s business model is freemium – lots of data for free and commercial data consumers pay when they go over certain thresholds (non commercial use is free). The model is based on the number of users and the number of filters tracked. Basically, any time a service is tracking more than 10,000 people and/or rules for a certain data provider, they’ll start paying at a rate of $0.01 per user or rule per month, with a maximum payment of $1,000 per month for each data provider tracked (Arrington 2008).
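The quoted terms can be worked through as a small calculation. The sketch below rests on one loudly stated assumption: once a commercial consumer passes the 10,000 threshold, every tracked user or rule is billed, not only the excess. The quote leaves this open, so the function is an illustration of one reading rather than Gnip's actual billing logic, and all names in it are hypothetical.

```python
FREE_THRESHOLD = 10_000        # users and/or rules tracked per data provider
RATE_CENTS = 1                 # $0.01 per user or rule per month
MONTHLY_CAP_CENTS = 100_000    # $1,000 per data provider per month


def monthly_fee_cents(tracked: int, commercial: bool = True) -> int:
    """Fee in cents for one data provider under the quoted freemium terms.

    Assumption: once past the threshold, every tracked item is billed.
    """
    if not commercial or tracked <= FREE_THRESHOLD:
        return 0
    return min(tracked * RATE_CENTS, MONTHLY_CAP_CENTS)
```

Under this reading a service tracking 50,000 users on one provider would pay $500 a month, and the $1,000 cap is reached at 100,000 tracked users or rules, so the heaviest commercial consumers pay a flat fee per provider.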

Gnip connects to various social media platform APIs and then licenses access to this data through the single Gnip API. In doing so Gnip has turned data reselling—besides advertising—into a profitable business model for the social web, not only for Gnip itself but also for social media platforms that make use of Gnip. I will continue by briefly discussing Gnip and Twitter’s relationship before discussing the implications of this emerging business model for social media researchers.

Gnip and Twitter

Gnip and Twitter’s relationship goes back to 2008 when Twitter decided to open up its data stream by giving Gnip access to the Twitter XMPP “firehose” which sent out all of Twitter’s data in a realtime data stream (Arrington 2008). At Gnip’s launch Twitter was not part of the group of platforms offering access to their data. A week after the launch Eric Marcoullier explained “That Twitter Thing” to its users—who were asking for Twitter data—by explaining that Gnip was still waiting for access to Twitter’s data and by outlining how Twitter could benefit from doing so. Only a week later Twitter gave Gnip access to their resource-intensive XMPP “firehose” thereby shifting the infrastructural load, that it was suffering from, to Gnip. With this data access deal Gnip and Twitter became unofficial partners. In October 2008 Twitter outlined the different ways to get data into and out of Twitter for developers and hinted at giving Gnip access to its full data, including meta-data, which until then had been on an experimental basis. It wasn’t until 2010 that their partnership with experimental perks became official.

In 2010 Gnip became Twitter’s first authorized data reseller offering access to “the Halfhose (50 percent of Tweets at a cost of $30,000 per month), the Decahose (10 percent of Tweets for $5,000 per month) and the Mentionhose (all mentions of a user including @replies and re-Tweets for $20,000 per month)” (Gannes 2010). Notably absent is the so-called ‘firehose,’ the real-time stream of all tweets. Twitter previously sold access to the firehose to Google ($15 million) and Microsoft ($10 million) in 2009. Before the official partnership announcement with Gnip, Twitter’s pricing model for granting access to data had been rather arbitrary since “‘Twitter is focused on creating consumer products and we’re not built to license data,’ Williams said, adding, ‘Twitter has always invested in the ecosystem and startups and we believe that a lot of innovation can happen on top of the data. Pricing and terms definitely vary by where you are from a corporate perspective’” (Gannes 2010). In this interview Evan Williams states that Twitter was never built for licensing data, which may be a reason they entered into a relationship with Gnip in the first place. In contrast to Twitter, Gnip’s infrastructure was built to regulate API traffic which at the same time enables the monetization of licensing access to the data available through APIs. This became even clearer in August 2012 when Twitter announced a new version of its API which came with new and stricter rate limiting (Sippey 2012). The new restrictions imposed through the Twitter API version 1.1 meant that developers could request less data which affected third-party clients for Twitter (Warren 2012).

Two weeks later Twitter launched its “Certified Products Program,” which focused on three product categories: engagement, analytics and data resellers, the last including Gnip (Lardinois 2012). With the introduction of Certified Products shortly after the new API restrictions, Twitter made clear that large-scale access to Twitter data had to be bought. In a blog post addressing the changes in the new Twitter API v1.1, Gnip’s product manager Adam Tornes calculates that the new restrictions come down to 80% less data (Tornes 2013). In the same post he also promotes Gnip as the paid-for solution:

Combined with the existing limits to the number of results returned per request, it will be much more difficult to consume the volume or levels of data coverage you could previously through the Twitter API. If the new rate limit is an issue, you can get full coverage commercial grade Twitter access through Gnip which isn’t subject to rate limits (Tornes 2013).

In February 2012 Gnip announced that it would become the first authorized reseller of “historical” Twitter data (the past 30 days). This marked another important moment in Gnip and Twitter’s business relationship, and was followed in October by the announcement that Gnip would offer full access to historical Twitter data.

Twitter’s business model: Advertising & data licensing

The new API and the Certified Products Program point towards a shift in Twitter’s business model: access to large-scale Twitter data now runs through intermediaries such as analytics companies and data resellers.

Despite Williams’ statement that Twitter wasn’t built for licensing data, the company had already been making some money by selling access to its firehose, as described above. However, Twitter’s main source of income has always been advertising: “Twitter is an advertising business, and ads make up nearly 90% of the company’s revenue” (Edwards 2014). While Twitter’s current business model relies on advertising, data licensing as a source of income is growing steadily: “In 2013, Twitter got $70 million in data licensing payments, up 48% from the year before” (Edwards 2014).

Using social media data for research

If we are moving towards the licensing of API access as a business model, then what does this mean for researchers working with social media data? Gnip is only one of four data intermediaries offering access to Twitter’s firehose, together with DataSift, Dataminr and Topsy (now owned by Apple, an indicator of big players buying up the middleman market of data). Additionally, Gnip (now owned by Twitter) and Topsy also offer access to the historical archive of all tweets. What are the consequences of intermediaries for researchers working with Twitter data? boyd & Crawford (2011) and Bruns & Stieglitz (2013) have previously addressed the issues that researchers face when working with APIs. With the introduction of data intermediaries, data access has become increasingly hard to come by, since “full” access is often no longer available from the original source (the social media platform) but only through intermediaries at a hefty price.

Two months before Twitter acquired Gnip, the two companies announced a partnership in a new Data Grants program that would give a small selection of academic researchers access to all Twitter data. However, to apply for the grants program you had to accept their “Data Grant Submission Agreement v1.0.” Researcher Eszter Hargittai critically investigated the conditions of getting access to data for research and raised some important questions about the relationship between Twitter and researchers in her blog post ‘Wait, so what do you still own?’

Even if we gain access to an expensive resource such as Gnip, the intermediaries also point to a further obfuscation of the data we are working with. The application programming interface (API), as the name already indicates, provides an interface to the data, which makes explicit that we are always “interfacing” with the data and never have access to the “raw” data. In “Raw Data is an Oxymoron,” edited by Lisa Gitelman, Bowker reminds us that data is never “raw” but always “cooked” (2013, p. 2). Social media intermediaries play an important role in “cooking” data. Gnip “cooks” its data by “Adding the Bling,” which refers to the addition of extra metadata to Twitter data. These so-called “Enrichments” include geo-data enrichments which “adds a new kind of Twitter geodata from what may be natively available from social sources.” In other words, Twitter data is enriched with data from other sources such as Foursquare logins.

For researchers, working with social media data intermediaries also requires new skills and new ways of thinking through data by seeing social media data as relational. Social media data are not only aggregated and combined but also instantly cooked through the addition of “bling.”

Acknowledgements

I would like to thank the Social Media Collective and visiting researchers for providing feedback on my initial thoughts behind this blogpost during my visit from April 14-18 at Microsoft Research New England. Thank you Kate Crawford, Nancy Baym, Mary Gray, Kate Miltner, Tarleton Gillespie, Megan Finn, Jonathan Sterne, Li Cornfeld as well as my colleague Thomas Poell from the University of Amsterdam.

Cross-posted from my own blog

SMC is hiring a Research Assistant!

May 1, 2014

UPDATE: At this time we have a great pool for 2014 and are no longer accepting applications.

—-
Microsoft Research (MSR) is looking for a Research Assistant for its Social Media Collective in the New England lab, based in Cambridge, Massachusetts. The Social Media Collective consists of Nancy Baym, Mary Gray, Jessa Lingel, and Kevin Driscoll in Cambridge, and Kate Crawford and danah boyd in New York City, as well as faculty visitors and Ph.D. interns. The RA will be working directly with Nancy Baym, Kate Crawford and Mary Gray.

An appropriate candidate will be a self-starter who is passionate and knowledgeable about the social and cultural implications of technology. Strong skills in writing, organisation and academic research are essential, as are time-management and multi-tasking. Minimal qualifications are a BA or equivalent degree in a humanities or social science discipline and some qualitative research training.

Job responsibilities will include:
– Sourcing and curating relevant literature and research materials
– Producing literature reviews and/or annotated bibliographies
– Coding ethnographic and interview data
– Editing manuscripts
– Working with academic journals on themed sections
– Assisting with research project and event organization

The RA will also get to collaborate on ongoing research and, while publication is not a guarantee, the RA will be encouraged to co-author papers while at MSR. The RAship will require 40 hours per week on site in Cambridge, MA, and remote collaboration with the researchers in the New York City lab. It is a 1-year only contractor position, paid hourly with flexible daytime hours. The start date will ideally be in late June, although flexibility is possible for the right candidate.

This position is ideal for junior scholars who will be applying to PhD programs in Communication, Media Studies, Sociology, Anthropology, Information Studies, and related fields and want to develop and hone their research skills before entering a graduate program. Current New England-based MA/PhD students are welcome to apply provided they can commit to 40 hours of on-site work per week.

To apply, please send an email to Nancy Baym (baym@microsoft.com) with the subject “RA Application” and include the following attachments:

– One-page (single-spaced) personal statement, including a description of research experience, interests, and professional goals
– CV or resume
– Writing sample (preferably a literature review or a scholarly-styled article)
– Links to online presence (e.g., blog, homepage, Twitter, journalistic endeavors, etc.)
– The names and emails of two recommenders

We will begin reviewing applications on May 12 and will continue to do so until we find an appropriate candidate.

Please feel free to ask questions about the position in the comments! I have answered a couple of the most common ones there already.
