Heading to the Courthouse for Sandvig v. Sessions

E._Barrett_Prettyman_Federal_Courthouse,_DC

(or: Research Online Should Not Be Illegal)

I’m a college professor. But on Friday morning I won’t be in the classroom, I’ll be in courtroom 30 in the US District Courthouse on Constitution Avenue in Washington DC. The occasion? Oral arguments on the first motion in Sandvig v. Sessions.

You may recall that the ACLU, academic researchers (including me), and journalists are bringing suit against the government to challenge the constitutionality of “The Worst Law in Technology” — the US law that criminalizes most online research. Our hopes are simple: Researchers and reporters should not fear prosecution or lawsuits when we seek to obtain information that would otherwise be available to anyone, by visiting a Web site, recording the information we see there, and then publishing research results based on what we find.

As things stand, the misguided US anti-hacking law, called the Computer Fraud and Abuse Act (CFAA), makes it a crime if a computer user “exceeds authorized access.” What is authorized access to a Web site? Previous court decisions and the federal government have defined it as violating the site’s own stated “Terms of Service” (ToS), but that’s ridiculous. The ToS is a wish-list of what corporate lawyers dream about, written by corporate lawyers. (Crazy example, example, example.) ToS sometimes prohibit people from using Web sites for research; they prohibit users from saying bad things about the corporation that runs the Web site; they prohibit users from writing things down. They should not be made into criminal violations of the law.

In the latest developments of our case, the government has argued that Web servers are private property, and that anyone who exceeds authorized access is trespassing “on” them. (“In” them? “With” them? It’s a difficult metaphor.) In other cases the CFAA was used to say that because Web servers are private, users are also wasting capacity on these servers, effectively stealing a server’s processing cycles that the owner would rather use for other things. I visualize a cartoon thief with a bag of electrons.

Are Internet researchers and data journalists “trespassing” and “stealing”? These are the wrong metaphors. Lately I’ve been imagining what would have happened in the world of print if the CFAA’s metaphors had been our guide back when the printing press was invented.

If you picked up a printed free newspaper like Express, the Metro, or the Chicago Reader at a street corner and the CFAA applied to it, there would be a lengthy “Terms of Readership” printed on an inside page in very small type. Since these are advertising-supported publications, it would say that people who belong to undesirable demographics are trespassing on the printed page if they attempt to read it. After all, the newspaper makes no money from readers who are not part of a saleable advertising audience. In fact, since the printing presses are private property, unwanted readers are stealing valuable ink and newsprint that should be reserved for the paper’s intended readers. To cover all the bases, readers would be forbidden from writing anything based on what they read in the paper if the paper’s owners wouldn’t like it. And readers could be sued by the newspaper or prosecuted by the federal government if they did any of these things. The scenario sounds foolish and overblown, but it’s the way that Web sites work now under the CFAA.

Another major government argument has been that we researchers and journalists have nothing to be concerned about because prosecutors will use this law with the appropriate discretion. Any vagueness is OK because we can trust them. Concern by researchers and reporters is groundless.

Yet federal prosecutors have a terrible record when it comes to the CFAA. And the idea that online platforms want to silence research and journalism is not speculative. After our lawsuit was filed, the Streaming Heritage research team funded by the Swedish Research Council (similar to the US National Science Foundation) received shocking news: Spotify’s lawyers had contacted the Research Council and asked the council to take “resolute action” against the project, suggesting it had violated “applicable law.” Professors Snickars, Vonderau, and others were studying the Spotify platform. What “law” did Spotify claim was being violated? The site’s own Terms of Service. (Here’s a description of what happened. Note: It’s in Swedish.)

This demand occurred just after a member of the research team appeared in a news story that characterized Spotify in a way that Spotify apparently did not like. Luckily, Sweden does not have the CFAA, and terms of service there do not hold the force of law. The Research Council repudiated Spotify’s claim that research studying private platforms was unethical and illegal if it violated the terms of service. Researchers and journalists in other countries need the same protection.

More Information

The full text of the motions in the case is available on the ACLU Web site. In our most recent filing there is an excellent summary of the case and the issues, starting on p. 6. You do not need to read the earlier filings for this to make sense.

There was a burst of news coverage when our lawsuit was filed. Standout pieces include the New Yorker’s “How an Old Hacking Law Hampers the Fight Against Online Discrimination” and “When Should Hacking Be Legal?” in The Atlantic.

The ACLU’s Rachel Goodman has recently published a short summary of how to do research under the shadow of the CFAA. It is titled as a tipsheet for “Data Journalism” but it applies equally well to academic researchers. A longer version co-authored with Esha Bhandari is also available.

(Note that I filed this lawsuit as a private citizen and it does not involve my university.)

IMAGE CREDIT: AgnosticPreachersKid via Wikimedia Commons

Why I Am Suing the Government — Update

[This is an old post. SEE ALSO: The most recent blog post about this case.]

Last month I joined other social media researchers and the ACLU to file a lawsuit against the US Government to protect the legal right to conduct online research. This is newly relevant today because a community of devs interested in public policy started a petition in support of our court case. It is very nice of them to make this petition. Please consider signing it and sharing this link.

PETITION: Curiosity is (not) a crime
http://slashpolicy.com/petition/curiosity-is-not-a-crime/


For more context, see last month’s post: Why I Am Suing the Government.

 

Why I Am Suing the Government

(or: I write scripts, bots, and scrapers that collect online data)

[This is an old post. SEE ALSO: The most recent blog post about this case.]

I never thought that I would sue the government. The papers went in on Wednesday, but the whole situation still seems unreal. I’m a professor at the University of Michigan and a social scientist who studies the Internet, and I ran afoul of what some have called the most hated law on the Internet.

Others call it the law that killed Aaron Swartz. It’s more formally known as the Computer Fraud and Abuse Act (CFAA), the dangerously vague federal anti-hacking law. The CFAA is so broad, you might have broken it. The CFAA has been used to indict a MySpace user for adding false information to her profile, to convict a non-programmer of “hacking,” to convict an IT administrator of deleting files he was authorized to access, and to send a dozen FBI agents to the house of a computer security researcher with their guns drawn.

Most famously, prosecutors used the CFAA to threaten Reddit co-founder and Internet activist Aaron Swartz with 50 years in jail for an act of civil disobedience — his bulk download of copyrighted scholarly articles. Facing trial, Swartz hanged himself at age 26.

The CFAA is alarming. Like many researchers in computing and social science, I write scripts, bots, and scrapers that collect online data as a normal part of my work. I routinely teach my students how to do it in my classes. Now that all sorts of activities have moved online — from maps to news to grocery shopping — studying people now means studying people online and thus gathering online data. It’s essential.
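To make concrete how ordinary this activity is, here is a minimal sketch of the kind of script involved, assuming the requests and BeautifulSoup libraries; the URL and the CSS selector are placeholders for illustration, not part of any real study design.

```python
# A minimal, hypothetical scraper: fetch a public page, pull out each listing,
# and save the results for later analysis. The URL and the ".listing" selector
# are placeholders, not a real site or study.
import csv
import requests
from bs4 import BeautifulSoup

def collect_listings(url):
    """Download one public page and return the visible text of each listing."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [item.get_text(strip=True) for item in soup.select(".listing")]

if __name__ == "__main__":
    rows = collect_listings("https://example.com/listings")
    with open("listings.csv", "w", newline="", encoding="utf-8") as outfile:
        csv.writer(outfile).writerows([text] for text in rows)
```

Nothing in a script like that is exotic; it reads public pages much as a browser does, which is part of why treating it as “hacking” seems so strange.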

Les raboteurs de parquet (cropped)

Image: Les raboteurs de parquet by Gustave Caillebotte (cropped)
SOURCE: Wikipedia

Yet federal charges were brought against someone who was downloading publicly available Web pages.

People might think of the CFAA as a law about hacking with side effects that are a problem for computer security researchers. But the law affects anyone who does social research, or who needs access to public information. 

I work at a public institution. My research is funded by taxes and is meant for the greater good. My results are released publicly. Lately, my research designs have been investigating illegal fraud and discrimination online, evils that I am trying to stop. But the CFAA made my research designs too risky. A chief problem is that any clause in a Web site’s terms of service can become enforceable under the CFAA.

I found that crazy. Have you ever read a terms of service agreement? Verizon’s terms of service prohibited anyone using a Verizon service from saying bad things about Verizon. As it says in the legal complaint, some terms of service prohibit you from writing things down (as in, with a pen) if you see them on a particular — completely public — Web page.

These terms of service aren’t laws, they’re statements written by Web site owners describing what they’d like to happen if they ran the universe. But the current interpretation of the CFAA says that we must judge what is authorized on the Web by reading a site’s terms of service to see what has been prohibited. If you violate the terms of service, the current CFAA mindset is: you’re hacking.

That means anything a Web site owner writes in the terms of service effectively becomes the law, and these terms can change at any time.

Did you know that terms of service can expressly prohibit the use of a Web site by researchers? Sites effectively prohibit research by simply outlawing any saving or republication of their contents, even if they are public Web pages. Dice.com forbids “research or information gathering,” while LinkedIn says you can’t “copy profiles and information of others through any means” including “manual” means. You also can’t “[c]ollect, use, copy, or transfer any information obtained from LinkedIn,” or “use the information, content or data of others.” (This raises the question: How would the intended audience possibly use LinkedIn and follow these rules? Memorization?)

As a researcher, I was appalled by the implications once they sank in. The complaint I filed this week has to do with my research on anti-discrimination laws, but it is not too broad to say this: The CFAA, as things stand, potentially blocks all online research. Any researcher who uses information from Web sites could be at risk under the provision we are challenging in our lawsuit. That’s why others have called this case “key to the future of social science.”

If you are a researcher and you think other researchers would be interested in this information, please share it. We need to get the word out that the present situation is untenable.

NEW: There is now an online petition started by a cool group of policy-minded devs on our behalf. Please consider signing and sharing it.

The ACLU is providing my legal representation, and in spirit I feel that they have taken this case on behalf of all researchers and journalists. If you care about this issue and you’d like to help, I urge you to contribute.

 

Want more? Here is an Op-Ed that I co-authored with my co-plaintiff Prof. Karrie Karahalios:

Most of what you do online is illegal. Let’s end the absurdity.
https://www.theguardian.com/commentisfree/2016/jun/30/cfaa-online-law-illegal-discrimination

Here is the legal complaint:

Sandvig v. Lynch
https://www.aclu.org/legal-document/sandvig-v-lynch-complaint

Here is a press release about the lawsuit:

ACLU Challenges Law Preventing Studies on “Big Data” Discrimination
https://www.aclu.org/news/aclu-challenges-law-preventing-studies-big-data-discrimination

Here is some of the news coverage:

Researchers Sue the Government Over Computer Hacking Law
https://www.wired.com/2016/06/researchers-sue-government-computer-hacking-law/

New ACLU lawsuit takes on the internet’s most hated hacking law
http://www.theverge.com/2016/6/29/12058346/aclu-cfaa-lawsuit-algorithm-research-first-amendment

Do Housing and Jobs Sites Have Racist Algorithms? Academics Sue to Find Out
http://arstechnica.com/tech-policy/2016/06/do-housing-jobs-sites-have-racist-algorithms-academics-sue-to-find-out/

When Should Hacking Be Legal?
http://www.theatlantic.com/technology/archive/2016/07/when-should-hacking-be-legal/489785/

Please note that I have filed suit as a private citizen and not as an employee of the University.

[Updated on 7/2 with additional links.]

[Updated on 8/3 with the online petition.]

 

How Do Users Take Collective Action Against Online Platforms? CHI Honorable Mention

What factors lead users in an online platform to join together in mass collective action to influence those who run the platform? Today, I’m excited to share that my CHI paper on the reddit blackout has received a Best Paper Honorable Mention! (Read the pre-print version of my paper here)

When users of online platforms complain, we’re often told to leave if we don’t like how a platform is run. Beyond exit or loyalty, digital citizens sometimes take a third option, organizing to pressure companies for change. But how does that come about?

I’m seeking reddit moderators to collaborate on the next stage of my research: running experiments together with subreddits to test theories of moderation. If you’re interested, you can read more here. Also, I’m presenting this work as part of larger talks at the Berkman Center on Feb 23 and the Oxford Internet Institute on March 16. I would love to see you there!

Having a formalized voice with online platforms is rare, though it has happened with San Francisco drag queens, the newly-announced Twitter Trust and Safety Council or the EVE player council, where users are consulted about issues a platform faces. These efforts typically keep users in positions of minimal power on the ladder of citizen participation, but they do give some users some kind of voice.

Another option is collective action, leveraging the collective power of users to pressure a platform to change how that platform works. To my knowledge, this has only happened four times on major U.S. platforms: when AOL community leaders settled a $15 million class action lawsuit for unpaid wages, when DailyKos writers went on strike in 2008, the recent Uber class action lawsuit, and the reddit blackout of July 2015, when moderators of 2,278 subreddits shut down their communities to pressure the company for better coordination and better moderation tools. They succeeded.

What factors lead communities to participate in such a large scale collective action? That’s the question that my paper set out to answer, combining statistics with the “thick data” of qualitative research.

The story of how I answered this question is also a story about finding ways to do large-scale research that include the voices and critiques of the people whose lives we study as researchers. In the turmoil of the blackout, amidst volatile and harmful controversies around hate speech, harassment, censorship, and the blackout itself, I made special effort to do research that included redditors themselves.

Theories of Social Movement Mobilization

Social movement researchers have been asking how movements come together for many decades, and there are two common schools, responding to early work to quantify collective action (see Olson, Coleman):

Political Opportunity Theories argue that social movements need the right people and the right moment. According to these theories, a movement happens when grievances are high, when social structure among potential participants is right, and when the right opportunity for change arises. For more on political opportunity theory, see my Atlantic article on the Facebook Equality Meme this past summer.

Resource Mobilization Theories argue that successful movements are explained less by grievances and opportunities and more by the resources available to movement actors. In their view, collective action is something that groups create out of their resources rather than something that arises out of grievances. They’re also interested in social structure, often between groups that are trying to mobilize people (read more).

A third voice in these discussions is that of the people who participate in movements themselves, voices that I wanted to have a primary role in shaping my research.

How Do You Study a Strike As It Unfolds?

I was lucky enough to be working with moderators and collecting data before the blackout happened. That gave me a special vantage for combining interviews and content analysis with statistical analysis of the reddit blackout.

Together with redditors, I developed an approach of “participatory hypothesis testing,” where I posed ideas for statistics on public reddit threads and worked together with redditors to come up with models that they agreed were a fair and accurate analysis of their experience. Grounding that statistical work involved a whole lot of qualitative research as well.

If you like that kind of thing, here are the details:

In the CHI paper, I analyzed 90 published interviews with moderators from before the blackout, over 250 articles outside reddit about the blackout, discussions in over 50 subreddits that declined to join the blackout, public statements by over 200 subreddits that joined the blackout, and over 150 discussions in blacked out subreddits after their communities were restored. I also read over 100 discussions in communities that chose not to join. Finally, I conducted 90-minute interviews with 13 moderators of subreddits of all sizes, including those that joined and declined to join the blackout.

To test hypotheses developed with redditors, I collected data from 52,735 non-corporate subreddits that received at least one comment in June 2015, alongside a list of blacked-out subreddits. I also collected data on moderators and comment participation for the period surrounding the blackout.
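For readers curious about the mechanics, here is a minimal sketch, with hypothetical variable names, of the kind of logistic regression this data supports: predicting whether a subreddit joined the blackout from its characteristics. It illustrates the general approach, not the actual model specification from the paper.

```python
# A sketch of the modeling step, assuming pandas and statsmodels and a table
# with one row per subreddit. All column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

subreddits = pd.read_csv("subreddits.csv")  # hypothetical export of the collected data

# Logistic regression: did the subreddit join the blackout (1) or not (0)?
blackout_model = smf.logit(
    "joined_blackout ~ log_subscribers + mod_count + is_default"
    " + mod_workload + mod_isolation + mods_in_metareddits",
    data=subreddits,
).fit()
print(blackout_model.summary())
```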

So What’s The Answer? What Factors Predict Participation in Action Against Platforms?

In the paper, I outline major explanations offered by moderators and translate them into a statistical model that corresponds to major social movement theories. I found evidence confirming many of redditors’ explanations across all subreddits, including aspects of classic social movement theories. These findings are as much about why people choose *not* to participate as they are about what factors are involved in joining:

    • Moderator Grievances were important predictors of participation. Subreddits with greater amounts of work, and whose work was riskier, were more likely to join the blackout.
    • Subreddit Resources were also important factors. Subreddits with more moderators were more likely to join the blackout. Although “default” subreddits played an important role in organizing and negotiating in the blackout, they were no more or less likely to participate, holding all else constant.
    • Relations Among Moderators were also important predictors, and I observed several cases where “networks” of closely-allied subreddits declined to participate.
    • Subreddit Isolation was also an important factor, with more isolated subreddits less likely to join, and moderators who participate in “metareddits” more likely to join.
    • Moderators’ Relations Within Their Groups were also important; subreddits whose moderators participated more in their groups were less likely to join the blackout.

Many of my findings go into details from my interviews and observations, well beyond just a single statistical model; I encourage you to read the pre-print version of my paper.

What’s Next For My reddit Research?

The reddit blackout took me by surprise as much as anyone, so now I’m back to asking the questions that brought me to moderators in the first place:

THANK YOU REDDIT! & Acknowledgments

CHI_Banner

First of all, THANK YOU REDDIT! This research would not have been possible without generous contributions from hundreds of reddit users. You have been generous all throughout, and I deeply appreciate the time you invested in my work.

Many other people have made this work possible; I did this research during a wonderful summer internship at the Microsoft Research Social Media Collective, mentored by Tarleton Gillespie and Mary Gray. Mako Hill introduced me to social movement theory as part of my general exams. Molly Sauter, Aaron Shaw, Alex Leavitt, and Katherine Lo offered helpful early feedback on this paper. My advisor Ethan Zuckerman remains a profoundly important mentor and guide through the world of research and social action.

Finally, I am deeply grateful for family members who let me ruin our Fourth of July weekend to follow the reddit blackout closely and set up data collection for this paper. I was literally sitting at an isolated picnic table ignoring everyone and archiving data as the weekend unfolded. I’m glad we were able to take the next weekend off! ❤

Facebook’s improved “Community Standards” still can’t resolve the central paradox

fb-policies1

On March 16, Facebook updated its “Community Standards” in ways that were both cosmetic and substantive. The version it replaced, though it had enjoyed minor updates, had been largely the same since at least 2011. The change comes on the heels of several other sites making similar adjustments to their own policies, including Twitter, YouTube, Blogger, and Reddit – and after months, even years, of growing frustration and criticism on the part of social media users about platforms and their policies. This frustration and criticism is of two minds: sometimes, criticism about overly conservative, picky, vague, or unclear restrictions; but also, criticism that these policies fall far short of protecting users, particularly from harassment, threats, and hate speech.

“Guidelines” documents like this one are an important part of the governance of social media platforms; though the “terms of service” are a legal contract meant to spell out the rights and obligations of both the users and the company — often to impose rules on users and indemnify the company against any liability for their actions — it is the “guidelines” that are more likely to be read by users who have a question about the proper use of the site, or find themselves facing content or other users that trouble them. More than that, they serve a broader rhetorical purpose: they announce the platform’s principles and gesture toward the site’s underlying approach to governance.

Facebook described the update as a mere clarification: “While our policies and standards themselves are not changing, we have heard from people that it would be helpful to provide more clarity and examples, so we are doing so with today’s update.” Most of the coverage among the technology press embraced this idea (like here, here, here, here, here, and here). But while Facebook’s policies may not have changed dramatically, so much is revealed in even the most minor adjustments.

First, it’s revealing to look not just at what the rules say and how they’re explained, but how the entire thing is framed. While these documents are now ubiquitous across social media platforms, it is still a curiosity that these platforms so readily embrace and celebrate the role of policing their users – especially amidst the political ethos of Internet freedom, calls for “Net neutrality” at the infrastructural level, and the persistent dreams of the open Web. Every platform must deal with this contradiction, and they often do it in the way they introduce and describe guidelines. These guidelines pages inevitably begin with a paragraph or more justifying not just the rules but the platform’s right to impose them, including a triumphant articulation of the platform’s aspirations.

Before this update, Facebook’s rules were justified as follows: “To balance the needs and interests of a global population, Facebook protects expression that meets the community standards outlined on this page.” In the new version, the priority has shifted, from protecting speech to ensuring that users “feel safe:” “Our goal is to give people a place to share and connect freely and openly, in a safe and secure environment.” I’m not suggesting that Facebook has stopped protecting speech in order to protect users. All social media platforms struggle to do both. But which goal is most compelling, which is held up as the primary justification, has changed.

This emphasis on safety (or more accurately, the feeling of safety) is also evident in the way the rules are now organized. What were eleven rule categories in the old version are now fifteen, grouped into four broad categories – the first of which is “keeping you safe.” This is indicative of the effect of the criticisms of recent years: that social networking sites like Facebook and Twitter have failed users, particularly women, in the face of vicious trolling.

fb-policies2

As for the rules themselves, it’s hard not to see them as the aftermath of so many particular controversies that have dogged the social networking site over the years. Facebook’s Community Standards increasingly look like a historic battlefield: while it may appear to be a bucolic pasture, the scars of battle remain visible, carved into the land, thinly disguised beneath the landscaping and signage. Some of the most recent skirmishes are now explicitly addressed: A new section on sexual violence and exploitation includes language prohibiting revenge porn. The rule against bullying and harassment now includes a bullet point prohibiting “Images altered to degrade private individuals,” a clear reference to the Photoshopped images of bruised and battered women that were deployed (note: trigger warning) against Anita Sarkeesian and others in the Gamergate controversy. The section on self-injury now includes a specific caveat that body modification doesn’t count.

In this version, Facebook seems extremely eager to note that contentious material is often circulated for publicly valuable purposes, including awareness raising, social commentary, satire, and activism. A version of this appears again and again, as part of the rules against graphic violence, nudity, hate speech, self injury, dangerous organizations, and criminal activity. In most cases, these socially valuable uses are presented as a caveat to an otherwise blanket prohibition: even hate speech, which is almost entirely prohibited, and in the strongest terms, now has a caveat protecting users who circulate examples of hate speech for the purposes of education and raising awareness. It is clear that Facebook is ever more aware of its role as a public platform, where contentious politics and difficult debate can occur. Now it must offer to patrol the tricky line between the politically contentious and the culturally offensive.

Oddly, in the rule about nudity, and only there, the point about socially acceptable uses is not a caveat, but part of an awkward apology for imposing blanket restrictions anyway: “People sometimes share content containing nudity for reasons like awareness campaigns or artistic projects. We restrict the display of nudity because some audiences within our global community may be sensitive to this type of content – particularly because of their cultural background or age. In order to treat people fairly and respond to reports quickly, it is essential that we have policies in place that our global teams can apply uniformly and easily when reviewing content. As a result, our policies can sometimes be more blunt than we would like and restrict content shared for legitimate purposes.” Sorry, Femen. On the other hand, apparently it’s okay if it’s cartoon nudity: “Restrictions on the display of both nudity and sexual activity also apply to digitally created content unless the content is posted for educational, humorous, or satirical purposes.” A nod to Charlie Hebdo, perhaps? Or just a curious inconsistency.

The newest addition to the document, and the one most debated in the press coverage, is the new way Facebook now articulates its long-standing requirement that users use their real identity. The rule was recently challenged by a number of communities eager to use Facebook under aliases or stage names, as well as by communities (such as Native Americans) who find themselves on the wrong side of Facebook’s policy simply because the traditions of naming in their culture do not fit Facebook’s. After the 2014 scuffle with drag queens about the right to use a stage identity instead of or alongside a legal one, Facebook promised to make its rule more accommodating. In this update Facebook has adopted the phrase “authentic identity,” their way of allowing adopted performance names but continuing to prohibit duplicate accounts. The update is also a chance for them to re-justify their rule: at more than one point in the document, and in the accompanying letter from Facebook’s content team, this “authentic identity” requirement is presented as assuring responsible and accountable participation: “Requiring people to use their authentic identity on Facebook helps motivate all of us to act responsibly, since our names and reputations are visibly linked to our words and actions.”

There is also some new language in an even older battle: for years, Facebook has been removing images of women breastfeeding, as a violation of its rules against nudity. This has long angered a community of women who strongly believe that sharing such images is not only their right, but important for new mothers and for the culture at large (only in 2007, 2008, 2010, 2011, 2012, 2013, 2014, 2015…). After years of disagreements, protests, and negotiations, in 2014 Facebook published a special rule saying that it would allow images of breast-feeding so long as they did not include an exposed nipple. This was considered a triumph by many involved, though reports continue to emerge of women having photos removed and accounts suspended despite the promise. This assurance reappears in the new version of the community standards just posted: “We also restrict some images of female breasts if they include the nipple, but we always allow photos of women actively engaged in breastfeeding or showing breasts with post-mastectomy scarring.” The Huffington Post reads this as (still) prohibiting breastfeeding photos if they include an exposed nipple, but if the structure of this sentence is read strictly, the promise to “always” allow photos of women breast-feeding seems to me to trump the previous phrase about exposed nipples. I may be getting nitpicky here, but it’s only as a result of years of back and forth about the precise wording of this rule, and Facebook’s willingness and ability to honor it in practice.

In my own research, I have tracked the policies of major social media platforms, noting both the changes and continuities, the justifications and the missteps. One could dismiss these guidelines as mere window dressing — as a performed statement of coherent values that do not in fact drive the actual enforcement of policy on the site, which so often turns out to be more slapdash or strategic or hypocritical. I find it more convincing to say that these are statements of both policy and principle that are struggled over at times, are deployed when they are helpful and can be sidestepped when they’re constraining, and that do important discursive work beyond simply guiding enforcement. These guidelines matter, and not only when they are enforced, and not only for lending strength to the particular norms they represent. Platforms adjust their guidelines in relation to each other, and smaller sites look to the larger ones for guidance, sometimes borrowing them wholesale. The rules as articulated by Facebook matter well beyond Facebook. And they perform, and therefore reveal in oblique ways, how platforms see themselves in the role of public arbiters of cultural value. They are also by no means the end of the story, as no guidelines in the abstract could possibly line up neatly with how they are enforced in practice.

Facebook’s newest update is consistent with changes over the past few years on many of the major sites, a common urge to both impose more rules and use more words to describe them clearly. This is a welcome adjustment, as so many of the early policy documents, including Facebook’s, were sparse, abstract, and unprepared for the variety and gravity of questionable content and awful behavior they would soon face. There are some laudable principles made explicit here. On the other hand, adding more words, more detailed examples, and further clarifications does not – cannot – resolve the other challenge: these are still rules that must be applied in specific situations, requiring judgment calls made by overworked, freelance clickworkers. And, while it is a relief to see Facebook and other platforms taking a firmer stand on issues like misogyny, rape threats, trolling, and self-harm, they often are accompanied by ever more restriction not just of bad behavior but of questionable content, a place where the mode of ‘protection’ means something quite different, much more patronizing. The basic paradox remains: these are private companies policing public speech, and are often intervening according to a culturally specific or a financially conservative morality. It is the next challenge for social media to strike a better balance in this regard: more effectively intervening to protect users themselves, while intervening less on behalf of users’ values.

This is cross-posted on the Culture Digitally blog.

Tumblr, NSFW porn blogging, and the challenge of checkpoints

After Yahoo’s high-profile purchase of Tumblr, when Yahoo CEO Marissa Mayer said that she would “promise not to screw it up,” this is probably not what she had in mind. Devoted users of Tumblr have been watching closely, worried that the cool, web 2.0 image blogging tool would be tamed by the nearly two-decade-old search giant. One population of Tumblr users, in particular, worried a great deal: those that used Tumblr to collect and share their favorite porn. This is a distinctly large part of the Tumblr crowd: according to one analysis, somewhere near or above 10% of Tumblr is “adult fare.”

Now that group is angry. And Tumblr’s new policies, which made them so angry, are a bit of a mess. Two paragraphs from now, I’m going to say that the real story is not the Tumblr/Yahoo incident, or how it was handled, or even why it’s happening. But here’s the quick run-down, and it’s confusing if you’re not a regular Tumblr user. Tumblr had a self-rating system: blogs with “occasional” nudity should self-rate as “NSFW”. Blogs with “substantial” nudity should rate themselves as “adult.” About two months ago, some Tumblr users noticed that blogs rated “adult” were no longer being listed with the major search engines. Then in June, Tumblr began taking both “NSFW” and “adult” blogs out of their internal search results — meaning, if you search in Tumblr for posts tagged with a particular word, sexual or otherwise, the dirty stuff won’t come up. Unless the searcher already follows your blog, in which case the “NSFW” posts will appear, but not the “adult” ones. Akk, here, this is how Tumblr tried to explain it:

What this meant is that your existing followers can largely still see your “NSFW” blog, but it would be very difficult for anyone new to find it. David Karp, founder and CEO of Tumblr, dodged questions about it on the Colbert Report, saying only that Tumblr doesn’t want to be responsible for drawing the lines between artistic nudity, casual nudity, and hardcore porn.

Then a new outrage emerged when some users discovered that, in the mobile version of Tumblr, some tag searches turn up no results, dirty or otherwise — and not just for obvious porn terms, like “porn,” but also for broader terms, like “gay”. Tumblr issued a quasi-explanation on their blog, which some commentators and users found frustratingly vague and unapologetic.

Ok. The real story is not the Tumblr/Yahoo incident, or how it was handled, or even why it’s happening. Certainly, Tumblr could have been more transparent about the details of their original policy, or the move in May or earlier to de-list adult Tumblr blogs in major search engines, or the decision to block certain tag results. Certainly, there’ve been some delicate conversations going on at Yahoo/Tumblr headquarters, for some time now, on how to “let Tumblr be Tumblr” (Mayer’s words) and also deal with all this NSFW blogging “even though it may not be as brand safe as what’s on our site” (also Mayer). Tumblr puts ads in its Dashboard, where only logged-in users see them, so arguably the ads are never “with” the porn — but maybe Yahoo is looking to change that, so that the “two companies will also work together to create advertising opportunities that are seamless and enhance the user experience.”

What’s ironic is that, I suspect, Tumblr and Yahoo are actually trying to find ways to remain permissive when it comes to NSFW content. They are certainly (so far) more permissive than some of their competitors, including Instagram, Blogger, Vine, and Pinterest, all of whom have moved in the last year to remove adult content, make it systematically less visible to their users, or prevent users from pairing advertising with it. The problem here is their tactics.

Media companies, be they broadcast or social, have fundamentally two ways to handle content that some but not all of their users find inappropriate.

First, they can remove some of it, either by editorial fiat or at the behest of the community. This means writing up policies that draw those tricky lines in the sand (no nudity? what kind of nudity? what was meant by the nudity?), and then either taking on the mantle (and sometimes the flak) of making those judgments themselves, or having to decide which users to listen to on which occasions for which reasons.

Second, and this is what Tumblr is trying, is what I’ll call the “checkpoint” approach. It’s by no means exclusive to new media: putting the X-rated movies in the back room at the video store, putting the magazines on the shelf behind the counter, wrapped in brown paper, scheduling the softcore stuff on Cinemax after bedtime, or scrambling the adult cable channel, all depend on the same logic. Somehow the provider needs to keep some content from some people and deliver it to others. (All the while, of course, they need to maintain their reputation as defender of free expression, and not appear to be “full of porn,” and keep their advertisers happy. Tricky.)

To run such a checkpoint requires (1) knowing something about the content, (2) knowing something about the people, and (3) having a defensible line between them.

First, the content. That difficult decision, about what is artistic nudity, what’s casual nudity, and what’s pornographic? It doesn’t go away, but the provider can shift the burden of making that decision to someone else — not just to get it off their shoulders, but sometimes to hand it to someone more capable of making it. Adult movie producers or magazine publishers can self-rate their content as pornographic. An MPAA-sponsored board can rate films. There are problems, of course: either the “who are these people?” problem, as in the mysterious MPAA ratings board, or the “these people are self-interested” problem, as when TV production houses rate their own programs. Still, this self-interest can often be congruent with the interests of the provider: X-rated movie producers know that their options may be the back room or not at all, and gain little in pretending that they’re something they’re not.

Next, the people. It may seem like a simple thing, just keeping the dirty stuff on the top shelf and carding people who want to buy it. Any bodega shopkeep can manage to do it. But it is simple only because it depends on a massive knowledge architecture, the driver’s license, that it didn’t have to generate itself. This is a government sponsored, institutional mechanism that, in part, happens to be engaged in age verification. It requires a massive infrastructure for record keeping, offices throughout the country, staff, bureaucracy, printing services, government authorization, and legal consequences for cases of fraud. All that so that someone can show a card and prove they’re of a certain age. (That kind of certified, high-quality data is otherwise hard to come by, as we’ll see in a moment.)

Finally, a defensible line. The bodega has two: the upper shelf and the cash register. The kids can’t reach, and even the tall ones can’t slip away uncarded, unless they’re also interested in theft. Cable services use encryption: the signal is scrambled unless the cable company authorizes it to be unscrambled. This line is in fact not simple to defend: the descrambler used to be in the box itself, which was in the home and, with the right tools and expertise, openable by those who might want to solder the right tab and get that channel unscrambled. This meant there had to be laws against tampering, another external apparatus necessary to make this tactic stick.

Tumblr? Well. All of this changes a bit when we bring it into the world of digital, networked, and social media. The challenges are much the same, and if we notice that the necessary components of the checkpoint are data, we can see how this begins to take on the shape that it does.

The content? Tumblr asked its users to self-rate, marking their blog as “NSFW” or “adult.” Smart, given that bloggers sharing porn may share some of Tumblr’s interest in putting it behind the checkpoint: many would rather flag their site as pornographic and get to stay on Tumblr, than be forbidden to put it up at all. Even flagged, Tumblr provides them what they need: the platform on which to collect content, a way to gain and keep interested viewers. The categories are a little ambiguous — where is the line between “occasional” and “substantial” nudity to be drawn? Why are the criteria only about amount, rather than degree (hard core vs soft core), category (posed nudity vs sexual act), or intent (artistic vs unseemly)? But then again, these categories are always ambiguous, and must always privilege some criteria over others.

The people? Here it gets trickier. Tumblr is not imposing an age barrier, they’re imposing a checkpoint based on desire, dividing those who want adult content from those who don’t. This is not the kind of data that’s kept on a card in your wallet, backed by the government, subject to laws of perjury. Instead, Tumblr has two ways to try to know what a user wants: their search settings, and what they search for. If users have managed to correctly classify themselves into “Safe Mode,” indicating in the settings that they do not want to see anything flagged as adult, and people posting content have correctly marked their content as adult or not, this should be an easy algorithmic equation: “safe” searcher is never shown “NSFW” content. The only problems would be user error: searchers who do not set their search settings correctly, and posters who do not flag their adult content correctly. Reasonable problems, and the kind of leakage that any system of regulation inevitably faces. Flagging at the blog level (as opposed to flagging each post as adult or not) is a bit of a dull instrument: all posts from my “NSFW” blog are being withheld from safe searchers, even the ones that have no questionable content — despite the fact that by their own definition a “NSFW” tumblr blog only has “occasional” nudity. Still, getting people to rate every post is a major barrier, few will do so diligently, and it doesn’t fit into simple “web button” interfaces.

Defending the dividing line? Since the content is digital, and the information about content and users is data, it should not be surprising that the line here is algorithmic. Unlike the top shelf or the back room, the adult content on Tumblr lives amidst the rest of the archive. And there’s no cash register, which means that there’s no unavoidable point at which use can be checked. There is the login, which explains why non-logged-in users are treated as only wanting “safe” content. But, theoretically, an “algorithmic checkpoint” should work based on search settings and blog ratings. As a search happens, compare the searcher’s setting with the content’s rating, and don’t deliver the dirty to the safe.
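To see how thin that line is, here is a toy sketch of the checkpoint logic described above, with invented field names and rating labels; it illustrates the comparison of a searcher’s setting against a blog’s self-rating, not Tumblr’s actual code.

```python
# A toy version of the "algorithmic checkpoint": compare the searcher's
# settings with each blog's self-rating and withhold what the policy says
# they should not see. Field names and rating labels are invented.
def checkpoint(results, safe_mode, following):
    visible = []
    for post in results:
        rating = post["blog_rating"]  # "none", "nsfw", or "adult", self-rated by the blogger
        if rating == "none":
            visible.append(post)
        elif rating == "nsfw":
            # Shown only to existing followers, and never to safe-mode searchers.
            if not safe_mode and post["blog"] in following:
                visible.append(post)
        # "adult" blogs are withheld from search results entirely.
    return visible
```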

But here’s where Tumblr took two additional steps, the ones that I think raise the biggest problem for the checkpoint approach in the digital context.

Tumblr wanted to extend the checkpoint past the customer who walks into the store and brings adult content to the cash register, out to the person walking by the shop window. And those passersby aren’t always logged in; they come to Tumblr in any number of ways. Because here’s the rub with the checkpoint approach: it does, inevitably, remind the population of possible users that you do allow the dirty stuff. The new customer who walks into the video store, and sees that there is a back room, even if they never go in, may reject your establishment for even offering it. Can the checkpoint be extended, to decide whether to even reveal to someone that there’s porn available inside? If not in the physical world, maybe in the digital?

When Tumblr delisted its adult blogs from the major search engines, they wanted to keep Google users from seeing that Tumblr has porn. This, of course, runs counter to the fundamental promise of Tumblr, as a publishing platform, that Tumblr users (NSFW and otherwise) count on. And users fumed: “Removal from search in every way possible is the closest thing Tumblr could do to deleting the blogs altogether, without actually removing 10% of its user base.” Here is where we may see the fundamental tension at the Yahoo/Tumblr partnership: they may want to allow porn, but do they want to be known for allowing porn?

Tumblr also apparently wanted to extend the checkpoint in the mobile environment — or perhaps were required to, by Apple. Many services, especially those spurred or required by Apple to do so, aim to prevent the “accidental porn” situation: if I’m searching for something innocuous, can they prevent a blast of unexpected porn in response to my query? To some degree, the “NSFW” rating and the “safe” setting should handle this, but of course content that a blogger failed (or refused) to flag still slips through. So Tumblr (and other sites)  institute a second checkpoint: if the search term might bring back adult content, block all the results for that term. In Tumblr, this is based on tags: bloggers add tags that describe what they’ve posted, and search queries seek matches in those tags.
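A toy sketch of this second, blunter checkpoint shows how crude it is: the filter acts on the query word itself, not on the content behind it. The blocklist terms below reflect the reporting described in this post and are illustrative only, not any actual product blocklist.

```python
# If the query term itself is on a blocklist, return nothing at all,
# regardless of what was actually posted under that tag.
BLOCKED_QUERY_TERMS = {"porn", "sex", "gay"}  # illustrative, per the reporting discussed here

def tag_search(query, tag_index):
    if query.lower() in BLOCKED_QUERY_TERMS:
        return []  # every result vanishes, innocuous or not
    return tag_index.get(query.lower(), [])
```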

When you try to choreograph users based on search terms and tags, you’ve doubled your problem. This is not clean, assured data like a self-rating of adult content or the age on a driver’s license. You’re ascertaining what the producer meant when they tagged a post using a certain term, and what the searcher meant when they use the same term as a search query. If I search for the word “gay,” I may be looking for a gay couple celebrating the recent DOMA decision on the steps of the Supreme Court — or “celebrating” bent over the arm of the couch. Very hard for Tumblr to know which I wanted, until I click or complain.

Sometimes these terms line up quite well, either by accident, or on purpose: for instance when users of Instagram indicated pornographic images by tagging them “pornstagram,” a made-up word that would likely mean nothing else. (This search term no longer returns any results, although — whoa! — it does on Tumblr!) But in just as many cases, when you use the word gay to indicate a photo of your two best friends in a loving embrace, and I use the word gay in my search query to find X-rated pornography, it becomes extremely difficult for the search algorithm to understand what to do about all of those meanings converging on a single word.

Blocking all results to the query “gay,” or “sex”, or even “porn” may seem, from one vantage point (Yahoo’s?), to solve the NSFW problem. Tumblr is not alone in this regard: Vine and Instagram return no results to the search term “sex,” though that does not mean that no one’s using it as a tag – though Instagram returns millions of results for “gay,” Vine, like Tumblr, returns none. Pinterest goes further, using the search for “porn” as a teaching moment: it pops up a reminder that nudity is not permitted on the site, then returns results which, because of the policy, are not pornographic. By blocking search terms/tags, no porn accidentally makes it to the mobile platform or to the eyes of its gentle user. But, this approach fails miserably at getting adult content to those that want it, and more importantly, in Tumblr’s case, it relegates a broadly used and politically vital term like “gay” to the smut pile.

Tumblr’s semi-apology has begun to make amends. The two categories, “NSFW” and “adult,” are now just “NSFW,” and the blogs marked as such are now available in Tumblr’s internal search and in the major search engines. Tumblr has promised to work on a more intelligent filtering system. But any checkpoint that depends on data that’s expressive rather than systemic — what we say, as opposed to what we say we are — is going to step clumsily both on the sharing of adult content and the ability to talk about subjects that have some sexual connotations, and could architect the spirit and promise out of Tumblr’s publishing platform.

This was originally posted at Culture Digitally.

Data Dealer is Disastrous

(or, Unfortunately, Algorithms Sound Boring.)

Finally, a video game where you get to act like a database!

This morning, the print version of the New York Times profiled the Kickstarter-funded game “Data Dealer.” The game is a browser-based single-player farming-style clicker with the premise that the player “turns data into cash” by playing the role of a behind-the-scenes data aggregator probably modeled on a real company like Acxiom.

Currently there is only a demo, but the developers have big future ambitions, including a multi-player version.  Here’s a screen shot:

Data Dealer screenshot
 
Data Dealer screen shot.

One reason Data Dealer is receiving a lot of attention is that there really isn’t anything else like it. It reminds me of the ACLU’s acclaimed “Ordering Pizza” video (now quite old) which vividly envisioned a dystopian future of totally integrated personal data through the lens of placing orders for pizza. The ACLU video shows you the user interface for a hypothetical software platform built to allow the person who answers the phone at an all-knowing pizza parlor to enter your order. 

(In the video, a caller tries to order a “double meat special” and is told that there will be an additional charge because of his high-blood pressure and high cholesterol. He complains about the high price and is told, “But you just bought those tickets to Hawaii!”)

The ACLU video is great because it uses a silly hook to get across some very important societal issues about privacy. It makes a topic that seems very boring — data protection and the risks involved in the interconnection of databases — vivid and accessible. As a teacher working with these issues, I still find the video useful today. Although it looks like the pizza ordering computer is running Windows 95.

Data Dealer has the same promise, but they’ve made some unusual choices. The ACLU’s goal was clearly public education about legal issues, and I think that the group behind Data Dealer has a similar goal. On their Kickstarter profile they describe themselves as “data rights advocates.”

Yet some of the choices made in the game design seem indefensible, as they might create awareness about data issues but they do so by promulgating misguided ideas about how data surveillance actually works. I found myself wondering: is it worth raising public awareness of these issues if they are presented in a way that is so distorted?

For the player’s data aggregator, the chief antagonist in the demo is public opinion. While clearly that would be an antagonist for someone like Acxiom, there are actually real risks to data aggregation that involve quantifiable losses. Data protection laws don’t exist solely because people are squeamish.

By focusing on public opinion, the game leaves me with the message not that privacy is really important, but that “some people like it.” Those darn privacy advocates sure are fussy! (They periodically appear, angrily, in a pop-up window.) This seems like a much weaker argument than “data rights advocates” should be making. It even feels like the makers of Data Dealer are trying to demean themselves! But maybe this was meant to be self-effacing.

I commend Data Dealer for grappling with one of the hardest problems that currently exists in the study of the social implications of computing: how to visualize things like algorithms and databases comprehensibly. In the game, your database is cleverly visualized as a vaguely vacuum-cleaner-like object. Your network is a kind of octopus-like shape. Great stuff!

However, some of the meatiest parts of the corporate data surveillance infrastructure go unmentioned, or are at least greatly underemphasized. How about… credit cards? Browser cookies? Other things are bizarrely over-emphasized relative to the actual data surveillance ecology: celebrity endorsements, online personality tests, and poster ad campaigns.

Algorithms are not covered at all (unless you count the “import” button that automatically “integrates” different profiles into your database.)  That’s a big loss, as the model of the game implies that things like political views are existing attributes that can be harvested by (for instance) monitoring what books you buy at a bookstore. The bookstores already hold your political views in this model, and you have to buy them from there. That’s not AT ALL how political views are inferred by data mining companies, and this gameplay model falsely creates the idea that my political views remain private if I avoid loyalty cards in bookstores.

A variety of the causal claims made in the game just don’t work in real life. A health insurance company’s best source for private health information about you is not mining online dating profiles for your stated weight. By emphasizing these unlikely paths for private data disclosure, the game obscures the real process and seems to be teaching those concerned about privacy to take useless and irrelevant precautions.

The crucial missing link is the absence of any depiction of the combination of disparate data to produce new insights or situations. That’s the topic the ACLU video tackles head-on. Although the game developers know that this is important (integration is what your vacuum-cleaner is supposed to be doing), that process doesn’t exist as part of the gameplay. Data aggregation in the game is simply shopping for profiles from a batch of blue sources and selling them to different orange clients (like the NSA or a supermarket chain). Yet combination of databases is the meat of the issue.

By presenting the algorithmic combination of data invisibly, the game implies that a corporate data aggregator is like a wholesaler that connects suppliers to retailers. But this is not the value data aggregation provides; that value is all about integration.
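To make that point concrete, here is a toy illustration, with entirely invented data and field names, of why combination is where the value (and the risk) lives: joining two individually mundane datasets yields an inference that neither contains on its own.

```python
# A toy illustration of data combination, assuming pandas. All names, emails,
# and fields are invented for the example.
import pandas as pd

loyalty_card = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "zip":   ["48104", "60614"],
})
magazine_subs = pd.DataFrame({
    "email":    ["a@example.com", "b@example.com"],
    "magazine": ["Outdoor Hunter", "Vegan Quarterly"],
})

# Joining the two sources produces a profile neither source holds by itself.
profile = loyalty_card.merge(magazine_subs, on="email")
profile["inferred_interest"] = profile["magazine"].map({
    "Outdoor Hunter": "firearms/outdoors",
    "Vegan Quarterly": "animal welfare",
})
print(profile)
```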

Finally, the game is strangely interested in the criminal underworld, promoting hackers as a route that a legitimate data mining corporation would routinely use. This is just bizarre. In my game, a real estate conglomerate wanted to buy personal data so I gathered it from a hacker who tapped into an Xbox Live-like platform. I also got some from a corrupt desk clerk at a tanning salon. This completely undermines the game as a corporate critique, or as educational.

In sum, it’s great to see these hard problems tackled at all, but we deserve a better treatment of them. To be fair, this is only the demo and it may be that the missing narratives of personal data will be added. A promised addition is that you can create your own social media platform (Tracebook) although I did not see this in my demo game. I hope the missing pieces are added. (It seems more unlikely that the game’s current flawed narratives will be corrected.)

My major reaction to the game is that it highlights the hard problems educational game developers face. They want to make games for change, but effective gameplay and effective education are such different goals that they often conflict. For the sake of a salable experience, the developers here clearly felt they had to stake their hopes on the former and abandon the latter, and with it, reality.

(This post was cross-posted at multicast.)

SOPA and the strategy of forced invisibility

Since I supported the blacking out of the MSR Social Media Collective blog, to which I sometimes contribute, and of Culture Digitally, which I co-organize, in order to join the SOPA protest led by the “Stop American Censorship” effort, the Electronic Frontier Foundation, Reddit, and Wikipedia, I thought I should weigh in with my own concerns about the proposed legislation.

While it’s reasonable for Congress to look for progressive, legislative ways to enforce copyrights and discourage flagrant piracy, SOPA (the Stop Online Piracy Act) and PIPA (the Protect IP Act) now under consideration are a fundamentally dangerous way to go about it. Their many critics have raised compelling reasons why [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. But in my eyes, they are most dangerous because of their underlying logic: policing infringement by rendering sites invisible.

Under SOPA and PIPA, if a website is even accused of hosting or enabling infringing materials, the Attorney General can order search engines to delete that site from their listings, require ISPs to block users’ access to it, and require payment services (like PayPal) and advertising networks to cancel their accounts with it. (This last step can even be taken by copyright holders themselves, with only a good faith assertion that the site in question is infringing.) What a tempting approach to policing the Internet: rather than pursuing and prosecuting this site and that site, in an endless game of whack-a-mole, just turn to the large-scale intermediaries and use their power to make websites available in order to make them unavailable. It shows all too plainly that the Internet is not some wide open, decentralized, unregulatable space, as some have believed. But it undercuts the longstanding American tradition of how to govern information, which has always erred on the side of letting information, even abhorrent or criminal information, be accessible to citizens, so we can judge for ourselves. Making it illegal to post something is one thing, but wiping the entire site clean off the board as if it never existed is another.
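To see why intermediary-level enforcement is so sweeping, here is a schematic toy resolver (the domain names, addresses, and hand-written blocklist are invented; this is a sketch of the general technique, not the bills’ statutory mechanism). The listed site remains online, but for anyone who depends on this intermediary it simply ceases to exist:

```python
# Toy sketch of intermediary-level blocking. Domains, addresses, and the
# blocklist are invented; this is a schematic, not the statute's mechanism.

BLOCKLIST = {"accused-site.example"}  # populated by order or good-faith notice

DNS_RECORDS = {
    "accused-site.example":    "203.0.113.7",   # documentation-range addresses
    "legitimate-site.example": "198.51.100.5",
}

def resolve(domain: str) -> str | None:
    """Return an address unless the intermediary has been told to go blind."""
    if domain in BLOCKLIST:
        return None  # the site still exists, but it is effectively invisible
    return DNS_RECORDS.get(domain)

print(resolve("accused-site.example"))     # None: rendered unreachable
print(resolve("legitimate-site.example"))  # 198.51.100.5
```

Nothing at the accused site is adjudicated or removed; the intermediary simply stops pointing to it, which is the whole appeal, and the whole danger, of the approach.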

Preventing an allegedly infringing site from being found is problematic in itself, a clear form of “prior restraint.” But it is exacerbated by the fact that whole sites might be rendered invisible on the basis of just bits of infringing content they may host. This is particularly troubling for sites that host user-generated content, where one infringing thread, post, or community might co-exist amidst a trove of other legitimate content. Under SOPA and PIPA, a court order could remove not just the offending thread, but the entire site from Google’s search engine, from ISPs, and from ad networks, all in a blink.

These are the same strategies that China, Iran, and Vietnam currently use to restrict political speech (as prominent critics have charged), and that were recently used against Wikileaks right here at home. When Amazon kicked Wikileaks off its cloud computing servers, when Wikileaks was de-listed by one DNS operator, and when Mastercard and Paypal refused to take donations for the organization, they were attempting to render Wikileaks invisible before a court ever determined, or even alleged, that Wikileaks had broken any laws. So it is not hypothetical that this tactic of rendering sites invisible is dangerous not only for commercial speech and the expressive rights of individual users, but for vital, contested political speech. SOPA and PIPA would simply organize these tactics into a concerted, legally enforced effort to erase, one that all search engines and ISPs would be obligated to impose.

A lighthearted aside: In the film Office Space, the soulless software company chose not to fire the hapless Milton. Instead, they took away his precious stapler, moved him to the basement, and simply stopped sending him paychecks. We laughed at the blank-faced cruelty, because we recognized how tempting this solution would be, a deft way to avoid having to fire someone to their face. Congress is considering the same “Bobs” strategy here. But while it may be fine for comedy, this is hardly the way to address complex legal challenges around the distribution of information, challenges that should be dealt with in the clear light of a courtroom. And it risks rendering invisible elements of the web that might deserve to remain visible.

We are at a point of temptation. The Internet is both so powerful and so unruly because anyone can add their site to it (be it noble or criminal, informative or infringing) and it will be found. It depends on, and presumes, a principle of visibility. Post the content, and it is available. Request it, from anywhere in the world, and the DNS servers will find it. Search for it in Google, and it will appear. But as those who find this network most threatening come calling, with legitimate (at least in the abstract) calls to protect children / revenue / secrets / civility, we will be sorely tempted to address these challenges simply by wiping the offending sites clean off the network.

This is why the responses to SOPA and PIPA, most prominently the January 18 blackouts by Reddit, Wikipedia, and countless blogs, are so important. Removing their content, even for a day, is meant to show how dangerous this forced invisibility could be. It should come as no surprise that, while many other Internet companies have voiced their concerns about SOPA, it is Wikipedia and Reddit that have gone the farthest in challenging the law. Not only do they host, i.e. make visible, an enormous amount of user-generated content; they are also themselves governed in important ways by their users. Their decisions to support a blackout were themselves networked affairs that benefited from all of their users having the ability to participate, and that recognized a commitment to openness as part of their fundamental mission.

Whether you care about the longstanding U.S. legal tradition of information freedoms or the newly emergent structural logic of the Internet as a robust space of public expression, both require a new and firm commitment in our laws: to ensure that the Internet remains navigable, that sites remain visible, that pointers point and search engines list, regardless of the content. Sites hosting or benefiting from illegal or infringing content should be addressed directly by courts and law enforcement, armed with a legal scalpel that’s delicate enough to avoid carving off huge swaths of legitimate expression. We might be able to build a coalition of content providers and technology companies willing to partner on anti-piracy legislation if copyright holders could admit that they need to go after the determined, underground piracy networks bent on evading regulation, and not in the same gesture put YouTube at risk for a video of a kid dancing to a Prince tune; there is a whole lot of middle ground there. But a policy premised on rendering parts of the web invisible is not going to accomplish that. And embracing this strategy of forced invisibility is too damaging to what the Internet is and could be as a public resource.

(Cross-posted at Culture Digitally.)