The OKCupid data release fiasco: It’s time to rethink ethics education

In mid 2016, we confront another ethical crisis related to personal data, social media, the public internet, and social research. This time, it’s a release of some 70,000 OKCupid users’ data, including some very intimate details about individuals. Responses from several communities of practice highlight the complications of using outdated modes of thinking about ethics and human subjects when considering new opportunities for research through publicly accessible or otherwise easily obtained data sets (e.g., Michael Zimmer produced a thoughtful response in Wired and Kate Crawford pointed us to her recent work with Jacob Metcalf on this topic). There are so many things to talk about in this case, but here, I’d like to weigh in on conversations about how we might respond to this issue as university educators.

The OKCupid case is just the most recent of a long list of moments that reveal how doing something because it is legal is no guarantee that it is ethical. To invoke Kate Crawford’s apt Tweet from March 3, 2016:

This is a key point of confusion, apparently. Michael Zimmer, reviewing multiple cases of ethical problems emerging when large datasets are released by researchers, emphasizes the flaw in this response, noting:

This logic of “but the data is already public” is an all-too-familiar refrain used to gloss over thorny ethical concerns (in Wired).

In the most recent case, the researcher in question, Emil Kirkegaard, uses this defense in response to questions asking if he anonymized the data: “No. Data is already public.” I’d like to therefore add a line to Crawford’s simple advice:

Data comes from people. Displaying it for the world to see can cause harm.

A few days after this data was released, it was removed from the Open Science Framework, after a DMCA claim by OKCupid. Further legal action could follow. All of this is a good step toward protecting the personal data of users, but in the meantime, many people have already downloaded the dataset and are now sharing it in other forms. As Scott Weingart, digital humanities specialist at Carnegie Mellon, warns:

As a long-term university educator, a faculty member at the same university where Kirkegaard is pursuing his master’s degree, and a researcher of digital ethics, I find this OKCupid affair frustrating: How is it possible that we continue to reproduce this logic, despite the multiple times “it’s publicly accessible therefore I can do whatever I want with it” has proved harmful? We must attribute some responsibility to existing education systems. Of course, the problem doesn’t start there, and “education system” can be a formal institution or simply the way we learn as everyday knowledge is passed around in various forms. So there are plenty of arenas where we learn (or fail to learn) to make good choices in situations fraught with ethical complexity. Let me offer a few trajectories of thought:

What data means to regulators

The myth of “data is already public, therefore ethically fine to use for whatever” persists because traditional as well as contemporary legal and regulatory statements still make a strong distinction between public and private. This is no longer a viable distinction, if it ever was. When we define actions or information as being either in the private or the public realm, we set up a false binary that does not hold in practice or perception. Information is not a stable object that emerges in and remains located in a particular realm or sphere. Data becomes informative or is noticed only when it becomes salient for some reason. On OKCupid or elsewhere, people publish their picture, religious affiliation, or sexual preference in a dating profile as part of a performance of their identity for someone else to see. This placement of information is intended to be part of an expected pattern of interaction — someone is supposed to see and respond to this information, which might then spark conversation or a relationship. This information is not chopped up into discrete units in either a public or private realm. Rather, it is performative and relational. When we rely only on regulatory language, the more nuanced subtleties of context are rendered invisible.

What data means to people who produce it

Whether information or data is experienced or felt as something public or private is quite different from the information itself. Violation of privacy can be an outcome at any point. This is not related to the data, but the ways in which the data is used. From this standpoint, data can only logically exist as part of continual flows of timespace contexts; therefore, to extract data as points from one or the other static sphere is illogical. Put more simply, the expectation of privacy about one’s profile information comes into play when certain information is registered and becomes meaningful for others. Otherwise, the information would never enter into a context where ‘public’, ‘private’, ‘intimate’, ‘secret’, or any other adjective operates as a relevant descriptor.

This may not be the easiest idea for us to understand, since we generally conceptualize data as static and discrete informational units that can be observed, collected, and analyzed. In experience, this is simply not true. The treatment of personal data is important. It requires sensitivity to the context as well as an understanding of the tools that can be used to grapple with this complexity.

What good researchers know about data and ethics

Reflexive researchers know that regulations may be necessary, but they are insufficient guides for ethics. While many lessons from previous ethical breaches in scientific research find their way into regulatory guidelines or law, unique ethical dilemmas arise as a natural part of any research of any phenomenon. According to the ancient Greeks, doing the right thing is a matter of phronesis or practical wisdom whereby one can discern what would constitute the most ethical choice in any situation, an ability that grows stronger with time, experience, and reflection.

This involves much more than simply following the rules or obeying the letter of the law. Phronesis is a very difficult thing to teach, since it is a skill that emerges from a deep understanding of the possible intimacy others have with what we outsiders might label ‘data.’ This reflection requires that we ask different questions than what regulatory prescriptions might require. In addition to asking the default questions such as “Is the data public or private?” or “Does this research involve a ‘human subject’?” we should be asking “What is the relationship between a person and her data?” Or “How does the person feel about his relationship with his data?” These latter questions don’t generally appear in regulatory discussions about data or ethics. These questions represent contemporary issues that have emerged as a result of digitization plus the internet, an equation that illustrates how information can be duplicated without limits and is swiftly and easily separated from its human origins once it disseminates or moves through the network. In a broader sense, this line of inquiry highlights the extent to which ‘data’ can be mischaracterized.

Where do we learn the ethic of accountability?

While many scholars concerned with data ethics discuss complex questions, the complexity doesn’t often end up in traditional classrooms or regulatory documents. We learn to ask the tough questions when complicated situations emerge, or when a problem or ethical dilemma arises. At this point, we may question and adjust our mindset. This is a process of continual reflexive interrogation of the choices we’re making as researchers. And we get better at it with time and practice.

We might be disappointed but we shouldn’t be surprised that many people end up relying on outdated logic that says ‘if data is publicly accessible, it is fair game for whatever we want to do with it’. This thinking is so much easier and quicker than the alternative, which involves not only judgment, responsibility, and accountability, but also speculation about the potential future impact of one’s research.

Learning contemporary ethics in a digitally-saturated and globally networked epoch involves considering the potential impact of one’s decisions and then making the best choice possible. Regulators are well aware of this, which is why they (mostly) include exceptions and specific case guidance in statements about how researchers should treat data and conduct research involving human subjects.

Teaching ethics as ‘levels of impact’

So, how might we change the ways we talk and teach about ethics to better prepare researchers to take the extra step of reflecting on how their research choices matter in the bigger picture? First, we can make this an easier topic to broach by addressing ethics as being about choices we make at critical junctures; choices that will invariably have impact.

We make choices, consciously or unconsciously, throughout the research process. Simply stated, these choices matter. If we do not grapple with natural and necessary change in research practices our research will not reflect the complexities we strive to understand. — Annette Markham, 2003.

Ethics can thus be considered a matter of methods. “Doing the right thing” is an everyday activity, as we make multiple choices about how we might act. Our decisions and actions transform into habits, norms, and rules over time and repetition. Our choices carry consequences. As researchers, we carry more responsibility than users of social media platforms. Why? Because we hold more cards when we present findings of studies and make knowledge statements intended to present some truth (big or little T) about the world to others.

To dismiss our everyday choices as being only guided by extant guidelines is a naïve approach to how ethics are actually produced. Beyond our reactions to this specific situation, as Michael Zimmer emphasizes in his recent Wired article, we must address the conceptual muddles present in big data research.

This is quite a challenge when the terms are as muddled as the concepts. Take the word ‘ethics.’ Although it’s a term that operates as an important foundation in our work as researchers, it is also abstract, vague, and daunting because it can feel like you ought to have philosophy training to talk about it. As educators, we can lower the barrier to entry into ethical concepts by taking a ‘what if’ impact approach, or discussing how we might assess the ‘creepy’ factor in our research design, data use, or technology development.

At the most basic level of an impact approach, we might ask how our methods of data collection impact humans, directly. If one is interviewing, or the data is visibly connected to a person, this is easy to see. But a distance principle might help us recognize that when the data is very distant from where it originated, it can seem disconnected from persons, or what some regulators call ‘human subjects.’ At another level, we can ask how our methods of organizing data, analytical interpretations, or findings as shared datasets are being used — or might be used — to build definitional categories or to profile particular groups in ways that could impact livelihoods or lives. Are we contributing positive or negative categorizations? At a third level of impact, we can consider the social, economic, or political changes caused by one’s research processes or products, in both the short and long term. These three levels raise different questions than those typically raised by ethics guidelines and regulations. This is because an impact approach is targeted toward the possible or probable impact, rather than the prevention of impact in the first place. It acknowledges that we change the world as we conduct even the smallest of scientific studies, and therefore, we must take some personal responsibility for our methods.

Teaching questions rather than answers

Over the six years I spent writing guidelines for the updated “Ethics and decision making in internet research” document for the Association of Internet Researchers (AoIR), I realized we had shifted significantly from statements to questions in the document. This shift was driven in part by the fact that we came from many different traditions and countries and we couldn’t come to consensus about what researchers should do. Yet we quickly found that posing these questions provided the only stable anchor point as technologies, platforms, and uses of digital media were continually changing. As situations and contexts shifted, different ethical problems would arise. This seemingly endless variation required us to reconsider how we think about ethics and how we might guide researchers seeking advice. While some general ethical principles could be considered in advance, best practices emerged through rigorous self-questioning throughout the course of a study, from the outset to well after the research was completed. Questions were a form that also allowed us to emphasize the importance of active and conscious decision-making, rather than more passive adherence to legal, regulatory, or disciplinary norms.

A question-based approach emphasizes that ethical research is a continual and iterative process of both direct and tacit decision making that must be brought to the surface and consciously accounted for throughout a project. This process of questioning is most obvious when the situation or direction is unclear and decisions must be made directly. But when the questions as well as answers are embedded in and produced as part of our habits, these must be recognized for what they once were — choices at critical junctures. Then, rather than simply adopting tools as predefined options, or taking analytical paths dictated by norm or convention, we can choose anew.

This recent case of the OKCupid data release provides an opportunity for educators to revisit our pedagogical approaches and to confront this confusion head on. It’s a call to think about options that reach into the heart of the matter, which means adding something to our discussions with junior researchers to counteract the depersonalizing effects of generalized top down requirements, forms with checklists, and standardized (and therefore seemingly irrelevant) online training modules.

  • This involves questioning as well as presenting extant ethical guidelines, so that students understand more about the controversies and ongoing debates behind the scenes as laws and regulations are developed.
  • It demands that we stop treating IRB or ethics board requirements as bureaucratic hoops to jump through, so that students can appreciate that in most studies, ethics require revisiting.
  • It means examining the assumptions underlying ethical conventions and reviewing debates about concepts like informed consent, anonymizing data, or human subjects, so that students better appreciate these as negotiable and context-dependent, rather than settled and universal concepts.
  • It involves linking ethics to everyday logistic choices made throughout a study, including how questions are framed, how studies are designed, and how data is managed and organized. In this way students can build a practice of reflection on and engagement around their research decisions as meaningful choices rather than externally prescribed procedures.
  • It asks that we understand ethics as they are embedded in broader methodological processes — perhaps by discussing how analytical categories can construct cultural definitions, how findings can impact livelihoods, or how writing choices and styles can invoke particular versions of stories. In this way, students can understand that their decisions carry over into other spheres and can have unintended or unanticipated results.
  • It requires adding positive examples to the typically negative cases, which tend to describe what we should not do, or how we can get in trouble. In this way, students can consider the (good and important) ethics of conducting research that is designed to make actual and positive transformations in the broader world.

This list is intended to spark imagination and conversation more than to explain what’s currently happening (for that, I would point to Metcalf’s 2015 review of various pedagogical approaches to ethics in the U.S.). There are obviously many ways to address or respond to this most recent case, or any of the dozens of cases that pose ethical problems.

I, for one, will continue talking more in my classrooms about how, as researchers, our work can be perceived as creepy, stalking, or harassing; exploring how our research could cause harm in the short or long term; and considering what sort of futures we are facilitating as a result of our contributions in the here and now.

For more about data and ethics, I recommend the annual Digital Ethics Symposium at Loyola University-Chicago; the growing body of work emerging from the Council for Big Data, Ethics, & Society; and the Association of Internet Researchers (AoIR) ethics documents and the work of their longstanding ethics committee members. For current discussions around how we conceptualize data in social research, one might take a look at special issues devoted to the topic, like the 2013 issue on Making Data: Big data and beyond in First Monday, or the 2014 issue on Critiquing Big Data in the International Journal of Communication. These are just the first works off the top of my head that have inspired my own thinking and research on these topics.

How Do Users Take Collective Action Against Online Platforms? CHI Honorable Mention

What factors lead users in an online platform to join together in mass collective action to influence those who run the platform? Today, I’m excited to share that my CHI paper on the reddit blackout has received a Best Paper Honorable Mention! (Read the pre-print version of my paper here)

When users of online platforms complain, we’re often told to leave if we don’t like how a platform is run. Beyond exit or loyalty, digital citizens sometimes take a third option, organizing to pressure companies for change. But how does that come about?

I’m seeking reddit moderators to collaborate on the next stage of my research: running experiments together with subreddits to test theories of moderation. If you’re interested, you can read more here. Also, I’m presenting this work as part of larger talks at the Berkman Center on Feb 23 and the Oxford Internet Institute on March 16. I would love to see you there!

Having a formalized voice with online platforms is rare, though it has happened with San Francisco drag queens, the newly-announced Twitter Trust and Safety Council or the EVE player council, where users are consulted about issues a platform faces. These efforts typically keep users in positions of minimal power on the ladder of citizen participation, but they do give some users some kind of voice.

Another option is collective action, leveraging the collective power of users to pressure a platform to change how that platform works. To my knowledge, this has only happened four times on major U.S. platforms: when AOL community leaders settled a $15 million class action lawsuit for unpaid wages, when DailyKos writers went on strike in 2008, the recent Uber class action lawsuit, and the reddit blackout of July 2015, when moderators of 2,278 subreddits shut down their communities to pressure the company for better coordination and better moderation tools. They succeeded.

What factors lead communities to participate in such a large scale collective action? That’s the question that my paper set out to answer, combining statistics with the “thick data” of qualitative research.

The story of how I answered this question is also a story about finding ways to do large-scale research that include the voices and critiques of the people whose lives we study as researchers. In the turmoil of the blackout, amidst volatile and harmful controversies around hate speech, harassment, censorship, and the blackout itself, I made special effort to do research that included redditors themselves.

Theories of Social Movement Mobilization

Social movement researchers have been asking how movements come together for many decades, and there are two common schools, responding to early work to quantify collective action (see Olson, Coleman):

Political Opportunity Theories argue that social movements need the right people and the right moment. According to these theories, a movement happens when grievances are high, when social structure among potential participants is right, and when the right opportunity for change arises. For more on political opportunity theory, see my Atlantic article on the Facebook Equality Meme this past summer.

Resource Mobilization Theories argue that successful movements are explained less by grievances and opportunities and more by the resources available to movement actors. In their view, collective action is something that groups create out of their resources rather than something that arises out of grievances. They’re also interested in social structure, often between groups that are trying to mobilize people (read more).

A third voice in these discussions are the people who participate in movements themselves, voices that I wanted to have a primary role in shaping my research.

How Do You Study a Strike As It Unfolds?

I was lucky enough to be working with moderators and collecting data before the blackout happened. That gave me a special vantage for combining interviews and content analysis with statistical analysis of the reddit blackout.

Together with redditors, I developed an approach of “participatory hypothesis testing,” where I posed ideas for statistics on public reddit threads and worked together with redditors to come up with models that they agreed were a fair and accurate analysis of their experience. Grounding that statistical work involved a whole lot of qualitative research as well.

If you like that kind of thing, here are the details:

In the CHI paper, I analyzed 90 published interviews with moderators from before the blackout, over 250 articles outside reddit about the blackout, discussions in over 50 subreddits that declined to join the blackout, public statements by over 200 subreddits that joined the blackout, and over 150 discussions in blacked out subreddits after their communities were restored. I also read over 100 discussions in communities that chose not to join. Finally, I conducted 90-minute interviews with 13 moderators of subreddits of all sizes, including those that joined and declined to join the blackout.

To test hypotheses developed with redditors, I collected data from 52,735 non-corporate subreddits that received at least one comment in June 2015, alongside a list of blacked-out subreddits. I also collected data on moderators and comment participation for the period surrounding the blackout.
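For readers who want the mechanics, here is a minimal sketch of how subreddit-level counts like these could be assembled; the file names and column names are hypothetical placeholders, not the exact pipeline used for the paper.

```python
# Sketch (hypothetical file and column names): count June 2015 comments per
# subreddit and flag the ones that appear on a list of blacked-out subreddits.
import pandas as pd

comments = pd.read_csv("comments_2015_06.csv", usecols=["subreddit", "author"])

# Every subreddit in this table received at least one comment in June 2015.
counts = (comments.groupby("subreddit")
                  .size()
                  .rename("june_comments")
                  .reset_index())

blackout = pd.read_csv("blackout_subreddits.csv")   # one column: 'subreddit'
counts["blackout"] = counts["subreddit"].isin(blackout["subreddit"]).astype(int)

print(len(counts), "subreddits with at least one June 2015 comment")
```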

So What’s The Answer? What Factors Predict Participation in Action Against Platforms?

In the paper, I outline major explanations offered by moderators and translate them into a statistical model that corresponds to major social movement theories. I found evidence confirming many of redditors’ explanations across all subreddits, including aspects of classic social movement theories. These findings are as much about why people choose *not* to participate as they are about what factors are involved in joining:

    • Moderator Grievances were important predictors of participation. Subreddits with greater amounts of work, and whose work was riskier, were more likely to join the blackout.
    • Subreddit Resources were also important factors. Subreddits with more moderators were more likely to join the blackout. Although “default” subreddits played an important role in organizing and negotiating in the blackout, they were no more or less likely to participate, holding all else constant.
    • Relations Among Moderators were also important predictors, and I observed several cases where “networks” of closely-allied subreddits declined to participate.
    • Subreddit Isolation was also an important factor, with more isolated subreddits less likely to join, and moderators who participate in “metareddits” more likely to join.
    • Moderator Relations Within Their Groups were also important; subreddits whose moderators participated more in their groups were less likely to join the blackout.

Many of my findings go into details from my interviews and observations, well beyond just a single statistical model; I encourage you to read the pre-print version of my paper.

What’s Next For My reddit Research?

The reddit blackout took me by surprise as much as anyone, so now I’m back to asking the questions that brought me to moderators in the first place.

THANK YOU REDDIT! & Acknowledgments


First of all, THANK YOU REDDIT! This research would not have been possible without generous contributions from hundreds of reddit users. You have been generous all throughout, and I deeply appreciate the time you invested in my work.

Many other people have made this work possible; I did this research during a wonderful summer internship at the Microsoft Research Social Media Collective, mentored by Tarleton Gillespie and Mary Gray. Mako Hill introduced me to social movement theory as part of my general exams. Molly Sauter, Aaron Shaw, Alex Leavitt, and Katherine Lo offered helpful early feedback on this paper. My advisor Ethan Zuckerman remains a profoundly important mentor and guide through the world of research and social action.

Finally, I am deeply grateful for family members who let me ruin our Fourth of July weekend to follow the reddit blackout closely and set up data collection for this paper. I was literally sitting at an isolated picnic table ignoring everyone and archiving data as the weekend unfolded. I’m glad we were able to take the next weekend off! ❤

Followup: 10 Factors Predicting Participation in the Reddit Blackout. Building Statistical Models of Online Behavior through Qualitative Research

Three weeks ago, I shared dataviz and statistical models predicting participation in the Reddit Blackout in July 2015. Since then, many moderators have offered feedback and new ideas for the data analysis, alongside their own stories. Earlier today, I shared this update with redditors.

UPDATE, Sept 16, 9pm ET: Redditors brilliantly spotted an important gap in my dataset and worked with me to resolve it. After taking the post down for two days, I am posting the corrected results. Thanks to their quick work, the graphics and findings in this post are more robust.


This July, moderators of 2,278 subreddits joined a “blackout,” demanding better communication and improved moderator tools. As part of my wider research on the work and position of moderators in online communities, I have also been asking the question: who joined the July blackout, and what made some moderators and subs more likely to participate?

Reddit Moderator Network July 2015, including NSFW Subs, with Networks labeled

Academic research on the work of moderators would lead us to expect that the most important predictor of blackout participation would be the workload, which creates common needs across subs. Aaron Shaw and Benjamin Mako Hill argue, based on evidence from Wikia, that as the work of moderating becomes more complex within a community, moderators grow in their own sense of common identity and common needs as distinct from their community (read Shaw and Hill’s Wikia paper here). Postigo argues something similar in terms of moderators’ relationship to a platform: when moderators feel like they’re doing huge amounts of work for a company that’s not treating them well, they can develop common interests and push back (read my summary of Postigo’s AOL paper here).

Testing Redditors’ Explanations of The Blackout

After posting an initial data analysis to reddit three weeks ago, dozens of moderators generously contacted me with comments and offers to let me interview them. In this post, I test hypotheses straight from redditors’ explanations of what led different subreddits to join the blackout. By putting all of these hypotheses into one model, we can see how important they were across reddit, beyond any single sub. (see my previous post) (learn more about my research ethics and my promises to redditors)

TLDR:

  • Subs who shared mods with other blackout subs were more likely to join the blackout, but controlling for that:
  • Default subs were more likely to join the blackout
  • NSFW subs were more likely to join the blackout
  • Subs with more moderators were slightly more likely to join the blackout
  • More active subs were more likely to join the blackout
  • More isolated subs were less likely to join the blackout
  • Subs whose mods participate in metareddits were more likely to join the blackout
  • Subs whose mods get and give help in moderator-specific subs were no more or less likely to join the blackout

In my research I have read over a thousand reddit threads, interviewed over a dozen moderators, archived discussions in hundreds of subreddits, and collected data from the reddit API— starting before the blackout. Special thanks to everyone who has spoken with me and shared data.

Improving the Blackout Dataset With Comment Data

Based on conversations with redditors, I collected more data:

  • Instead of the top 20,000 subreddits by subscribers, I now focus on the top subreddits by number of comments in June 2015, thanks to a comment dataset collected by /u/Stuck_In_the_Matrix
  • I updated my /u/GoldenSights amageddon dataset to include 400 additional subs, after feedback from redditors on /r/TheoryOfReddit
  • I include “NSFW” subreddits intended for people over 18
  • I account for more bots thanks to redditor feedback
  • I account for changes in subreddit leadership (with some gaps for subreddits that have experienced substantial leadership changes since July)

In this dataset, half of the 10 most active subs joined the blackout, 24% of the 100 most active, 14.2% of the 1,000 most active, and 4.7% of the 20,000 most active subreddits.

To illustrate the data, here are two charts of the top 52,754 most active subreddits as they would have stood at the end of June. The font size and node size are related to the log-transformed number of comments from June. Ties between subreddits represent shared moderators. The charts are laid out using the ForceAtlas2 layout on Gephi, which has separated out some of the more prominent subreddit networks, including the ImaginaryNetwork, the “SFW Porn” Network, and several NSFW networks (I’ve circled notable networks in the network graph at the top of this post).
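As an illustration of how such a chart can be constructed (this is a sketch with toy data, not the code behind the figures above), the following builds a co-moderation network, sizes nodes by log-transformed comment counts, and writes a file that Gephi can lay out with ForceAtlas2.

```python
# Sketch with toy data: subreddits are nodes, sized by log comment volume;
# an edge links two subreddits that share at least one moderator.
import math
from itertools import combinations
import networkx as nx

mods = {                       # subreddit -> set of moderator usernames (toy data)
    "r/example_a": {"mod1", "mod2"},
    "r/example_b": {"mod2", "mod3"},
    "r/example_c": {"mod4"},
}
june_comments = {"r/example_a": 120000, "r/example_b": 950, "r/example_c": 40}

G = nx.Graph()
for sub, count in june_comments.items():
    G.add_node(sub, size=math.log(count + 1))      # node size ~ log(comments)

for a, b in combinations(mods, 2):
    shared = mods[a] & mods[b]
    if shared:
        G.add_edge(a, b, weight=len(shared))       # tie strength = shared moderators

nx.write_gexf(G, "comod_network_june2015.gexf")    # open in Gephi, run ForceAtlas2
```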

Reddit Blackout July 2015: Top 20,000 Subreddits by comments

Redditors’ Explanations Of Blackout Participation

With 2,278 subreddits joining the blackout, redditors have many theories for what experiences and factors led subs to join the blackout. In the following section, I share these theories and then test one big logistic regression model that accounts for all of the theories together. In these tests, I consider 52,745 subreddits that had at least one comment in June 2015. A total of 1,342 of these subreddits joined the blackout.

The idea of blacking out had come up before. According to one moderator, blacking out was first discussed by moderators three years ago as a way to protest Gawker’s choice to publish details unmasking a reddit moderator. Although some subs banned Gawker URLs from being posted to their communities, the blackout didn’t take off. While some individual subreddits have blacked out in the intervening years, this was the first time that many subs joined together.

I tested these hypotheses with the set of Firth logistic regression models shown below. The final model (on the right) offers the best fit of all the models, with a McFadden R2 of 0.123, which is pretty good.

PREDICTING PARTICIPATION IN THE REDDIT BLACKOUT JULY 2015
Preliminary logistic regression results, J. Nathan Matias, Microsoft Research
Published on September 14, 2015
More info about this research: bit.ly/1V7c9i4
Contact: /u/natematias

N = top 52,745 subreddits in terms of June 2015 comments, including NSFW, for subreddits still available on July 2
Comment dataset: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/
List of subreddits "going private": https://www.reddit.com/r/GoldTesting/wiki/amageddon 
Moderator network queried in June 2015, with gap filling in July 2015 and September 2015

==================================================================================================================
                                                                  Dependent variable:                             
                                      ----------------------------------------------------------------------------
                                                                        blackout                                  
                                         (1)        (2)        (3)        (4)        (5)        (6)        (7)    
------------------------------------------------------------------------------------------------------------------
default sub                             3.161***   1.065***   1.070***   0.814**    0.720**    0.693**    0.705**  
                                       (0.294)    (0.305)    (0.317)    (0.336)    (0.337)    (0.337)    (0.339)  
                                                                                                                  
NSFW sub                                0.179*     0.235**    0.268***   0.291***   0.288***   0.314***   0.313*** 
                                       (0.098)    (0.099)    (0.099)    (0.101)    (0.101)    (0.102)    (0.102)  
                                                                                                                  
log(comments in june 2015)                         0.263***   0.268***   0.246***   0.258***   0.256***   0.257*** 
                                                  (0.009)    (0.010)    (0.011)    (0.011)    (0.011)    (0.011)  
                                                                                                                  
moderator count                                               0.066***   0.055***   0.053***   0.051***   0.051*** 
                                                             (0.007)    (0.008)    (0.008)    (0.008)    (0.008)  
                                                                                                                  
log(comments):moderator count                                -0.006***  -0.005***  -0.005***  -0.004***  -0.004*** 
                                                             (0.001)    (0.001)    (0.001)    (0.001)    (0.001)  
                                                                                                                  
log(mod roles in other subs)                                            -0.293***  -0.328***  -0.334***  -0.332*** 
                                                                        (0.033)    (0.033)    (0.033)    (0.033)  
                                                                                                                  
log(mod roles in blackout subs)                                          2.163***   2.134***   2.134***   2.133*** 
                                                                        (0.096)    (0.096)    (0.096)    (0.096)  
                                                                                                                                                                                                                              
log(mod roles in other subs):log(mod roles in blackout subs)            -0.255***  -0.248***  -0.254***  -0.254*** 
                                                                        (0.017)    (0.017)    (0.017)    (0.017)  

log(sub isolation, by comments)                                                    -2.608***  -2.568***  -2.569*** 
                                                                                   (0.347)    (0.345)    (0.345)  
                                                                                                                  
log(metareddit participation per mod in june 2015)                                             0.100***   0.103*** 
                                                                                              (0.036)    (0.036)  
                                                                                                                  
log(mod-specific sub participation per mod in june 2015)                                                 -0.024  
                                                                                                         (0.063)  
                                                                                                                  
Constant                               -3.608***  -4.517***  -4.677***  -4.655***  -4.467***  -4.469***  -4.469*** 
                                       (0.028)    (0.050)    (0.054)    (0.058)    (0.060)    (0.060)    (0.060)  
                                                                                                                  
------------------------------------------------------------------------------------------------------------------
Observations                            52,745     52,745     52,745     52,745     52,745     52,745     52,745  
Log Likelihood                        -6,520.505 -6,171.874 -6,130.725 -5,861.099 -5,806.916 -5,803.188 -5,803.098
Akaike Inf. Crit.                     13,047.010 12,351.750 12,273.450 11,740.200 11,633.830 11,628.380 11,630.200
==================================================================================================================
Note:                                                                                  *p<0.1; **p<0.05; ***p<0.01

 

The network of moderators who moderate blackout subs is the strongest predictor in this model. At a basic level, it makes sense that moderators who participated in the blackout in one subreddit might participate in another. Making sense of this kind of network relationship is a hard problem in network science, and this model doesn’t include time as a dimension, so we don’t consider which subs went dark before which others. If I had data on the time that subreddits went dark, it might be possible to better research this interesting question, like Bogdan State and Lada Adamic did with their paper on the Facebook equality meme.
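To make the modeling workflow concrete, here is a minimal sketch of how a model along the lines of the final column could be fit and compared. It assumes a prepared subreddit-level table with placeholder column names, uses only a subset of the predictors, and relies on ordinary logistic regression from statsmodels rather than the Firth-penalized estimator reported in the table.

```python
# Sketch: nested logistic regressions on subreddit-level predictors of blackout
# participation. Column names are placeholders; the published models use
# Firth-penalized logistic regression (e.g., R's logistf), not plain logit.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("subreddit_covariates.csv")            # hypothetical prepared dataset
df["log_comments"] = np.log(df["june_comments"] + 1)
df["log_isolation"] = np.log(df["in_out_ratio"] + 0.01)

m1 = smf.logit("blackout ~ default_sub + nsfw", data=df).fit()
m_final = smf.logit(
    "blackout ~ default_sub + nsfw + log_comments * mod_count + log_isolation",
    data=df,
).fit()

print(m_final.summary())
print("McFadden pseudo-R2:", m_final.prsquared)          # ~0.12 for the final model above
print("AIC:", round(m1.aic, 1), "->", round(m_final.aic, 1))  # lower AIC = better fit
```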

Hypothesis 1: Default subs were more likely to join the blackout

In interviews, some moderators pointed out that “most of the conversation about the blackout first took place in the default mod irc channel.” Moderators of top subs described the blackout as mostly an issue concerning default or top subreddits.

This hypothesis is supported in the final model. For example, while a non-default subreddit with 4 million monthly comments had a 32.9% chance of joining the blackout (holding all else at their means), a default subreddit of the same size had a 48.6% chance of joining the blackout, on average in the population of subs.
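The percentages quoted here and under the hypotheses below come from converting the model’s log-odds into probabilities. A rough sketch of that conversion follows; the baseline linear predictor is a made-up value chosen to match the 32.9% figure, the 0.705 coefficient comes from the ‘default sub’ row of the final model, and the result will not exactly reproduce the reported 48.6% because the post averages predictions over the population of subs.

```python
# Sketch: inverse-logit conversion from a linear predictor (xB) to a probability.
import math

def logit_to_prob(xb):
    """probability = 1 / (1 + exp(-xB))"""
    return 1.0 / (1.0 + math.exp(-xb))

default_coef = 0.705        # 'default sub' coefficient, final model column
xb_non_default = -0.713     # hypothetical xB for a large non-default sub

print(round(logit_to_prob(xb_non_default), 3))                 # 0.329
print(round(logit_to_prob(xb_non_default + default_coef), 3))  # ~0.50, near the reported 48.6%
```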

Hypothesis 2: Subs with more comment activity were more likely to join the blackout

Moderators of large, non-default subreddits also had plenty of reasons to join the blackout, either because they also shared the need for better moderating tools, or because they had more common contact and sympathy with other moderators as a group.

Even among subreddits that declined to join the blackout, many moderators described feeling obligated to make a decision one way or another. This surprised moderators of large subreddits, who saw it as an issue for larger groups. Size was a key issue in the hundreds of smaller groups that discussed the possibility, with many wondering if they had much in common with larger subs, or whether blacking out their smaller sub would make any kind of dent in reddit’s advertising revenue.

In the final model, larger subs were more likely to join the blackout, a logarithmic relationship that is mediated by the number of moderators. When we set everything else to its mean, we can observe how this looks for subs of different sizes. In the 50th percentile, subreddits with 6 comments per month had a 1.6% chance of joining the blackout — a number that adds up with so many small subs. In the 75th percentile, subs with 46 comments a month had a 2.5% chance of joining the blackout. Subs with 1,000 comments a month had a 5.4% chance of joining, while subs with 100,000 comments a month had a 15.8% chance of joining the blackout, on average in the population of subs, holding all else constant.

Hypothesis 3: NSFW subs were more likely to join the blackout

In interviews, some moderators said that they declined to join the blackout because they saw it as something associated with support for hate speech subreddits taken down by the company in June, or with other parts of reddit they preferred not to be associated with. Default moderators denied this flatly, describing the lengths they went to in order to dissociate from hate speech communities and from sentiment against then-CEO Ellen Pao. Nevertheless, many journalists drew this connection, and moderators were worried that they might become associated with those subs despite their efforts.

Another possibility is that NSFW subs have to do more work to maintain subs that offer high quality NSFW conversations without crossing lines set by reddit and the law. Perhaps NSFW subs just have more work, so they were more likely to see the need for better tools and support from reddit.

In the final model, NSFW subs were more likely to join the blackout than non-NSFW subs. For example, while a non-default, non-NSFW subreddit with 22,800 comments had an 11.4% chance of joining the blackout (holding all else at their means), an NSFW subreddit of the same size had a 15.3% chance of joining the blackout, on average in the population of subs. Among less popular subs, a non-NSFW sub with 1,000 comments per month had a 5.4% chance of joining the blackout, while an NSFW sub of the same size had a 7.5% chance of joining, holding all else constant, on average in the population of subs.

Hypothesis 4: More isolated subs were less likely to join the blackout

In the interviews I conducted, as well as the 90 or so interviews I read on /r/subredditoftheday, moderators often contrasted their communities with “the rest of reddit.” When I asked one moderator of a support-oriented subreddit about the blackout, they mentioned that “a lot of the users didn’t really identify with the rest of reddit.” Subscribers voted against the blackout, describing it as “a movement we didn’t identify with,” this moderator said.

To test hypotheses about more isolated subs, I parsed all comments in every public subreddit in June 2015, generating an “in/out” ratio. This ratio consists of the total comments within the sub divided by the total comments made elsewhere by the sub’s commenters. A subreddit whose users stayed in one sub would have a ratio above 1, while a subreddit whose users commented widely would have a ratio below 1. I tested other measures, such as the average of per-user in/out ratios, but the overall in/out ratio seems the best.
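Here is a minimal sketch of that in/out calculation, assuming a table of (author, subreddit) comment rows with hypothetical file and column names; it illustrates the definition above rather than reproducing the exact script.

```python
# Sketch: a subreddit's "in/out" ratio = comments inside the sub divided by the
# comments its commenters made elsewhere on reddit in the same month.
import pandas as pd

comments = pd.read_csv("comments_2015_06.csv", usecols=["author", "subreddit"])
per_user_total = comments.groupby("author").size()     # all June comments per user

def in_out_ratio(sub):
    inside = comments[comments["subreddit"] == sub]
    users = inside["author"].unique()
    in_count = len(inside)
    outside = per_user_total.loc[users].sum() - in_count
    return in_count / outside if outside > 0 else float("inf")

# Ratios well below 1 mean a sub's commenters mostly post elsewhere;
# ratios above 1 mean they mostly stay inside the sub.
```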

In the final model, more isolated subs were less likely to join the blackout, on a logarithmic scale. Most subreddits’ commenters participate actively elsewhere on reddit, at a mean in/out ratio of 0.24. That means that on average, a subreddit’s participants make 4 times more comments outside a sub than within it. At this level, holding everything else at their means, a subreddit with 1,000 comments a month had a 4.0% chance of joining the blackout. A similarly-sized subreddit whose users made half of their comments within the sub (in/out ratio of 1.0) had just a 1% chance of joining the blackout. Very isolated subs whose users commented twice as much in-sub had a 0.3% chance of joining the blackout, on average in the population of subs, holding all else constant.

Hypothesis 5: Subs with more moderators were more likely to join the blackout

This one was my hypothesis, based on a variety of interview details. Subs with more moderators tend to have more complex arrangements for moderating and tend to encounter limitations in mod tools. Subs with more mods also have more people around, so their chances of spotting the blackout in time to participate were also probably higher. On the other hand, subs with more activity tend to have more moderators, so it’s important to control for the relationship between mod count and sub activity.

In the final model, subs with more moderators were slightly more likely to join the blackout. The relationship is very small, and it is mediated by the number of comments. For a sub with 1,000 comments per month, with everything else at its average, a subreddit with 3 moderators (the average) had a 5.4% chance of joining the blackout, while a subreddit with 8 moderators had a 6% chance of joining the blackout, on average in the population of subs.

Hypothesis 6: Subs with admins as mods were more (or less) likely to join the blackout

I heard several theories about admins. During the blackout, some redditors claimed that admins were preventing subs from going private. In interviews, moderators tended to voice the opposite opinion. They argued that subs with admin contact were joining the blackout in order to send a message to the company, urging it to pay more attention to employees who advocated for moderator interests. Moderators at smaller subs said, “we felt 100% independent from admin assistance so it really wasn’t our fight.”

None of my hypothesis tests showed any statistically significant relationship between current or past admin roles as moderators and participation in the blackout, either way. For that reason, I omit it from my final model.

Hypothesis 7: Subs with moderators who moderated other subs were more likely to join the blackout

I’ve been wondering if moderators with multiple mod roles elsewhere on reddit would be more likely to join the blackout, perhaps because they had greater “solidarity” with other subreddits, or because they were more likely to find out about the blackout.

In the final model, the reverse is supported. Subs that shared moderators with other subs were actually less likely to join the blackout, a relationship that is mediated by the number of moderators who also modded blackout subs. Holding blackout sub participation constant, a sub of 1,000 comments per month and 3 moderator roles shared with other subs had a 5.7% chance of joining the blackout, while a more connected sub with 6 shared moderator roles (in the 4th quantile) had a 4.2% chance of joining the blackout, on average in the population of subs, holding all else constant.

Hypothesis 8: Subreddits with mods who also moderate other blackout subs were more likely to join the blackout.

This hypothesis is also a carry-over from my previous analysis, where I found a statistically-significant relationship. Note that making sense of this kind of network relationship is a hard problem in network science, and that we can’t use this to test “influence.”

In the final model, subreddits with mods with roles in other blackout subs were more likely to join the blackout, a relationship on a log scale that is mediated by the number of moderator roles shared with other subs more generally. 19% of subs in the sample share at least one moderator with a blackout sub, after removing moderator bots. A sub with 1,000 comments per month that didn’t have any overlapping moderators with blackout subs had a 3.2% chance of joining the blackout, while a sub with one overlapping moderator had an 11.1% chance of joining, and a sub with 2 overlapping moderators had a 21.1% chance of joining. A sub with 6 moderators overlapping with blackout subs had a 57.2% chance of joining the blackout.

I tend to see the network of co-moderation as a control variable. We can expect that moderators who joined the blackout would be likely to support it across the many subs they moderate. By accounting for this in the model, we get a clearer picture on the other factors that were important.

Hypothesis 9: Subs with moderators who participate in metareddits were more likely to join the blackout

In interviews, several moderators described learning about the blackout from “meta-reddits” which cover major events on the site, and which mostly stayed up during the blackout. Just like we might expect more isolated subs to stay out of the blackout, we might expect moderators who get involved in reddit-wide meta-discussion to join the blackout. I took my list of metareddits from this TheoryOfReddit wiki post.

In the final model, subs with moderators who participate in metareddits were more likely to join the blackout, on a logarithmic scale. Most moderators on the site do not participate in metareddits. A sub of 1,000 comments per month with no metareddit participation by its moderators had a 5.3% chance of joining the blackout, while a similar sub whose moderators made 5 comments on any metareddit per month had a 6.3% chance of joining the blackout.

Hypothesis 10: Subs with mods participating in moderator-focused subs were more likely to join the blackout

Although key moderator subs like /r/defaultmods and /r/modtalk are private and inaccessible to me, I could still test a “solidarity” theory. Perhaps moderators who participate in mod-specific subs, who have helped and been helped by other mods, would be more likely to join the blackout?

Although this predictor is significant in a single-covariate model, when you account for all of the other factors, mod participation in moderator-focused subs is not a significant predictor of participation in the blackout.

This surprises me. I wonder: since moderator-specific subs tend to have low volume, one month of comments may just not be enough to get a good sense of which moderators participate in those subs. Also, this dataset doesn’t include IRC discussions (nor will it ever), where moderators seem mostly to hang out with and help each other. But from the evidence I have, it looks like help from moderator-focused subs played no part in swaying moderators to join the blackout.

So, how DID solidarity develop in the blackout?

The question is still open, but from these statistical models, it seems clear that factors beyond moderator workload had a big role to play, even when controlling for mods of multiple subs that joined the blackout.

In further analysis in the next week, I’m hoping to include:

  • Activity by mods in each sub (comments, deletions)
  • Comment karma, as another measure of activity (still making sense of the numbers to see if they are useful here)
  • The complexity of the subreddit, as measured by things in the sidebar (possibly)

Building Statistical Models of Online Behavior through Qualitative Research

The process of collaborating with redditors on my statistical models has been wonderful. As I continue this work, I’m starting to think more and more about the idea of participatory hypothesis testing, in parallel with work we do at MIT around Freire-inflected practices of “popular data”. If you’ve seen other examples of this kind of thing, do send them my way!

The Facebook “It’s Not Our Fault” Study

Today in Science, members of the Facebook data science team released a provocative study about adult Facebook users in the US “who volunteer their ideological affiliation in their profile.” The study “quantified the extent to which individuals encounter comparatively more or less diverse” hard news “while interacting via Facebook’s algorithmically ranked News Feed.”*

  • The research found that the user’s click rate on hard news is affected by the positioning of the content on the page by the filtering algorithm. The same link placed at the top of the feed is about 10-15% more likely to get a click than a link at position #40 (figure S5).
  • The Facebook news feed curation algorithm, “based on many factors,” removes hard news from diverse sources that you are less likely to agree with but it does not remove the hard news that you are likely to agree with (S7). They call news from a source you are less likely to agree with “cross-cutting.”*
  • The study then found that the algorithm filters out 1 in 20 cross-cutting hard news stories that a self-identified conservative sees (or 5%) and 1 in 13 cross-cutting hard news stories that a self-identified liberal sees (8%).
  • Finally, the research then showed that “individuals’ choices about what to consume” further limits their “exposure to cross-cutting content.” Conservatives will click on a little less than 30% of cross-cutting hard news, while liberals will click on a little more than 20% (figure 3).

My interpretation in three sentences:

  1. We would expect that people who are given the choice of what news they want to read will select sources they tend to agree with–more choice leads to more selectivity and polarization in news sources.
  2. Increasing political polarization is normatively a bad thing.
  3. Selectivity and polarization are happening on Facebook, and the news feed curation algorithm acts to modestly accelerate selectivity and polarization.

I think this should not be hugely surprising. For example, what else would a good filter algorithm be doing other than filtering for what it thinks you will like?

But what’s really provocative about this research is the unusual framing. This may go down in history as the “it’s not our fault” study.

Facebook: It’s not our fault.

I carefully wrote the above based on my interpretation of the results. Now that I’ve got that off my chest, let me tell you about how the Facebook data science team interprets these results. To start, my assumption was that news polarization is bad.  But the end of the Facebook study says:

“we do not pass judgment on the normative value of cross-cutting exposure”

This is strange, because there is a wide consensus that exposure to diverse news sources is foundational to democracy. Scholarly research about social media has–almost universally–expressed concern about the dangers of increasing selectivity and polarization. But it may be that you do not want to say that polarization is bad when you have just found that your own product increases it. (Modestly.)

And the sources cited just after this quote sure do say that exposure to diverse news sources is important. But the Facebook authors write:

“though normative scholars often argue that exposure to a diverse ‘marketplace of ideas’ is key to a healthy democracy (25), a number of studies find that exposure to cross-cutting viewpoints is associated with lower levels of political participation (22, 26, 27).”

So the authors present reduced exposure to diverse news as a “could be good, could be bad” but that’s just not fair. It’s just “bad.” There is no gang of political scientists arguing against exposure to diverse news sources.**

The Facebook study says it is important because:

“our work suggests that individuals are exposed to more cross-cutting discourse in social media than they would be under the digital reality envisioned by some”

Why so defensive? If you look at what is cited here, this quote is saying that this study showed that Facebook is better than a speculative dystopian future.*** Yet the people referred to by this word “some” didn’t provide any sort of point estimates that were meant to allow specific comparisons. On the subject of comparisons, the study goes on to say that:

“we conclusively establish that…individual choices more than algorithms limit exposure to attitude-challenging content.”

“compared to algorithmic ranking, individuals’ choices about what to consume had a stronger effect”

Alarm bells are ringing for me. The tobacco industry might once have funded a study that says that smoking is less dangerous than coal mining, but here we have a study about coal miners smoking. Probably while they are in the coal mine. What I mean to say is that there is no scenario in which “user choices” vs. “the algorithm” can be traded off, because they happen together (Fig. 3 [top]). Users select from what the algorithm already filtered for them. It is a sequence.**** I think the proper statement about these two things is that they’re both bad — they both increase polarization and selectivity. As I said above, the algorithm appears to modestly increase the selectivity of users.

The only reason I can think of that the study is framed this way is as a kind of alibi. Facebook is saying: It’s not our fault! You do it too!

Are we the 4%?

In my summary at the top of this post, I wrote that the study was about people “who volunteer their ideological affiliation in their profile.” But the study also describes itself by saying:

“we utilize a large, comprehensive dataset from Facebook.”

“we examined how 10.1 million U.S. Facebook users interact”

These statements may be factually correct but I found them to be misleading. At first, I read this quickly and I took this to mean that out of the at least 200 million Americans who have used Facebook, the researchers selected a “large” sample that was representative of Facebook users, although this would not be representative of the US population. The “limitations” section discusses the demographics of “Facebook’s users,” as would be the normal thing to do if they were sampled. There is no information about the selection procedure in the article itself.

Instead, after reading down in the appendices, I realized that “comprehensive” refers to the survey research concept: “complete,” meaning that this was a non-probability, non-representative sample that included everyone on the Facebook platform. But out of hundreds of millions, we ended up with a study of 10.1m because users were excluded unless they met these four criteria:

  1. “18 or older”
  2. “log in at least 4/7 days per week”
  3. “have interacted with at least one link shared on Facebook that we classified as hard news”
  4. “self-report their ideological affiliation” in a way that was “interpretable”

That #4 is very significant. Who reports their ideological affiliation on their profile?

[Image: the profile field prompting users to “add your political views”]

It turns out that only 9% of Facebook users do that. Of those that report an affiliation, only 46% reported an affiliation in a way that was “interpretable.” That means this is a study about the 4% of Facebook users unusual enough to want to tell people their political affiliation on the profile page. That is a rare behavior.
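For anyone who wants the multiplication spelled out, the two percentages above combine like this:

```python
# Rough arithmetic behind the "4%" figure, using the percentages quoted above.
report_affiliation = 0.09   # share of users who list political views at all
interpretable = 0.46        # share of those whose listed views could be classified
print(f"{report_affiliation * interpretable:.1%} of users were eligible")   # ~4.1%
```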

More important than the frequency, though, is the fact that this selection procedure confounds the findings. We would expect the small minority who publicly identify an interpretable political orientation to behave quite differently from the average person when it comes to consuming ideological political news. The research claims just don’t stand up against the selection procedure.

But the study is at pains to argue that (italics mine):

“we conclusively establish that on average in the context of Facebook, individual choices more than algorithms limit exposure to attitude-challenging content.”

The italicized portion is incorrect because the appendices explain that this is actually a study of a specific, unusual group of Facebook users. The study is designed in such a way that the selection for inclusion in the study is related to the results. (“Conclusively” therefore also feels out of place.)
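A toy simulation makes the confound easier to see. Every number below is invented for the sake of the example and is not the study’s data; the sketch simply assumes that the minority who declare an interpretable ideology click cross-cutting stories less often than everyone else, and shows how an average computed only on that minority misdescribes the platform’s users as a whole:

```python
import random

random.seed(1)

def make_user():
    """Hypothetical user: about 4% declare an ideology on their profile, and
    (by assumption, purely for illustration) those users click cross-cutting
    stories less often than non-declaring users."""
    declares = random.random() < 0.04
    click_cross_rate = 0.35 if declares else 0.65
    return {"declares": declares, "click_cross_rate": click_cross_rate}

users = [make_user() for _ in range(100_000)]
declared_only = [u for u in users if u["declares"]]

def mean_rate(group):
    return sum(u["click_cross_rate"] for u in group) / len(group)

print(f"declared-ideology users only: {mean_rate(declared_only):.2f}")
print(f"all users:                    {mean_rate(users):.2f}")
```

Any sentence of the form “on average in the context of Facebook” is being read off the first number while sounding as if it describes the second.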

Algorithmium: A Natural Element?

Last year there was a tremendous controversy about Facebook’s manipulation of the news feed for research. In the fracas, one of that study’s co-authors revealed that, judging from the feedback received after the event, many people didn’t realize the Facebook news feed was filtered at all. We also recently presented research with similar findings.

I mention this because when the study states it is about selection of content, who does the selecting matters. There is no sense in this study that a user who chooses something is fundamentally different from the algorithm hiding something from them. While the filtering algorithm is in fact driven by user choices (among other things), users don’t understand the relationship their choices have to the outcome.

[Image: meme captioned “not sure if i hate facebook or everyone i know”]

In other words, the article’s strange comparison between “individuals’ choices” and “the algorithm” should be read as “things I choose to do” vs. the effect of “a process Facebook has designed without my knowledge or understanding.” Again, they can’t be compared in the way the article proposes because they aren’t equivalent.

I struggled with the framing of the article because the research talks about “the algorithm” as though it were an element of nature, or a naturally occurring process like convection or mitosis. There is also no sense that it changes over time or that it could be changed intentionally to support a different scenario.*****

Facebook is a private corporation with a terrible public relations problem. It is periodically rated one of the least popular companies in existence. It is currently facing serious government investigations into illegal practices in many countries, some of which stem from the manipulation of its news feed algorithm. In this context, I have to say that it doesn’t seem wise for these Facebook researchers to have spun these data so hard in this direction, which I would summarize as: the algorithm is less selective and less polarizing. Particularly when the research finding in their own study is actually that the Facebook algorithm is modestly more selective and more polarizing than living your life without it.

Update: (6pm Eastern)

Wow, if you think I was critical, have a look at these. It turns out I am the moderate one.

Eszter Hargittai from Northwestern posted on Crooked Timber that we should “stop being mesmerized by large numbers and go back to taking the fundamentals of social science seriously.” And (my favorite): “I thought Science was a serious peer-reviewed publication.”

Nathan Jurgenson from Maryland and Snapchat wrote on Cyborgology (“in a fury”) that Facebook is intentionally “evading” its own role in the production of the news feed. “Facebook cannot take its own role in news seriously.” He accuses the authors of using the “Big-N trick” to intentionally distract from methodological shortcomings. He tweeted that “we need to discuss how very poor corporate big data research gets fast tracked into being published.”

Zeynep Tufekci from UNC wrote on Medium that “I cannot remember a worse apples to oranges comparison” and that the key take-away from the study is actually the ordering effects of the algorithm (which I did not address in this post). “Newsfeed placement is a profoundly powerful gatekeeper for click-through rates.”

Update: (5/10)

A comment helpfully pointed out that I used the wrong percentages in my fourth point when summarizing the piece. Fixed it, with changes marked.

Update: (5/15)

It’s now one week since the Science study. This post has now been cited/linked in The New York Times, Fortune, Time, Wired, Ars Technica, Fast Company, Engadget, and maybe even a few more. I am still getting emails. The conversation has fixated on the <4% sample, often saying something like: “So, Facebook said this was a study about cars, but it was actually only about blue cars.” That’s fine, but the other point in my post is about what is being claimed at all, no matter the sample.

I thought my “coal mine” metaphor about the algorithm would land, but it hasn’t always worked. So I’ve clamped my webcam to my desk lamp and recorded a four-minute video to explain it again, this time with a drawing.******

If the coal mine metaphor failed me, what would be a better metaphor? I’m not sure. Suggestions?

 

 

Notes:

* Diversity in hard news, in their study, would be a self-identified liberal who receives a story from FoxNews.com, or a self-identified conservative who receives one from HuffingtonPost.com, where the stories are about “national news, politics, [or] world affairs.” In more precise terms, “cross-cutting content” for a given user was defined as stories that are more likely to be shared by partisans who do not share that user’s self-identified ideological affiliation.
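As a rough sketch of that kind of rule (my own simplification for illustration, not the paper’s actual alignment score), a story might be flagged as cross-cutting for a reader when it is shared more by partisans of the other side than by partisans of the reader’s own side:

```python
def is_cross_cutting(share_counts, reader_affiliation):
    """Illustrative simplification: a story counts as cross-cutting for a reader
    if partisans of other affiliations shared it more than the reader's own side.
    share_counts: e.g. {"liberal": 120, "conservative": 480}"""
    own = share_counts.get(reader_affiliation, 0)
    other = sum(n for side, n in share_counts.items() if side != reader_affiliation)
    return other > own

# A self-identified liberal receiving a story shared mostly by conservatives:
print(is_cross_cutting({"liberal": 120, "conservative": 480}, "liberal"))   # True
```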

** I don’t want to make this even more nitpicky, so I’ll put this in a footnote. The paper’s citation of Mutz and of Huckfeldt et al. as support for the claim that “exposure to cross-cutting viewpoints is associated with lower levels of political participation” is just bizarre. I hope it is a typo. These authors don’t advocate against exposure to cross-cutting viewpoints.

*** Perhaps this could be a new Facebook motto used in advertising: “Facebook: Better than one speculative dystopian future!”

**** In fact, algorithm and user form a coupled system of at least two feedback loops. But that’s not helpful to measure “amount” in the way the study wants to, so I’ll just tuck it away down here.

***** Facebook is behind the algorithm, but it is trying to have research about that algorithm peer-reviewed without disclosing how the algorithm works, even though the algorithm is a key part of the study. There is also no way to reproduce the research (or do a second study on a primary phenomenon under study, the algorithm) without access to the Facebook platform.

****** In this video, I intentionally conflate (1) the number of posts filtered and (2) the magnitude of the bias of the filtering. I did so because the difficulty with the comparison works the same way for both, and I was trying to make the example simpler. Thanks to Cedric Langbort for pointing out that “baseline error” is the clearest way of explaining this.

(This was cross-posted to multicast and Wired.)

Why We Like Pinterest for Fieldwork

(written up with Nikki Usher, GWU)

Anyone tackling fieldwork these days can choose from a wide selection of digital tools to put in their methodological toolkit.  Among the best of these tools are platforms that let you archive, analyze, and disseminate at the same time.  It used to be that these were fairly distinct stages of research, especially for the most positivist among us.  You came up with research questions, chose a field site, entered the field site, left the field site, analyzed your findings, got them published, and shared your research output with friends and colleagues.

But the post-positivist approach that many of us like involves adapting your research questions—reflexively and responsively—while doing fieldwork.  Entering and leaving your field site is not a cool, clean and complete process.  We analyze findings as we go, and involve our research subjects in the analysis.  We publish, but often in journals or books that can’t reproduce the myriad digital artifacts that are meaningful in network ethnography.  Actor network theory, activity theory, science and technology studies and several other modes of social and humanistic inquiry approach research as something that involves both people and devices. (Yes yes we know but these wikipedia entries aren’t bad.) Moreover, the dissemination of work doesn’t have to be something that happens after publication or even at the end of a research plan.

Nikki’s work involves qualitative ethnographic work at field sites where research can last anywhere from five months to a brief week-long visit to a quick drop-in day. She learned the hard way from her research for Making News at The New York Times that failing to find a good way to organize and capture images was a missed opportunity post-data collection. Since then, Nikki’s been using Pinterest for fieldwork image gathering quite a bit.  Phil’s work on The Managed Citizen was set back when he lost two weeks of field notes on the chaotic floor of the Republican National Convention in 2000 (security incinerates all the detritus left by convention goers).  He’s been digitizing field observations ever since.

Some people put together personal websites about their research journey.  Some share over Twitter.  And there are plenty of beta tools, open source or otherwise, that people play with.  We’ve both enjoyed using Pinterest for our research projects.  Here are some points on how we use it and why we like it.

How To Use It

  1. When you start, think of this as your research tool and your resource.   If you dedicate yourself to this as your primary archiving system for digital artifacts you are more likely to build it up over time.  If you think of this as a social media publicity gimmick for your research, you’ll eventually lose interest and it is less likely to be useful for anyone else.
  2. Integrate it with your mobile phone because this amps up your capacity for portable, taggable, image data collection.
  3. Link the board posts to Twitter or your other social media feeds.  Pinterest itself isn’t that lively a place for researchers yet.  The people who want to visit your Pinterest page are probably actively following your activities on other platforms so be sure to let content flow across platforms.
  4. Pin lots of things, and lots of different kinds of things.  Include decent captions though be aware that if you are feeding Twitter you need to fit character limits.
  5. Use it to collect images you have found online, images you’ve taken yourself during your fieldwork, and invite the communities you are working with to contribute.
  6. Backup and export things once in a while for safekeeping.  There is no built-in export function, but there are a wide variety of hacks and workarounds for transporting your archive; one minimal example is sketched below.
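One minimal, admittedly low-tech workaround we can sketch here (all file and folder names are hypothetical; adapt them to your own setup) is to keep a plain text file of the image URLs you have pinned, assembled by hand or by whatever scraping route you prefer, and periodically pull them down into a dated folder:

```python
# Hypothetical backup sketch: download every image URL listed in pins.txt
# into a dated folder so the archive outlives the platform.
import datetime
import os
import urllib.request

URL_LIST = "pins.txt"   # one image URL per line; assemble this however you like
backup_dir = datetime.date.today().strftime("pinterest-backup-%Y-%m-%d")
os.makedirs(backup_dir, exist_ok=True)

with open(URL_LIST) as f:
    urls = [line.strip() for line in f if line.strip()]

for i, url in enumerate(urls):
    name = os.path.basename(url.split("?")[0]) or f"pin-{i}.jpg"
    path = os.path.join(backup_dir, f"{i:04d}_{name}")
    urllib.request.urlretrieve(url, path)
    print("saved", path)
```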

What You Get

  1. Pinterest makes it easy to track the progress of the image data you gather.  You may find yourself taking more photos in the field because they can be easily arranged, saved and categorized.
  2. Using it regularly adds another layer of data: photos and documents captured on a phone and then added to Pinterest can be quickly captioned in the field and re-catalogued, giving you a chance to review the visual and built environment of your field site and interrogate your observations afresh.
  3. Visually-enhanced constant comparative methods: post-data collection, you can go beyond notes to images and captions that are easily scanned for patterns and points of divergence. This may be going far beyond what Glaser and Strauss had imagined, of course.
  4. Perhaps most important, when you forget what something looks like when you’re writing up your results, you’ve got an instant, easily searchable database of images and clues to refresh your memory.

Why We Like It

  1. It’s great for spontaneous presentations.  Images are such an important part of presenting any research.  Having a quick publicly accessible archive of content allows you to speak, on the fly, about what you are up to.  You can’t give a tour of your Pinterest page for a job talk.  But having the resource there means you can call on images quickly during a Q&A period, or quickly load something relevant on a phone or browser during a casual conversation about your work.
  2. It gives you a way to interact with subjects.  Having the Pinterest link allows you to show a potential research subject what you are up to and what you are interested in.  During interviews it allows you to engage people on their interpretation of things.  Having visual prompts handy can enrich and enliven any focus group or single subject interview.  These don’t only prompt further conversation, they can prompt subjects to give you even more links, images, videos and other digital artifacts.
  3. It makes your research interests transparent. Having the images, videos and artifacts for anyone to see is a way for us to show what we are doing.  Anyone with interest in the project and the board link is privy to our research goals. Our Pinterest page may be far less complicated than many of our other efforts to explain our work to a general audience.
  4. You can disseminate as you go.  If you get the content flow right, you can tell people about your research as you are doing it.  Letting people know about what you are working on is always a good career strategy.  Giving people images rather than article abstracts and draft chapters gives them something to visualize and improves the ambient contact with your research community.
  5. It makes digital artifacts more permanent. As long as you keep your Pinterest, what you have gathered can become a stable resource for anyone interested in your subjects. As sites and material artifacts change, what you have gathered offers a permanent and easily accessible snapshot of a particular moment of inquiry for posterity.

Pinterest Wish-list

One of us is a Windows Phone user (yes really) and it would be great if there were a real Pinterest app for the Windows Phone. One-touch integration from the iPhone camera roll, much like Twitter, Facebook, and Flickr offer, would be great (though there is an easy hack).

We wish it would be easier to have open, collaborative boards. Right now, the only person who can add to a board is you, at least at first.  You can invite other people to join a “group board” via email, but Pinterest does not have open boards that allow anyone with a board link to add content.

Here’s a look at our Pinboards: Phil Howard’s Tech + Politics board, and Nikki Usher’s boards on U.S. Newspapers.  We welcome your thoughts…and send us images!

 

Legal Portraits of Web Users

This summer I became very interested in what I think I will be calling “legal portraits of digital subjects” or something similar. I came to this while doing a study on MOOCs with SMC. The title of the project is “Students as End Users in the MOOC Ecology” (the talk is available online).  In the project I am looking at what the Big 3 MOOC companies are saying publicly about the “student” and “learner” role and comparing that to how the same subject is legally constituted, in order to understand the cultural implications of turning students into “end users”.

As I was working through this project, and thinking of implications outside of MOOCs and Higher Ed, I realized these legal portraits are constantly being painted in digital environments. As users of the web/internet/digital tools we are constantly in the process of accepting various clickwrap and browse-wrap agreements without thinking twice about it, because it has become a standard cultural practice.

In writing this post I’ve already entered numerous binding legal agreements. Here are some of the institutions that have terms I am to follow:

  1. Internet Service Provider
  2. Web Browser
  3. Document Hosting Service (I wrote this in the cloud somewhere else first)
  4. Blog Hosting Company
  5. Blog Platform
  6. Various Companies I’ve Accepted Cookies From
  7. Social Media Sites

I’ve gone through and read some of the Terms (some of them I cannot find). I’ve allowed for the licensing and reproduction of this work in multiple places without even thinking twice about it.  We talk a lot about privacy concerns.  We know that by producing things like blog posts or status updates we are agreeing to being surveilled to various degrees.  But I’d love to start a broader conversation about the effects, beyond privacy, of agreeing to a multitude of Terms simply by logging on and opening a browser.

Why Digital Inequality Scholarship Needs Ethnography

Digital inequality scholarship is well-intentioned. It debunks myths about digital media’s inherent egalitarianism and draws attention to the digital dimensions of social inequalities. Digital inequality scholars have shown, for example, that people with access to networked media use those technologies in different ways, some of which are thought to be more beneficial than others. They have highlighted how differences in skills and quality of access shape use. And they have rightly attacked the stereotype of the digital generation. These are important contributions for which we should be grateful.

Yet digital inequality scholarship is also limited in some fundamental, and I believe hazardous, ways. To defend these claims, I will draw on an in-depth ethnographic study of an ambitious attempt to combat digital inequality: a new, well-resourced, and highly touted public middle school in Manhattan that fashions itself as “a school for digital kids.” It is hard to imagine a more concerted attempt to combat digital inequality, and yet the school paradoxically helped perpetuate many of the very social divisions it hoped to mend. In-depth ethnographic studies can help us understand these outcomes, and they can provide us with tools for forming more accurate conceptions of relations between digital media and social inequalities.

Recruitment flier for the Downtown School

I will call this school, which opened in the fall of 2009, the Downtown School for Design, Media and Technology, or the Downtown School for short. Supported by major philanthropic foundations, and designed by leading scholars and practitioners from the learning sciences as well as media technology design, the Downtown School braided digital media practices, and especially media production activities, throughout its curriculum. It had enviable financial, technological, and intellectual resources, and it recruited an atypically diverse student body for a New York City public school. About half the students came from privileged families where at least one parent worked in a professional field and held an advanced degree. And about 40 percent of students came from less-privileged families that qualified for free or reduced-price lunch; these parents and guardians often had some or no college education and worked in comparatively low-paying service jobs. All students took a required game design course, and the school’s entire suite of after-school programs was devoted to making, hacking, remixing, and designing media technology.

Digital inequality scholarship played a role in the formation of the Downtown School and similar interventions. Concepts such as the digital divide, the “participation gap” (Jenkins et al. 2006), the “digital production gap” (Schradie 2011), and the “participation divide” (Hargittai and Walejko 2008) implicitly, if not explicitly, recommend and legitimate interventions such as the Downtown School. Since digital inequality scholars argue that skill differentials play a large role in producing digital inequalities, educational practitioners understandably craft interventions to reduce these differentials.

Students in the Downtown School’s required game design course.

According to such a framework, the Downtown School was successful in many ways. Both boy and girl students from diverse economic and ethnic backgrounds learned to use digital media in new ways. In particular, students learned to use digital tools to be producers, rather than just consumers, of digital media. Through the lens of concepts like the “participation gap,” the school appears successful and should be quickly replicated.

The problem though – and here is why we need ethnography – is that while the Downtown School arguably helped close the participation gap, it also helped perpetuate historical social divisions, especially those rooted in gender and racialized social class. When the Downtown School opened, it attracted three boys for every two girls; three years after opening, the ratio rose to two-to-one. Only one girl student regularly participated in the school’s after-school programs focused on media production; most regular participants were boys from privileged families. By the end of the first year, all of the economically less-privileged boys in one of the school’s main cliques had left the school for larger, less-resourced schools that had a greater diversity of curricular and extra-curricular offerings as well as more of a dating scene. By the end of the second year, many of the less-privileged girls from another of the main cliques had also left the school. While their reasons for leaving were complex, they and their families suggested in interviews with me that the Downtown School was not a ‘good fit.’ By contrast, nearly all of the privileged students remained enrolled, and many of their parents were enthusiastic boosters for the school.

Why were many students, and especially many of the less-privileged students, unable or unwilling to take advantage of the purportedly beneficial opportunities afforded by the Downtown School? Digital inequality frameworks do not provide a satisfying way to answer this question. They do not see many of the factors that matter to people in different situations, nor the nexus of conditions and forces that shape what people do, and do not do, with and without digital media. Ethnography, in contrast, casts a much wider net that can help account for these conditions and processes. A few more examples will help clarify this point.

On the ground, I observed and documented what students were doing when they were not taking advantage of the school’s purportedly beneficial activities. It turned out that most of the students spent their afternoon hours in familiar activities that predate the digital age: basketball practices, music lessons, swimming classes, learning a foreign language, dance classes, taking care of siblings and cousins, chores, and so forth. These activities meant a lot to students and their families, and many expressed a desire for the school to offer more diverse curricular and extra-curricular offerings. These activities were also integral to how students navigated and negotiated identity and difference with their peers at school (Sims 2014). This wider ecology of practices, as well as what participation and non-participation meant for those involved, would be invisible if one were to study the Downtown School using the digital inequality framework. And what a digital inequality approach would have captured and championed would have mostly reflected the interests and practices of those who were most privileged.

One can still argue that social scientists, policy makers, and educational practitioners should do all that they can to close digital inequalities such as the participation gap. One can argue that doing so is in the best interest of those currently on the wrong side of the chasm. One can argue that treating digital inequalities is akin to dealing with a public health concern, or, more aptly, that it should be folded into broader efforts to mandate STEM education amongst all contemporary school children. In short, digital inequality scholars can admit that there is a prescriptive character to their efforts and that treatment is justified because it is in the best interest of the public as well as those being treated.

This is a debate that can be had but it is not the debate that digital inequality scholars are currently having. In its current form, the digital inequality debate escapes these issues because it assumes that certain decontextualized “uses” will be universally appealing to people once barriers to participation – lack of quality access, skills, etc. – are removed. There is a sort of technology-focused ethnocentrism to these assumptions that prevents this potentially uncomfortable debate from ever taking place. If digital inequality scholars were to acknowledge the prescriptive character of their scholarship, a host of thorny ethical dilemmas would quickly surface: To what degree should social scientists, policy-makers, and educational practitioners force people to partake in participatory culture? To what extent do the ends justify the means? What exercises of power are legitimate? What liberties should be granted to those identified for treatment? And so on.

These are difficult questions, and my guess is that most digital inequality scholars do not want to address them. My own feeling is that scholars should be extremely cautious in pushing for such treatments, whether domestically or abroad, even if they feel that their medicine would be in the best interest of the treated. The histories of various missionary and colonial endeavors – to name just a few charged examples – make the ethical and political hazards of such an enterprise all too clear.

Note: This was originally posted at Ethnography Matters.

References

Hargittai, E. & Walejko, G., 2008. The Participation Divide: Content Creation and Sharing in the Digital Age. Information, Communication & Society, 11(2), pp.239–256.

Jenkins, H. et al., 2006. Confronting the Challenges of Participatory Culture: Media Education for the 21st Century, Chicago, IL: The John D. and Catherine T. MacArthur Foundation.

Schradie, J., 2011. The Digital Production Gap: The Digital Divide and Web 2.0 Collide. Poetics, 39(2), pp.145–168.

Sims, C., 2012. The Cutting Edge of Fun: Making Work Play at the New American School. University of California, Berkeley. http://www.ischool.berkeley.edu/files/sims_2012_cuttingedgeoffun.pdf

Sims, C., 2014 (forthcoming). From Differentiated Use to Differentiating Practices: Negotiating Legitimate Participation and the Production of Privileged Identities. Information, Communication & Society. http://www.tandfonline.com/doi/full/10.1080/1369118X.2013.808363

The life and death of our research data

At the 2012 iConference, I sat in on a fishbowl about human values and data collection.  Hearing a vibrant discussion about research ethics related to the life of data was actually incredibly timely for me, in that lately I’ve been thinking a lot about the ethics of data gathering.  In particular, I recently came across this research project while perusing a blog on body modification.  Spearheaded by the Centre for Anatomy and Human Identification (CAHID) at the University of Dundee, Scotland, UK, the project intends to collect “images of body modifications to establish a database which may aid in the identification of victims and missing persons, for example in a disaster. By collecting a large number of images of tattoos, piercings and other body modifications, not only can we develop a more uniform way of describing those modifications but also establish how individualistic certain body modifications are within a population, social group or age group.”  Essentially, people with body modifications are being asked to submit images of their modifications as well as some personal information in order to generate statistical measures for the prevalence of various body modifications.  In the blog post I read, the researcher emphasizes that “none of the images will be used for policing purposes simply because we don’t have permission to do so.”  Presumably, the researcher felt it was important to emphasize this because one of the partners in the project is Interpol.  Interestingly, in Interpol’s description of the project, there is no explicit mention of the fact that data will not be used to assist law enforcement.

During the conference fishbowl, I raised this project as a case study for thinking about ethical tensions surrounding informed consent, risk/benefit analysis, and the preservation of data gathered in social science research.  My main question centers on how we explain the issues of data privacy to participants.  I don’t mean this in a pedantic way, where researchers are instructing hapless laypeople on the complexities of data curation.  I mean, how do we balance a need to gather data from people with a concern for the life of that data?  Can these researchers ensure that the information provided by participants won’t be used for purposes other than identifying bodies after a disaster? If the researchers conclude their involvement with a project, what influence do they have over the database they’ve created and the parties who have access to that database?  IRB forms typically ensure that researchers outline how they will manage the destruction of data and require consent forms to address issues of privacy.  The statement that the researchers are prohibited from using the data for policing because they don’t ask for that kind of consent from participants does little to quell my concerns about asking for personal data (and, more to the point for me, documentation of bodies) which could then be used in nefarious ways by an international body of policing.

To be fair, I’ve relied on the body modification community to conduct research on secrecy and stigmatized behavior, and even with consent forms and explanations of privacy issues I can’t guarantee that all of my participants had thought through every possible contingency of sharing information with me.  Yet to me, there is a qualitative difference between asking participants to share personal experiences with body modification and creating a database of images that is then shared with an agency like Interpol.

My objective isn’t to slam this research project as ethically vacuous.  My objective is to think about this research project as a case that illustrates concerns I have for privacy in the collection of mass information.  Last fall, danah boyd and Kate Crawford wrote a terrific piece on provocations for big data and addressed ethical issues of large data sets.  In addition to their concerns about the ethics of gathering and analyzing “public” data from Facebook or Twitter, boyd and Crawford ask, “Should someone be included as a part of a large aggregate of data? What if someone’s ‘public’ blog post is taken out of context and analyzed in a way that the author never imagined? What does it mean for someone to be spotlighted or to be analyzed without knowing it? Who is responsible for making certain that individuals and communities are not hurt by the research process? What does consent look like?”  These are questions that I would also apply to building repositories of private information that people submit willingly and with consent.

One suggestion that came out of the iConference talk was to think about the metaphors we use to describe data (Is it a mirror?  Is it a window?) and use that as a lens for thinking through some of the issues surrounding the ethics of data collection.  What are the consequences of adhering to a particular set of metaphors about data in terms of how we talk to participants?  These issues also suggest to me that researchers should take a proactive stance with IRBs, suggesting ways of holding ourselves accountable for the privacy and well-being of participants.  I know I’ve been guilty of being a little vague in filling out IRB forms when it came to the benefits my project offers to my participants (I often say something kind of lame like, “It is hoped that participants will benefit from increased understanding of XYZ.”).  For my own work, one thing that comes out of working through some of the issues provoked by the University of Dundee project is a more rigorous consideration of what risks and benefits truly mean for participants in my projects, not only in the process of conducting research, but in the long term of acquiring and sharing information gathered about participants’ lives.

Using Off-the-shelf Software for basic Twitter Analysis

Mary Gray, Mike Ananny and I are writing a paper on queer youth and “Glee” for the American Anthropological Association’s annual meeting (yes, I have the greatest job in the world). This is a multi-methodological study by design, because traditional television viewing practices have become so complex. Besides traditional audience ethnography like interviews and participant observation, we are using textual analysis to analyze episode themes, and we have collected a large corpus of tweets with Glee-related hashtags. This summer, I worked with my high school intern, Jazmin Gonzales-Rivero, to go through this corpus of tweets and pull out useful information for the paper.

We’ve written and published a basic report on using off-the-shelf tools to see patterns and themes in a large Twitter data set quickly and easily.

Abstract:

With the increasing popularity of large social software applications like Facebook and Twitter, social scientists and computer scientists have begun developing innovative approaches to dealing with the vast amounts of data produced and collected in such environments. For qualitative researchers, the methods involved can be daunting and unfamiliar. In this report, we outline some basic procedures for working with a large-scale Twitter data set to answer qualitative inquiries. We use Python, MySQL, and the word-cloud generator Wordle to identify patterns in re-tweets, tweet authors, dates and times of tweets, frequency of hashtags, and frequency of word use. Such data can provide valuable augmentation to qualitative inquiry. This paper is aimed at social scientists and humanities scholars with limited experience with big data and a lack of computing resources to do extensive quantitative research.

Citation:
Marwick, A. and Gonzales-Rivero, J. (2011). Learning to Work with Large-Scale Twitter Data Sets: Using Off-The-Shelf Tools to Quickly and Easily See Tweet Patterns. Microsoft Research Social Media Collective Report, MSR-SMC-11-01, Cambridge, MA. [Download as PDF]

If you’re a seasoned computer scientist or a Big Data aficionado, the information in this paper will seem quite simplistic. But for those of us without programming backgrounds who study Twitter or other forms of social media, the idea of tackling a set of 450,000 tweets can seem quite daunting. In this paper, Jazmin and I walk step-by-step through the methods she used to parse a set of tweets, using free and easily accessible tools like MySQL, Python, and Wordle. We hope this will be helpful for other legal, humanities, and social science scholars who might want to dip their toes into Big Data to augment more qualitative research findings.
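To give a flavor of how simple that first pass can be, here is a minimal sketch in the spirit of the report; it is not the report’s code, and the file name and column layout are assumptions. It tallies hashtags and old-style “RT @user” retweets from a CSV export of tweets, producing counts you can then paste into a tool like Wordle:

```python
# Minimal sketch (not the report's code): tally hashtags and "RT @user" retweets
# from a CSV of tweets with the tweet text in a column named "text".
import csv
import re
from collections import Counter

hashtags = Counter()
retweeted_users = Counter()

with open("glee_tweets.csv", newline="", encoding="utf-8") as f:   # hypothetical file
    for row in csv.DictReader(f):
        text = row["text"]
        hashtags.update(tag.lower() for tag in re.findall(r"#\w+", text))
        retweeted_users.update(user.lower() for user in re.findall(r"RT @(\w+)", text))

print("top hashtags:", hashtags.most_common(10))
print("most retweeted users:", retweeted_users.most_common(10))
```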

Citation: