Big Data, Context Cultures

The latest issue of Media, Culture, and Society features an open-access discussion section responding to SMC all-stars danah boyd and Kate Crawford’s “Critical Questions for Big Data.” Though the article is only a few years old, it’s been very influential and a lot has happened since it came out, so editors Aswin Punathambekar and Anastasia Kavada commissioned a few responses from scholars to delve deeper into danah and Kate’s original provocations.

The section features pieces by Anita Chan on big data and inclusion, André Brock on “deeper data,” Jack Qiu on access and ethics, Zizi Papacharissi on digital orality, and one by me, Nick Seaver, on varying understandings of “context” among critics and practitioners of big data. All of those, plus an introduction from the editors, are open-access, so download away!

My piece, titled “The nice thing about context is that everyone has it,” draws on my research into the development of algorithmic music recommenders, which I’m building on during my time with the Social Media Collective this fall. Here’s the abstract:

In their ‘Critical Questions for Big Data’, danah boyd and Kate Crawford warn: ‘Taken out of context, Big Data loses its meaning’. In this short commentary, I contextualize this claim about context. The idea that context is crucial to meaning is shared across a wide range of disciplines, including the field of ‘context-aware’ recommender systems. These personalization systems attempt to take a user’s context into account in order to make better, more useful, more meaningful recommendations. How are we to square boyd and Crawford’s warning with the growth of big data applications that are centrally concerned with something they call ‘context’? I suggest that the importance of context is uncontroversial; the controversy lies in determining what context is. Drawing on the work of cultural and linguistic anthropologists, I argue that context is constructed by the methods used to apprehend it. For the developers of ‘context-aware’ recommender systems, context is typically operationalized as a set of sensor readings associated with a user’s activity. For critics like boyd and Crawford, context is that unquantified remainder that haunts mathematical models, making numbers that appear to be identical actually different from each other. These understandings of context seem to be incompatible, and their variability points to the importance of identifying and studying ‘context cultures’–ways of producing context that vary in goals and techniques, but which agree that context is key to data’s significance. To do otherwise would be to take these contextualizations out of context.

Experiments in Cowriting

We all have preferences for how we work. Maybe you’re the kind of person who likes to work in complete isolation, in which case this blog post is not for you. But if you’re like me, there’s something appealing about being deeply engaged in your own work in proximity to people who are also being productive. This is why I have long struggled to work at home and instead tend to write in coffee shops and libraries. I’ve also experimented with more intentional forms of co-working. For many years, my most successful attempt was with my friend Stephen. As a DJ, Stephen would work on mixes and set lists, while I would typically revise papers. Beyond the fact that we’ve been friends for years and enjoy hanging out, I think we both got a lot out of the gentle pressure/quiet support of collocated work. In the last few years, I’ve made several other efforts at co-working, spanning in-person, online and inter-species collaborations (#noclickbait – it’s not as exciting as it sounds), which I thought I’d share below. If you have other ideas for coworking, feel free to share them in the comments!


Presentation: Between Platforms and Community: Moderators on Reddit

This is a presentation by intern Nathan Matias on the project he worked on during the summer at the SMC. He has continued to work on his research, so in case you have not read it, here is a more recent post on his work:

Followup: 10 Factors Predicting Participation in the Reddit Blackout. Building Statistical Models of Online Behavior through Qualitative Research

Below is the presentation he did for MSR earlier this month:


(Part 2)

(Part 3)

(Part 4)

Co-creation and Algorithmic Self-Determination: A study of player feedback on game analytics in EVE Online

We are happy to share SMC’s intern Aleena Chia’s presentation of her summer project titled “Co-creation and Algorithmic Self-Determination: A study of player feedback on game analytics in EVE Online”.  

Aleena’s project summary and the videos of her presentation are below:

Digital games are always already information systems designed to respond to players’ inputs with meaningful feedback (Salen and Zimmerman 2004). These feedback loops constitute a form of algorithmic surveillance that have been repurposed by online game companies to gather information about player behavior for consumer research (O’Donnell 2014). Research on player behavior gathered from game clients constitutes a branch of consumer research known as game analytics (Seif et al 2013).[1] In conjunction with established channels of customer feedback such as player forums, surveys, polls, and focus groups, game analytics informs companies’ adjustments and augmentations to their games (Kline et al 2005). EVE Online is a Massively Multiplayer Online Game (MMOG) that uses these research methods in a distinct configuration. The game’s developers assemble a democratically elected council of players tasked with the filtration of player interests from forums to inform their (1) agenda setting and (2) contextualization of game analytics in the planning and implementation of adjustments and augmentations.

This study investigates the council’s agenda setting and contextualization functions as a form of co-creation that draws players into processes of game development, as interlocutors in consumer research. This contrasts with forms of co-creation that emphasize consumers’ contributions to the production and circulation of media content and experiences (Banks 2013). By qualitatively analyzing meeting minutes between EVE Online’s player council and developers over seven years, this study suggests that co-creative consumer research draws from imaginaries of player governance caught between the twin desires of corporate efficiency and democratic efficacy. These desires are darned together through a quantitative public sphere (Peters 2001) that is enabled and eclipsed by game analytics. In other words, algorithmic techniques facilitate collective self-knowledge that players seek for co-creative deliberation; these same techniques also short circuit deliberation through claims of neutrality, immediacy, and efficiency.

The significance of this study lies in its analysis of a consumer public’s (Arvidsson 2013) ambivalent struggle for algorithmic self-determination – the determination by users through deliberative means of how their aggregated acts should be translated by algorithms into collective will. This is not primarily a struggle of consumers against corporations; nor of political principles against capitalist imperatives; nor of aggregated numbers against individual voices. It is a struggle within communicative democracy for efficiency and efficacy (Anderson 2011). It is also a struggle for communicative democracy within corporate enclosures. These struggles grind on productive contradictions that fuel the co-creative enterprise. However, while the founding vision of co-creation gestured towards a win-win state, this analysis concludes that algorithmic self-determination prioritizes efficacy over efficiency, process over product. These commitments are best served by media companies oriented towards user retention rather than recruitment, business sustainability rather than growth, and that are flexible enough to slow down their co-creative processes.

[1] Seif et al (2013) maintain that player behavior data is an important component of game analytics, which includes the statistical analysis, predictive modeling, optimization, and forecasting of all forms of data for decision making in game development. Other data include revenue, technical performance, and organizational process metrics.

(Video 1)

(Video 2)

(Video 3)

(Video 4)

Followup: 10 Factors Predicting Participation in the Reddit Blackout. Building Statistical Models of Online Behavior through Qualitative Research

Three weeks ago, I shared dataviz and statistical models predicting participation in the Reddit Blackout in July 2015. Since then, many moderators have offered feedback and new ideas for the data analysis, alongside their own stories. Earlier today, I shared this update with redditors.

UPDATE, Sept 16, 9pm ET: Redditors brilliantly spotted an important gap in my dataset and worked with me to resolve it. After taking the post down for two days, I am posting the corrected results. Thanks to their quick work, the graphics and findings in this post are more robust.

This July, moderators of 2,278 subreddits joined a “blackout,” demanding better communication and improved moderator tools. As part of my wider research on the work and position of moderators in online communities, I have also been asking the question: who joined the July blackout, and what made some moderators and subs more likely to participate?

Reddit Moderator Network July 2015, including NSFW Subs, with Networks labeled

Academic research on the work of moderators would expect that the most important predictor of blackout participation would be the workload, which creates common needs across subs. Aaron Shaw and Benjamin Mako Hill argue, based on evidence from Wikia, that as the work of moderating becomes more complex within a community, moderators grow in their own sense of common identity and common needs as distinct from their community (read Shaw and Hill’s Wikia paper here). Postigo argues something similar in terms of moderators’ relationship to a platform: when moderators feel like they’re doing huge amounts of work for a company that’s not treating them well, they can develop common interests and push back (read my summary of Postigo’s AOL paper here).

Testing Redditors’ Explanations of The Blackout

After I posted an initial data analysis to reddit three weeks ago, dozens of moderators generously contacted me with comments and offers to let me interview them. In this post, I test hypotheses straight from redditors’ explanations of what led different subreddits to join the blackout. By putting all of these hypotheses into one model, we can see how important they were across reddit, beyond any single sub. (see my previous post) (learn more about my research ethics and my promises to redditors)


  • Subs who shared mods with other blackout subs were more likely to join the blackout, but controlling for that:
  • Default subs were more likely to join the blackout
  • NSFW subs were more likely to join the blackout
  • Subs with more moderators were slightly more likely to join the blackout
  • More active subs were more likely to join the blackout
  • More isolated subs were less likely to join the blackout
  • Subs whose mods participate in metareddits were more likely to join the blackout
  • Subs whose mods get and give help in moderator-specific subs were no more or less likely to join the blackout

In my research I have read over a thousand reddit threads, interviewed over a dozen moderators, archived discussions in hundreds of subreddits, and collected data from the reddit API— starting before the blackout. Special thanks to everyone who has spoken with me and shared data.

Improving the Blackout Dataset With Comment Data

Based on conversations with redditors, I collected more data:

  • Instead of the top 20,000 subreddits by subscribers, I now focus on the top subreddits by number of comments in June 2015, thanks to a comment dataset collected by /u/Stuck_In_the_Matrix
  • I updated my /u/GoldenSights amageddon dataset to include 400 additional subs, after feedback from redditors on /r/TheoryOfReddit
  • I include “NSFW” subreddits intended for people over 18
  • I account for more bots thanks to redditor feedback
  • I account for changes in subreddit leadership (with some gaps for subreddits that have experienced substantial leadership changes since July)

In this dataset, half of the 10 most active subs joined the blackout, 24% of the 100 most active, 14.2% of the 1,000 most active, and 4.7% of the 20,000 most active subreddits.

To illustrate the data, here are two charts of the 52,745 most active subreddits as they would have stood at the end of June. The font size and node size are related to the log-transformed number of comments from June. Ties between subreddits represent shared moderators. The charts are laid out using the ForceAtlas2 layout on Gephi, which has separated out some of the more prominent subreddit networks, including the ImaginaryNetwork, the “SFW Porn” Network, and several NSFW networks (I’ve circled notable networks in the network graph at the top of this post).

Reddit Blackout July 2015: Top 20,000 Subreddits by comments

Redditors’ Explanations Of Blackout Participation

With 2,278 subreddits joining the blackout, redditors have many theories for what experiences and factors led subs to join the blackout. In the following section, I share these theories and then test one big logistic regression model that accounts for all of the theories together. In these tests, I consider 52,745 subreddits that had at least one comment in June 2015. A total of 1,342 of these subreddits joined the blackout.

The idea of blacking out had come up before. According to one moderator, blacking out was first discussed by moderators three years ago as a way to protest Gawker’s choice to publish details unmasking a reddit moderator. Although some subs banned Gawker URLs from being posted to their communities, the blackout didn’t take off. While some individual subreddits have blacked out in the intervening years, this was the first time that many subs joined together.

I tested these hypotheses with the set of (Firth) logistic regression models shown below. The final model (on the right) offers the best fit of all the models, with a McFadden R² of 0.123, which is pretty good.

Preliminary logistic regression results, J. Nathan Matias, Microsoft Research
Published on September 14, 2015
More info about this research:
Contact: /u/natematias

N = top 52,745 subreddits in terms of June 2015 comments, including NSFW, for subreddits still available on July 2
Comment dataset:
List of subreddits "going private": 
Moderator network queried in June 2015, with gap filling in July 2015 and September 2015

                                                                  Dependent variable:                             
                                         (1)        (2)        (3)        (4)        (5)        (6)        (7)    
default sub                             3.161***   1.065***   1.070***   0.814**    0.720**    0.693**    0.705**  
                                       (0.294)    (0.305)    (0.317)    (0.336)    (0.337)    (0.337)    (0.339)  
NSFW sub                                0.179*     0.235**    0.268***   0.291***   0.288***   0.314***   0.313*** 
                                       (0.098)    (0.099)    (0.099)    (0.101)    (0.101)    (0.102)    (0.102)  
log(comments in june 2015)                         0.263***   0.268***   0.246***   0.258***   0.256***   0.257*** 
                                                  (0.009)    (0.010)    (0.011)    (0.011)    (0.011)    (0.011)  
moderator count                                               0.066***   0.055***   0.053***   0.051***   0.051*** 
                                                             (0.007)    (0.008)    (0.008)    (0.008)    (0.008)  
log(comments):moderator count                                -0.006***  -0.005***  -0.005***  -0.004***  -0.004*** 
                                                             (0.001)    (0.001)    (0.001)    (0.001)    (0.001)  
log(mod roles in other subs)                                            -0.293***  -0.328***  -0.334***  -0.332*** 
                                                                        (0.033)    (0.033)    (0.033)    (0.033)  
log(mod roles in blackout subs)                                          2.163***   2.134***   2.134***   2.133*** 
                                                                        (0.096)    (0.096)    (0.096)    (0.096)  
log(mod roles in other subs):log(mod roles in blackout subs)            -0.255***  -0.248***  -0.254***  -0.254*** 
                                                                        (0.017)    (0.017)    (0.017)    (0.017)  

log(sub isolation, by comments)                                                    -2.608***  -2.568***  -2.569*** 
                                                                                   (0.347)    (0.345)    (0.345)  
log(metareddit participation per mod in june 2015)                                             0.100***   0.103*** 
                                                                                              (0.036)    (0.036)  
log(mod-specific sub participation per mod in june 2015)                                                 -0.024  
Constant                               -3.608***  -4.517***  -4.677***  -4.655***  -4.467***  -4.469***  -4.469*** 
                                       (0.028)    (0.050)    (0.054)    (0.058)    (0.060)    (0.060)    (0.060)  
Observations                            52,745     52,745     52,745     52,745     52,745     52,745     52,745  
Log Likelihood                        -6,520.505 -6,171.874 -6,130.725 -5,861.099 -5,806.916 -5,803.188 -5,803.098
Akaike Inf. Crit.                     13,047.010 12,351.750 12,273.450 11,740.200 11,633.830 11,628.380 11,630.200
Note:                                                                                  *p<0.1; **p<0.05; ***p<0.01
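
As a quick consistency check on the table, the Akaike Information Criterion can be recomputed from each column's log likelihood with the standard definition AIC = 2k - 2·logL, where k is the number of estimated coefficients including the constant. This is a sketch, not the author's code. The parameter counts in `n_params` are an assumption, obtained by counting the rows that carry an estimate in each column.

```python
# Values copied from the regression table, columns (1) through (7).
log_likelihood = [-6520.505, -6171.874, -6130.725, -5861.099,
                  -5806.916, -5803.188, -5803.098]
reported_aic = [13047.010, 12351.750, 12273.450, 11740.200,
                11633.830, 11628.380, 11630.200]

# ASSUMPTION: number of coefficients per column (including the constant),
# counted from the rows of the table that have an estimate in that column.
n_params = [3, 4, 6, 9, 10, 11, 12]


def aic(log_lik, k):
    """Standard AIC: twice the parameter count minus twice the log likelihood."""
    return 2 * k - 2 * log_lik


recomputed = [aic(ll, k) for ll, k in zip(log_likelihood, n_params)]

for column, (ours, theirs) in enumerate(zip(recomputed, reported_aic), start=1):
    print(f"model ({column}): recomputed {ours:.3f}, reported {theirs:.3f}")

# Index of the column with the lowest reported AIC (0-based).
lowest_aic_column = min(range(len(reported_aic)), key=lambda i: reported_aic[i])
```

On the figures in the table, column (7) has the highest log likelihood while column (6) has a marginally lower AIC, which suggests the last predictor added changes the fit very little.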


The network of moderators who moderate blackout subs is the strongest predictor in this model. At a basic level, it makes sense that moderators who participated in the blackout in one subreddit might participate in another. Making sense of this kind of network relationship is a hard problem in network science, and this model doesn’t include time as a dimension, so we don’t consider which subs went dark before which others. If I had data on the time that subreddits went dark, it might be possible to better research this interesting question, like Bogdan State and Lada Adamic did with their paper on the Facebook equality meme.

Hypothesis 1: Default subs were more likely to join the blackout

In interviews, some moderators pointed out that “most of the conversation about the blackout first took place in the default mod irc channel.” Moderators of top subs described the blackout as mostly an issue concerning default or top subreddits.

This hypothesis is supported in the final model. For example, while a non-default subreddit with 4 million monthly comments had a 32.9% chance of joining the blackout (holding all else at their means), a default subreddit of the same size had a 48.6% chance of joining the blackout, on average in the population of subs.
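
The step from a regression coefficient to a change in probability can be sketched as follows. In a logistic regression a coefficient is an additive shift in log-odds, so a yes/no predictor multiplies the odds by exp(coefficient). The helper name `shift_probability` is hypothetical, and 0.705 is the rounded default-sub estimate from column (7) of the table. This is a rough illustration rather than the author's procedure: applied to the 32.9% baseline it lands near 50%, close to but not identical to the 48.6% that the post reports from the final model.

```python
import math


def shift_probability(baseline, coefficient):
    """Apply a log-odds shift to a baseline probability.

    Hypothetical helper: converts the probability to odds, multiplies by
    exp(coefficient), and converts back to a probability.
    """
    odds = baseline / (1 - baseline)
    new_odds = odds * math.exp(coefficient)
    return new_odds / (1 + new_odds)


# Rounded coefficient for "default sub" in column (7) of the table.
default_coef = 0.705

# Baseline quoted in the text for a non-default sub with 4 million comments.
non_default = 0.329

estimate = shift_probability(non_default, default_coef)
print(f"approximate chance for a default sub: {estimate:.1%}")
```

The gap between this back-of-the-envelope figure and the reported one is a reminder that published coefficients are rounded and that the reported probabilities are averaged over the population of subs.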

Hypothesis 2: Subs with more comment activity were more likely to join the blackout

Moderators of large, non-default subreddits also had plenty of reasons to join the blackout, either because they also shared the need for better moderating tools, or because they had more common contact and sympathy with other moderators as a group.

Even among subreddits that declined to join the blackout, many moderators described feeling obligated to make a decision one way or another. This surprised moderators of large subreddits, who saw it as an issue for larger groups. Size was a key issue in the hundreds of smaller groups that discussed the possibility, with many wondering if they had much in common with larger subs, or whether blacking out their smaller sub would make any kind of dent in reddit’s advertising revenue.

In the final model, larger subs were more likely to join the blackout, a logarithmic relationship that is mediated by the number of moderators. When we set everything else to its mean, we can observe how this looks for subs of different sizes. In the 50th percentile, subreddits with 6 comments per month had a 1.6% chance of joining the blackout, a number that adds up with so many small subs. In the 75th percentile, subs with 46 comments a month had a 2.5% chance of joining the blackout. Subs with 1,000 comments a month had a 5.4% chance of joining, while subs with 100,000 comments a month had a 15.8% chance of joining the blackout, on average in the population of subs, holding all else constant.

Hypothesis 3: NSFW subs were more likely to join the blackout

In interviews, some moderators said that they declined to join the blackout because they saw it as something associated with support for hate speech subreddits taken down by the company in June or other parts of reddit they preferred not to be associated with. Default moderators denied this flatly, describing the lengths they went to dissociate from hate speech communities and sentiment against then-CEO Ellen Pao. Nevertheless, many journalists drew this connection, and moderators were worried that they might become associated with those subs despite their efforts.

Another possibility is that NSFW subs have to do more work to maintain subs that offer high quality NSFW conversations without crossing lines set by reddit and the law. Perhaps NSFW subs just have more work, so they were more likely to see the need for better tools and support from reddit.

In the final model, NSFW subs were more likely to join the blackout than non-NSFW subs. For example, while a non-default, non-NSFW subreddit with 22,800 comments had an 11.4% chance of joining the blackout (holding all else at their means), an NSFW subreddit of the same size had a 15.3% chance of joining the blackout, on average in the population of subs. Among less popular subs, a non-NSFW sub with 1,000 comments per month had a 5.4% chance of joining the blackout, while an NSFW sub of the same size had a 7.5% chance of joining, holding all else constant, on average in the population of subs.

Hypothesis 4: More isolated subs were less likely to join the blackout

In the interviews I conducted, as well as the 90 or so interviews I read on /r/subredditoftheday, moderators often contrasted their communities with “the rest of reddit.” When I asked one moderator of a support-oriented subreddit about the blackout, they mentioned that “a lot of the users didn’t really identify with the rest of reddit.” Subscribers voted against the blackout, describing it as “a movement we didn’t identify with,” this moderator said.

To test hypotheses about more isolated subs, I parsed all comments in every public subreddit in June 2015, generating an “in/out” ratio. This ratio consists of the total comments within the sub divided by the total comments made elsewhere by the sub’s commenters. A subreddit whose users stayed in one sub would have a ratio above 1, while a subreddit whose users commented widely would have a ratio below 1. I tested other measures, such as the average of per-user in/out ratios, but the overall in/out ratio seems the best.
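
The overall in/out ratio described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the analysis. The input format (one `(author, subreddit)` pair per comment), the function name `in_out_ratios`, and the choice to report infinity when a sub's commenters never post elsewhere are all hypothetical.

```python
from collections import defaultdict


def in_out_ratios(comments):
    """Return {subreddit: in/out ratio} from (author, subreddit) pairs.

    "in"  = comments made inside the sub.
    "out" = comments made elsewhere by the people who commented in the sub.
    """
    comments_in_sub = defaultdict(int)     # comments made inside each sub
    comments_by_author = defaultdict(int)  # all comments by each author
    authors_of_sub = defaultdict(set)      # who commented in each sub

    for author, sub in comments:
        comments_in_sub[sub] += 1
        comments_by_author[author] += 1
        authors_of_sub[sub].add(author)

    ratios = {}
    for sub, authors in authors_of_sub.items():
        inside = comments_in_sub[sub]
        everywhere = sum(comments_by_author[a] for a in authors)
        outside = everywhere - inside
        # ASSUMPTION: a sub whose commenters never post elsewhere is
        # treated as infinitely isolated.
        ratios[sub] = inside / outside if outside else float("inf")
    return ratios


# Toy data: "dan" only ever comments in sub "d".
sample = [
    ("ann", "a"), ("ann", "a"), ("ann", "b"),
    ("bob", "a"), ("bob", "b"), ("bob", "b"), ("bob", "c"),
    ("cat", "c"),
    ("dan", "d"), ("dan", "d"),
]
ratios = in_out_ratios(sample)
for sub in sorted(ratios):
    print(sub, ratios[sub])
```

A ratio below 1 marks a sub whose commenters are mostly active elsewhere, and a ratio above 1 marks a sub whose commenters mostly stay put.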

In the final model, more isolated subs were less likely to join the blackout, on a logarithmic scale. Most subreddits’ commenters participate actively elsewhere on reddit, at a mean in/out ratio of 0.24. That means that on average, a subreddit’s participants make 4 times more comments outside a sub than within it. At this level, holding everything else at their means, a subreddit with 1,000 comments a month had a 4.0% chance of joining the blackout. A similarly-sized subreddit whose users made half of their comments within the sub (in/out ratio of 1.0) had just a 1% chance of joining the blackout. Very isolated subs whose users commented twice as much in-sub had a 0.3% chance of joining the blackout, on average in the population of subs, holding all else constant.

Hypothesis 5: Subs with more moderators were more likely to join the blackout

This one was my hypothesis, based on a variety of interview details. Subs with more moderators tend to have more complex arrangements for moderating and tend to encounter limitations in mod tools. Subs with more mods also have more people around, so their chances of spotting the blackout in time to participate were also probably higher. On the other hand, subs with more activity tend to have more moderators, so it’s important to control for the relationship between mod count and sub activity.

In the final model, this hypothesis is supported, but only weakly: subs with more moderators were slightly more likely to join the blackout. There is a very small relationship here, and the relationship is mediated by the number of comments. For a sub with 1,000 comments per month, with everything else at its average, a subreddit with 3 moderators (the average) had a 5.4% chance of joining the blackout. A subreddit with 8 moderators had a 6% chance of joining the blackout, on average in the population of subs.
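
The interaction between moderator count and comment volume can be made concrete with the rounded column (7) coefficients from the table. The sketch below is an illustration, not the author's code. It assumes that `log` in the table is the natural logarithm and that comments enter the model as log(comments). The function names are hypothetical.

```python
import math

# Rounded coefficients from column (7) of the regression table.
MOD_COUNT = 0.051     # moderator count
INTERACTION = -0.004  # log(comments):moderator count


def per_moderator_log_odds(comments):
    """Change in log-odds for one extra moderator at a given comment volume.

    ASSUMPTION: natural log, with comments entering as log(comments).
    """
    return MOD_COUNT + INTERACTION * math.log(comments)


def shift_probability(baseline, log_odds_change):
    """Apply a log-odds change to a baseline probability (hypothetical helper)."""
    odds = baseline / (1 - baseline) * math.exp(log_odds_change)
    return odds / (1 + odds)


# Worked example from the text: 1,000 comments per month, 3 versus 8 moderators.
slope = per_moderator_log_odds(1000)
p_eight_mods = shift_probability(0.054, slope * (8 - 3))
print(f"per-moderator log-odds at 1,000 comments: {slope:.4f}")
print(f"approximate chance with 8 moderators: {p_eight_mods:.1%}")

# Comment volume at which the per-moderator effect would change sign.
crossover = math.exp(-MOD_COUNT / INTERACTION)
print(f"sign change near {crossover:,.0f} comments per month")
```

Under these assumptions the calculation reproduces the move from 5.4% to roughly 6% quoted in the text. Because the coefficients are rounded to three decimals, the comment volume at which the effect changes sign is only a very rough figure.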

Hypothesis 6: Subs with admins as mods were more (or less) likely to join the blackout

I heard several theories about admins. During the blackout, some redditors claimed that admins were preventing subs from going private. In interviews, moderators tended to voice the opposite opinion. They argued that subs with admin contact were joining the blackout in order to send a message to the company, urging it to pay more attention to employees who advocated for moderator interests. Moderators at smaller subs said, “we felt 100% independent from admin assistance so it really wasn’t our fight.”

None of my hypothesis tests showed any statistically significant relationship between current or past admin roles as moderators and participation in the blackout, either way. For that reason, I omit it from my final model.

Hypothesis 7: Subs with moderators who moderated other subs were more likely to join the blackout

I’ve been wondering if moderators with multiple mod roles elsewhere on reddit would be more likely to join the blackout, perhaps because they had greater “solidarity” with other subreddits, or because they were more likely to find out about the blackout.

In the final model, the reverse is supported. Subs that shared moderators with other subs were actually less likely to join the blackout, a relationship that is mediated by the number of moderators who also modded blackout subs. Holding blackout sub participation constant, a sub of 1,000 comments per month and 3 moderator roles shared with other subs had a 5.7% chance of joining the blackout, while a more connected sub with 6 shared moderator roles (in the 4th quantile) had a 4.2% chance of joining the blackout, on average in the population of subs, holding all else constant.

Hypothesis 8: Subreddits with mods who also moderate other blackout subs were more likely to join the blackout.

This hypothesis is also a carry-over from my previous analysis, where I found a statistically-significant relationship. Note that making sense of this kind of network relationship is a hard problem in network science, and that we can’t use this to test “influence.”

In the final model, subreddits with mods with roles in other blackout subs were more likely to join the blackout, a relationship on a log scale that is mediated by the number of moderator roles shared with other subs more generally. 19% of subs in the sample share at least one moderator with a blackout sub, after removing moderator bots. A sub with 1,000 comments per month that didn’t have any overlapping moderators with blackout subs had a 3.2% chance of joining the blackout, while a sub with one overlapping moderator had an 11.1% chance to join, and a sub with 2 overlapping moderators had a 21.1% chance to join. A sub with 6 moderators overlapping with blackout subs had a 57.2% chance of joining the blackout.

I tend to see the network of co-moderation as a control variable. We can expect that moderators who joined the blackout would be likely to support it across the many subs they moderate. By accounting for this in the model, we get a clearer picture on the other factors that were important.

Hypothesis 9: Subs with moderators who participate in metareddits were more likely to join the blackout

In interviews, several moderators described learning about the blackout from “meta-reddits” which cover major events on the site, and which mostly stayed up during the blackout. Just like we might expect more isolated subs to stay out of the blackout, we might expect moderators who get involved in reddit-wide meta-discussion to join the blackout. I took my list of metareddits from this TheoryOfReddit wiki post.

In the final model, subs with moderators who participate in metareddits were more likely to join the blackout, on a logarithmic scale. Most moderators on the site do not participate in metareddits. A sub of 1,000 comments per month with no metareddit participation by its moderators had a 5.3% chance of joining the blackout, while a similar sub whose moderators made 5 comments on any metareddit per month had a 6.3% chance of joining the blackout.

Hypothesis 10: Subs with mods participating in moderator-focused subs were more likely to join the blackout

Although key moderator subs like /r/defaultmods and /r/modtalk are private and inaccessible to me, I could still test a “solidarity” theory. Perhaps moderators who participate in mod-specific subs, who have helped and been helped by other mods, would be more likely to join the blackout?

Although this predictor is significant in a single-covariate model, when you account for all of the other factors, mod participation in moderator-focused subs is not a significant predictor of participation in the blackout.

This surprises me. I wonder: since moderator-specific subs tend to have low volume, one month of comments may just not be enough to get a good sense of which moderators participate in those subs. Also, this dataset doesn’t include IRC discussions (nor will it ever), where moderators seem mostly to hang out with and help each other. But from the evidence I have, it looks like help from moderator-focused subs played no part in swaying moderators to join the blackout.

So, how DID solidarity develop in the blackout?

The question is still open, but from these statistical models, it seems clear that factors beyond moderator workload had a big role to play, even when controlling for mods of multiple subs that joined the blackout.

In further analysis in the next week, I’m hoping to include:

  • Activity by mods in each sub (comments, deletions)
  • Comment karma, as another measure of activity (still making sense of the numbers to see if they are useful here)
  • The complexity of the subreddit, as measured by things in the sidebar (possibly)

Building Statistical Models of Online Behavior through Qualitative Research

The process of collaborating with redditors on my statistical models has been wonderful. As I continue this work, I’m starting to think more and more about the idea of participatory hypothesis testing, in parallel with work we do at MIT around Freire-inflected practices of “popular data”. If you’ve seen other examples of this kind of thing, do send them my way!