“Metaphors of Data” reading list

With generous contributions from the Social Media Collective extended family, I have put together a list that brings together academic and popular writing on metaphors of data, along with pieces that approach questions of data and commercial/political power. The goal in assembling this list was to catalog resources that are helpful in unpacking and critiquing different metaphors, ranging from the hype around big data as the new oil to less common (and perhaps more curious) formulations, such as data as sweat or toxic waste.

 


These resources were originally compiled to support a workshop on data and power (organized at the Mobile Life Centre in Stockholm, Sweden). Sara Watson’s insightful DIS piece on the Industrial Metaphors of Big Data and Maciej Cegłowski’s brilliant talk Haunted By Data turned out to be particularly helpful for provoking conversation among scholars and practitioners. We hope the list will also be useful to others having critical conversations about data.

The list is best seen as an unfinished, non-exhaustive document. We welcome comments and, in particular, recommendations of further work to include. Please use the comment space at the bottom of the page to offer suggestions, and we will try to update the list in light of them.

Big Data, Context Cultures

The latest issue of Media, Culture & Society features an open-access discussion section responding to SMC all-stars danah boyd and Kate Crawford’s “Critical Questions for Big Data.” Though the article is only a few years old, it’s been very influential and a lot has happened since it came out, so editors Aswin Punathambekar and Anastasia Kavada commissioned a few responses from scholars to delve deeper into danah and Kate’s original provocations.

The section features pieces by Anita Chan on big data and inclusion, André Brock on “deeper data,” Jack Qiu on access and ethics, Zizi Papacharissi on digital orality, and one by me, Nick Seaver, on varying understandings of “context” among critics and practitioners of big data. All of those, plus an introduction from the editors, are open-access, so download away!

My piece, titled “The nice thing about context is that everyone has it,” draws on my research into the development of algorithmic music recommenders, which I’m building on during my time with the Social Media Collective this fall. Here’s the abstract:

In their ‘Critical Questions for Big Data’, danah boyd and Kate Crawford warn: ‘Taken out of context, Big Data loses its meaning’. In this short commentary, I contextualize this claim about context. The idea that context is crucial to meaning is shared across a wide range of disciplines, including the field of ‘context-aware’ recommender systems. These personalization systems attempt to take a user’s context into account in order to make better, more useful, more meaningful recommendations. How are we to square boyd and Crawford’s warning with the growth of big data applications that are centrally concerned with something they call ‘context’? I suggest that the importance of context is uncontroversial; the controversy lies in determining what context is. Drawing on the work of cultural and linguistic anthropologists, I argue that context is constructed by the methods used to apprehend it. For the developers of ‘context-aware’ recommender systems, context is typically operationalized as a set of sensor readings associated with a user’s activity. For critics like boyd and Crawford, context is that unquantified remainder that haunts mathematical models, making numbers that appear to be identical actually different from each other. These understandings of context seem to be incompatible, and their variability points to the importance of identifying and studying ‘context cultures’–ways of producing context that vary in goals and techniques, but which agree that context is key to data’s significance. To do otherwise would be to take these contextualizations out of context.
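As a concrete illustration of the sensor-reading operationalization the abstract describes, here is a minimal sketch of contextual pre-filtering, a common strategy in the context-aware recommender systems literature (not the method of any particular system discussed in the piece); the users, tracks, and context labels are invented.

```python
# A minimal sketch of contextual pre-filtering: keep only the interactions
# whose sensed context matches the target context, then recommend within
# that slice. All names and data here are invented for illustration.

from collections import Counter

# (user, track, context) triples; "context" stands in for a sensor-derived
# label like activity or location.
listens = [
    ("ana", "track_a", "running"),
    ("ana", "track_b", "commuting"),
    ("ben", "track_a", "running"),
    ("ben", "track_c", "running"),
    ("cal", "track_c", "running"),
]

def recommend(user: str, context: str, k: int = 1) -> list[str]:
    """Most popular tracks in this context that the user hasn't heard."""
    heard = {t for u, t, _ in listens if u == user}
    counts = Counter(t for _, t, c in listens if c == context and t not in heard)
    return [track for track, _ in counts.most_common(k)]

print(recommend("ana", "running"))  # -> ['track_c']
```

Note how a sketch like this quietly decides what context is: whatever the sensor happened to label.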

A “pay it back tax” on data brokers: a modest (and also politically untenable and impossibly naïve) policy proposal

I’ve just returned from the “Social, Cultural, and Ethical Dimensions of Big Data” event, held by the Data & Society Initiative (led by danah boyd), and spurred by the efforts of the White House Office of Science and Technology Policy to develop a comprehensive report on issues of privacy, discrimination, and rights around big data. And my head is buzzing. (Oh boy. Here he goes.) There must be something about me and workshops aimed at policy issues. Even though this event was designed to be wide-ranging and academic, I always get this sense of urgency or pressure that we should be working towards concrete policy recommendations. It’s something I rarely do in my scholarly work (to its detriment, I’d say, wouldn’t you?). But I don’t tend to come up with reasonable, incremental, or politically viable policy recommendations anyway. I get frustrated that the range of possible interventions feels so narrow, with so many players that must go untouched, so many underlying presumptions left unchallenged. I don’t want to suggest some progressive but narrow intervention, and in the process confirm and reify the way things are – though believe me, I admire the people who can do this. I long for there to be a robust vocabulary for saying what we want as a society and what we’re willing to change, reject, regulate, or transform to get it. (But at some point, if it’s too pie in the sky, it ceases being a policy recommendation, doesn’t it?) And this is especially true when it comes to daring to restrain commercial actors who are doing something that can be seen as publicly detrimental, but somehow have this presumed right to engage in this activity because they have the right to profit. I want to be able to say, in some instances, “sorry, no, this simply isn’t a thing you get to profit on.”

All that said, I’m going to propose a policy recommendation. (It’s going to be a politically unreasonable one, you watch.)

I find myself concerned about this hazy category of stakeholders that, at our event, were generally called “data brokers.” There are probably different kinds of data brokers we might think about: companies that buy up and combine data about consumers; companies that scrape public data from wherever it is available and create troves of consumer profiles. I’m particularly troubled by the kind of companies that Kate Crawford discussed in her excellent editorial for Scientific American a few weeks ago — like Turnstyle, a company that has set up dummy wifi transponders in major cities to pick up all those little pings your smartphone gives off when it’s looking for networks. Turnstyle coordinates those pings into a profile of how you navigated the city (i.e. you and your phone walked down Broadway, spent twenty minutes in the bakery, then drove to the south side), then aggregates those navigation profiles into data about consumers and their movements through the city and sells them to marketers. (OK, that is particularly infuriating.) What defines this category for me is that data brokers do not gather data as part of a direct service they provide to those individuals. Instead they gather at a point once removed from the data subjects: purchasing the data gathered by others, scraping our public utterances or traces, or tracking the evidence of our activity we give off. I don’t know that I can be much more specific than that, or that I’ve captured all the flavors, in part because I’ve only begun to think about them (oh good, then this is certain to be a well-informed suggestion!) and because they are a shadowy part of the data industry, relatively removed from consumers, with little need to advertise or maintain a particularly public profile.
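To make that last mechanism concrete: phones broadcast wifi probe requests tagged with a hardware identifier, and a network of passive sensors can stitch those scattered sightings into a movement trace. Here is a minimal sketch of that aggregation, with invented identifiers, locations, and timestamps (not Turnstyle’s actual pipeline):

```python
# A minimal sketch of how scattered wifi "pings" become a movement profile:
# group sightings by device identifier, then sort each group by time.
# The MAC addresses, sensor locations, and timestamps are invented.

from collections import defaultdict

# (device_mac, sensor_location, timestamp), as a passive sensor might log them.
sightings = [
    ("aa:bb:cc:dd:ee:ff", "broadway_north", "09:01"),
    ("aa:bb:cc:dd:ee:ff", "bakery", "09:10"),
    ("11:22:33:44:55:66", "south_side", "09:12"),
    ("aa:bb:cc:dd:ee:ff", "bakery", "09:28"),
    ("aa:bb:cc:dd:ee:ff", "south_side", "09:55"),
]

traces = defaultdict(list)
for mac, location, ts in sightings:
    traces[mac].append((ts, location))

for mac, trace in traces.items():
    path = " -> ".join(loc for _, loc in sorted(trace))
    print(f"{mac}: {path}")
# aa:bb:cc:dd:ee:ff: broadway_north -> bakery -> bakery -> south_side
```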

I think these stakeholders are in a special category, in terms of policy, for a number of reasons. First, they are important to questions of privacy and discrimination in data, as they help to move data beyond the settings in which we authorized its collection and use. Second, they are outside of traditional regulations that are framed around specific industries and their data use (like HIPAA provisions that regulate hospitals and medical record keepers, but not data brokers who might nevertheless traffic in health data). Third, they’re a newly emergent part of the data ecosystem, so they have not been thought about in the development of older legislation. But most importantly, they are a business that offers no social value to the individuals or society whose data is being gathered. (Uh oh.) In all of the more traditional instances in which data is collected about individuals, there is some social benefit or service presumed to be offered in exchange. The government conducts a census, but we authorized that, because it is essential to the provision of government services: proportional representation of elected officials, fair imposition of taxation, etc. Verizon collects data on us, but it does so as a fundamental element of the provision of telephone service. Facebook collects all of our traces, and while that data is immensely valuable in its own right and to advertisers, it is also an important component in providing their social media platform. I am by no means saying that there are no possible harms in such data arrangements (I should hope not), but at the very least, the collection of data comes with the provision of service, and there is a relationship (citizen, customer) that provides a legally structured and sanctioned space for challenging the use and misuse of that data — class action lawsuit, regulatory oversight, protest, or just switching to another phone company. (Have you tried switching phone companies lately?) Some services that collect data have even voluntarily sought to do additional, socially progressive things with that data: Google looking for signs of flu outbreaks, Facebook partnering with researchers looking to encourage voting behavior, even OkCupid giving us curious insights about the aggregate dating habits of its customers. (You just love infographics, don’t you.) But the third-party data broker that buys data from an e-commerce site I frequent, or scrapes my publicly available hospital discharge record, or grabs up the pings my phone emits as I walk through town, is building commercial value on my data while offering no value to me, my community, or society in exchange.

So what I propose is a “pay it back tax” on data brokers. (Huh?! Does such a thing exist, anywhere?) If a company collects, aggregates, or scrapes data on people, and does so not as part of a service back to those people (but is that distinction even a tenable one? who would decide and patrol which companies are subject to this requirement?), then they must grant access to their data, and commit 10% of their revenue, to non-profit, socially progressive uses of that data. They could partner with a non-profit, providing it funds and access to data to conduct research. Or, they could make the data and dollars available as a research fund that non-profits and researchers could apply for. Or, as a nuclear option, they could avoid the financial requirement by providing an open API to their data. (I thought your concern about these brokers is that they aggravate the privacy problems of big data, but you’re making them spread that collected data further?) I think there could be valuable partnerships: Turnstyle’s data might be particularly useful for community organizations concerned about neighborhood flow or access for the disabled; health data could be used by researchers or activists concerned with discrimination in health insurance. There would need to be parameters for how that data was used and protected by the non-profits who received it, and perhaps an open access requirement for any published research or reports.

This may seem extreme. (I should say so. Does this mean any commercial entity in any industry that doesn’t provide a service to customers should get a similar tax?) Or, from another vantage point, it could be seen as quite reasonable: companies that collect data on their own have to spend an overwhelming amount of their revenue providing whatever service justifies this data collection; governments that collect data on us are in our service, and make no profit. This is merely 10%, plus sharing a valuable resource. (No, it still seems extreme.) And, if I were aiming more squarely at the concerns about privacy, I’d be tempted to say that data aggregation and scraping could simply be outlawed. (Somebody stop him!) In my mind, it at the very least reasserts the idea that collecting data on individuals and using it as a primary resource upon which to make profit must, on balance, provide some service in return, be it customer service, social service, or public benefit.

This is cross-posted at Culture Digitally.

Big Data Thoughts

[Image: MIT Firehose, via wallg on flickr]

401 Unauthorized, 403 Forbidden, 404 Not Found, 500 Internal Server Error & the Firehose

There is this thing called the firehose. I’ve witnessed mathematicians, game theorists, computer scientists and engineers (apparently there is a distinction), economists, business scholars, and social scientists salivate over it (myself included). The Firehose, though technically a term reserved for the Twitter API, has become all-encompassing in the realm of social science, standing for the streams of data from social networking sites that are so large they cannot be processed as they come in. The data are so large, in fact, that coding requires multiple levels of computer-aided refinement, as though when we take data from these sources we are drinking from a firehose. While I cannot pin down the term’s etymology, it seems to have come either from Twitter terminology bleed or from a water fountain at MIT.
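As an illustration of that multi-stage, computer-aided refinement, here is a minimal sketch of a lazy stream pipeline in which no stage ever holds the whole stream in memory; the message format, field names, and input file are hypothetical, not any actual firehose schema.

```python
# A sketch of multi-stage refinement over a stream too big to hold at once:
# each stage is a generator, so records flow through one at a time.

import json
from typing import Iterable, Iterator

def parse(raw_lines: Iterable[str]) -> Iterator[dict]:
    """Stage 1: decode raw JSON lines, silently dropping malformed ones."""
    for line in raw_lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue

def keep_keyword(messages: Iterable[dict], keyword: str) -> Iterator[dict]:
    """Stage 2: coarse filter on a (hypothetical) 'text' field."""
    for msg in messages:
        if keyword in msg.get("text", "").lower():
            yield msg

def project(messages: Iterable[dict]) -> Iterator[tuple]:
    """Stage 3: keep only the fields a human coder will actually read."""
    for msg in messages:
        yield (msg.get("user"), msg.get("timestamp"), msg.get("text"))

# Usage: chain the stages over a (hypothetical) newline-delimited dump.
with open("firehose_sample.jsonl") as f:
    for record in project(keep_keyword(parse(f), "election")):
        print(record)
```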

I am blessed with an advisor who has become the little voice at the back of my head whenever I am thinking about something. Every meeting he asks the same question, one that should be easy to answer but almost never is, especially when we are invested in a topic: “why does this matter?” To date, outside of business uses or artistic exploration, we’ve not made a good case for why big data matters. I think we all want it because we think some hidden truth might be within it. We fetishize big data, and the Firehose that exists behind locked doors, as though it will be the answer to some bigger question. The problem is, there is no question. We, from our own unique, biased, and disciplinary homes, have to come up with the bigger questions. We also have to accept that while data might provide us with some answers, perhaps we should be asking questions that go deeper than that, in a research practice that requires more reflexivity than we are seeing right now. I would love to see more nuanced readings that acknowledge the biases, gaps, and holes at all levels of big data curation.

Predictive Power of Patterns

One of my favorite anecdotes showing the power of big data is the Target incident from February 2012. Target predicted a teenage girl was pregnant, and acted on it, before she told her family: they sent baby-centric coupons to her. Her father called Target, very angry, then called back later to apologize because there were some things his daughter hadn’t told him. The media storm following the event painted a world both in awe of and creeped out by Target’s predictive power. How could a seemingly random bit of shopping history point to a pattern showing that a customer was pregnant? How come I hadn’t noticed that they were doing this to me too? Since the incident went public, and Target shared how it learned to hide the targeted ads and coupons to minimize the creepy factor, I’ve enjoyed receiving the Target coupon books that always come in pairs to my home, one for me and one for my husband, that look the same on the surface but have slight variations on the inside. Apparently Target has learned that if the coupons for me go to him, they will be used. This is because every time I get my coupon books I complain to him about my crappy coupon for something I need. He laughs at me and shows me his coupon, usually worth twice as much as mine if I just spend a little bit more. It almost always works.
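Target’s actual model is proprietary; reporting at the time described a “pregnancy prediction” score built up from a basket of indicator products. A toy sketch of that idea, with invented products, weights, and threshold:

```python
# A toy propensity score of the kind described above. The signal products,
# weights, and threshold are invented purely for illustration.

PREGNANCY_SIGNALS = {
    "unscented lotion": 0.4,
    "prenatal vitamins": 0.9,
    "cotton balls (large bag)": 0.2,
    "zinc supplements": 0.3,
}

def pregnancy_score(purchase_history: list[str]) -> float:
    """Sum the weights of signal products found in a shopper's history."""
    return sum(
        weight
        for product, weight in PREGNANCY_SIGNALS.items()
        if product in purchase_history
    )

shopper = ["shampoo", "unscented lotion", "prenatal vitamins"]
if pregnancy_score(shopper) > 1.0:  # invented threshold
    print("send the baby-centric coupon book")
```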

In 2004 Lou Agosta wrote a piece titled “The Future of Data Mining – Predictive Analytics.” With the proliferation of social media, API data access, and the beloved yet mysterious firehose, I think we can say the future is now. Our belief in, and cyclical relationship with, progress as a universal inevitability turns big data into a universal good. While I am not denying the usefulness of finding predictive patterns (clearly Target knew the girl was pregnant and was able to capitalize on that knowledge), for the social scientist this pattern identification for outcome prediction, followed by verification, should not be enough. Part of our fetishization of big data seems to be the idea that somehow it will allow us not just to anticipate, but to know, the future. Researchers across fields and industries are working on ways to extract meaningful, predictive data from these nearly indigestible datastreams. We have to remember that even in big data there are gaps, holes, and disturbances. Rather than looking at what big data can tell us, we should be looking towards it as an exploratory method that can help us define different problem sets and related questions.

Big Data as Method?

Recently I went to a talk by a pair of computer scientists who had access to the entire database of Wikipedia. Because they could, they decided to visualize it. After going through slide after slide of pretty colors, they said “who knew there were rainbows in Wikipedia!?”, and then announced that they had moved on from that research. Rainbows can only get me so far. I was stuck asking why this pattern kept repeating itself, and wanting to know how the people creating the data that turned into a rainbow imagined what they were producing. The visualizations didn’t answer anything. If anything, they allowed me to ask clearer, more directed questions. This isn’t to say the work they did wasn’t beautiful. It is and was. But there is so much more work to do. I hope that as big data continues to become something of a social norm, more people begin to speak across disciplinary lines so that we learn how to use this data in meaningful ways everywhere.

Right now I think that visualization is still central, but that is one of my biases. I think this is because visualization allows for simple identification of patterns. It also allows us to take in petabytes of data at once, to compare different datasets (if similar visualization methods are used), and to experiment in a way that other forms of data representation do not. When people share visualizations, they tend to show either their understandable failures or the final polished product meant for mass consumption. I’ve not heard a lot of conversation about using big data, its curation, and visualization generation as/and method, but maybe I’m not in the right circles? Still, I think until we are willing to share the various steps along the way to turning big data into meaningful bits, or we create an easy-to-use toolkit for the next generation of big data visualizations, we will continue to all be hacking at the same problem, ending at different points, without coming to a meaningful conclusion other than “isn’t big data beautiful?”
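For a sense of what “taking in petabytes at once” looks like in miniature, here is a sketch using a hexbin density plot, which compresses a million points into a glanceable grid; the data is synthetic, standing in for any large two-dimensional slice of a big dataset.

```python
# A minimal sketch of visualization as pattern-finding: a hexbin plot bins
# a million points into density cells one can take in at a glance.
# The data is synthetic; the "hidden" correlation is planted deliberately.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
x = rng.normal(0, 1, 1_000_000)
y = 0.5 * x + rng.normal(0, 1, 1_000_000)  # the pattern waiting to be "discovered"

fig, ax = plt.subplots()
hb = ax.hexbin(x, y, gridsize=60, bins="log")  # log scale keeps dense cells readable
fig.colorbar(hb, label="log10(count)")
ax.set_title("One million points, summarized")
plt.show()
```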

The Hidden Biases in Big Data

[Image credit: Harvard Business Review]

SMC Principal Researcher Kate Crawford reached the number-one slot on the “Most Read” list of the Harvard Business Review this week with her sharp and insightful blog post on the weaknesses of big data.

Debunking the commonly held belief that “numbers speak for themselves” in large data sets, Kate brings the voice of reason to utopian and determinist claims that reams of “raw” data are the solution for a multitude of societal ills:

Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves.

Kate goes on to argue that while they may seem abstract, data sets are “intricately linked to physical place and human culture”, and that both qualitative methods and computational social science will need to join forces in order to fulfill the true potential of big data science: “data with depth”.

To read the full piece, click here.

Six Provocations for Big Data

The era of “Big Data” has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and many others are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions. Diverse groups argue about the potential benefits and costs of analyzing information from Twitter, Google, Verizon, 23andMe, Facebook, Wikipedia, and every space where large groups of people leave digital traces and deposit data. Significant questions emerge. Will large-scale analysis of DNA help cure diseases? Or will it usher in a new wave of medical inequality? Will data analytics help make people’s access to information more efficient and effective? Or will it be used to track protesters in the streets of major cities? Will it transform how we study human communication and culture, or narrow the palette of research options and alter what ‘research’ means? Some or all of the above?

Kate Crawford and I decided to sit down and interrogate some of the assumptions and biases embedded in the rhetoric surrounding “Big Data.” The resulting piece – “Six Provocations for Big Data” – offers a multi-disciplinary social analysis of the phenomenon with the goal of sparking a conversation. The paper will be presented as a keynote address at the Oxford Internet Institute’s 10th Anniversary “A Decade in Internet Time” Symposium.

Feedback is more than welcome!