Discourse Matters: Designing better digital futures

A very similar version of this blog post originally appeared in Culture Digitally on June 5, 2015.

Words Matter. As I write this in June 2015, a United Nations committee in Bonn is occupied in the massive task of editing a document overviewing global climate change. The effort to reduce 90 pages into a short(er), sensible, and readable set of facts and positions is not just a matter of editing but a battle among thousands of stakeholders and political interests, dozens of languages, and competing ideas about what is real and therefore, what should or should not be done in response to this reality.


I think about this as I complete a visiting fellowship at Microsoft Research, where over a thousand researchers worldwide study complex world problems and focus on advancing state-of-the-art computing. In such research environments the distance between one’s work and the design of the future can feel quite small. Here, I feel like our everyday conversations and playful interactions on whiteboards have the potential to actually impact what counts as the cutting edge and what might get designed at some future point.

But in less overtly “future making” contexts, our everyday talk still matters, in that words construct meanings, which over time and usage become taken-for-granted ways of thinking about the way the world works. These habits of thought, writ large, shape and delimit social action, organizations, and institutional structures.

In an era of web 2.0, networked sociality, constant connectivity, smart devices, and the internet of things (IoT), how does everyday talk shape our relationship to technology, or our relationships to each other? If the theory of social construction is really a thing, are we constructing the world we really want? Who gets to decide the shape of our future? More importantly, how does everyday talk construct, feed, or resist larger discourses?

Rhetoric as world-making

From a discourse-centered perspective, rhetoric is not a label for politically loaded or bombastic communication practices, but rather, a consideration of how persuasion works. Reaching back to the most classic notions of rhetoric from ancient Greek philosopher Aristotle, persuasion involves a mix of logical, emotional, and ethical appeals, which have no necessary connection to anything that might be sensible, desirable, or good to anyone, much less a majority. Persuasion works whether or not we pay attention. Rhetoric can be a product of deliberation or effort, but it can also function without either.

When we represent the techno-human or socio-technical relation through words and images, these representations function rhetorically. World making is inherently discursive at some level. And if making is about changing, this process inevitably involves some effort to influence how people describe, define, respond to, or interact with/in actual contexts of lived experience.

I have three sisters, each involved as I am in world-making, if such a descriptive phrase can be applied to the everyday acts of inquiry that prompt change in socio-technical contexts. Cathy is an organic gardener who spends considerable time improving techniques for increasing her yield each year.  Louise is a project manager who designs new employee orientation programs for a large IT company. Julie is a biochemist who studies fish in high elevation waterways.

Perhaps they would not describe themselves as researchers, designers, or even makers. They’re busy carrying out their job or avocation. But if I think about what they’re doing from the perspective of world-making, they are all three, plus more. They are researchers, analyzing current phenomena. They are designers, building and testing prototypes for altering future behaviors. They are activists, putting time and energy into making changes that will influence future practices.

Their work is alternately physical and cognitive, applied for distinct purposes, targeted to very different types of stakeholders.  As they go about their everyday work and lives, they are engaged in larger conversations about what matters, what is real, or what should be changed.

Everyday talk is powerful not just because it has remarkable potential to persuade others to think and act differently, but also because it operates in such unremarkable ways. Most of us don’t recognize that we’re shaping social structures when we go about the business of everyday life. Sure, a single person’s actions can become globally notable, but most of the time, any small action such as a butterfly flapping its wings in Michigan is difficult to link to a tsunami halfway around the world. But whether or not direct causality can be identified, there is a tipping point where individual choices become generalized categories. Where a playful word choice becomes a standard term in the OED. Where habitual ways of talking become structured ways of thinking.

The power of discourse: Two examples

I mention two examples that illustrate the power of discourse to shape how we think about social media, our relationship to data, and our role in the larger political economies of internet related activities. These cases are selected because they cut across different domains of digital technological design and development. I develop these cases in more depth here and here.

‘Sharing’ versus ‘surfing’

The case of ‘sharing’ illustrates how a term for describing our use of technology (using, surfing, or sharing) can influence the way we think about the relationship between humans and their data, or the rights and responsibilities of various stakeholders involved in these activities. In this case, regulatory and policy frameworks have shifted the burden of responsibility from governmental or corporate entities to individuals. This may not be directly caused by the rise in the use of the term ‘sharing’ as the primary description of what happens in social media contexts, but this term certainly reinforces a particular framework that defines what happens online. When this term is adopted on a broad scale and taken for granted, it functions invisibly, at deep structures of meaning. It can seem natural to believe that when we decide to share information, we should accept responsibility for our action of sharing it in the first place.

It is easy to accept the burden for protecting our own privacy when we accept the idea that we are ‘sharing’ rather than doing something else. The following comment seems sensible within this structure of meaning: “If you didn’t want your information to be public, you shouldn’t have shared it in the first place.”  This explanation is naturalized, but is not the only way of seeing and describing this event. We could alternately say we place our personal information online like we might place our wallet on the table. When someone else steals it, we’d likely accuse the thief of wrongdoing rather than the innocent victim who trusted that their personal belongings would be safe.

A still different frame might characterize personal information as an extension of the body or even a body part, rather than an object or possession. Within this definition, disconnecting information from the person would be tantamount to cutting off an arm. As with the definition of the wallet above, accountability for the action would likely be placed on the shoulders of the ‘attacker’ rather than the individual who lost a finger or ear.

‘Data’ and quantification of human experience

With the rise of big data, we have entered (or some would say returned to) an era of quantification. Here, the trend is to describe and conceptualize all human activity as data: discrete units of information that can be collected and analyzed. Such discourse collapses and reduces human experience. Dreams are equated with body weight; personality is something that can be categorized with the same statistical clarity as diabetes.

The trouble of using data as the baseline unit of information is that it presents an imaginary of experience that is both impoverished and oversimplified. This conceptualization is coincidental, of course, in that it coincides with the focus on computation as the preferred mode of analysis, which is predicated on the ability to collect massive quantities of digital information from multiple sources, which can only be measured through certain tools.

“Data” is a word choice, not an inevitable nomenclature. This choice has consequences from the micro to the macro, from the cultural to the ontological. This is the case because we’ve transformed life into arbitrarily defined pieces, which replace the flow of lived experience with information bits. Computational analytics makes calculations based on these information bits. This matters, in that such datafication focuses attention on that which exists as data and ignores what is outside this configuration. Indeed, data has become a frame for that which is beyond argument because it always exists, no matter how it might be interpreted (a point well developed by many including Daniel Rosenberg in his essay Data before the fact).

We can see a possible outcome of such framing in the emerging science and practice of “predictive policing.” This rapidly growing strategy in large metropolitan cities is a powerful example of how computation of tiny variables in huge datasets can link individuals to illegal behaviors. The example grows somewhat terrifying when we realize these algorithms are used to predict what is likely to occur, rather than to simply calculate what has occurred. Such predictions are based on data compiled from local and national databases, focusing attention on only those elements of human behavior that have been captured in these data sets (for more on this, see the work of Sarah Brayne).

We could alternately conceptualize human experience as a river that we can only step in once, because it continually changes as it flows through time-space. In such a Heraclitean characterization, we might then focus more attention on the larger shape and ecology of the river rather than trying to capture the specificities of the moment when we stepped into it.

Likewise, describing behavior in terms of the chemical processes in the brain, or in terms of the encompassing political situation within which it occurs will focus our attention on different aspects of an individual’s behavior or the larger situation to which or within which this behavior responds. Each alternative discourse provokes different ways of seeing and making sense of a situation.

When we stop to think about it, we know these symbolic interactions matter. Gareth Morgan’s classic work about metaphors of organization emphasizes how the frames we use will generate distinctive perspectives and, more importantly, distinctive structures for organizing social and workplace activities.  We might reverse engineer these structures to find a clash of rivaling symbols, only some of which survive to define the moment and create future history. Rhetorical theorist Kenneth Burke would talk about these symbolic frames as myths. In a 1935 speech to the American Writers’ Congress he notes that:

“myth” is the social tool for welding the sense of interrelationship by which [we] can work together for common social ends. In this sense, a myth that works well is as real as food, tools, and shelter are.

These myths do not just function ideologically in the present tense. As they are embedded in our everyday ways of thinking, they can become naturalized principles upon which we base models, prototypes, designs, and interfaces.

Designing better discourses

How might we design discourse to try to intervene in the shape of our future worlds? Of course, we can address this question as critical and engaged citizens. We are all researchers and designers involved in the everyday processes of world-making. Each of us, in our own way, is produsing the ethics that will shape our future.

This is a critical question for interaction and platform designers, software developers, and data scientists. In our academic endeavors, the impact of our efforts may or may not seem consequential on any grand scale. The outcome of our actions may have nothing to do with what we thought or desired from the outset. Surely, the butterfly neither intends nor desires to cause a tsunami.

butterfly effect comic
Image by J. L. Westover

Still, it’s worth thinking about. What impact do we have on the larger world? And should we be paying closer attention to how we’re ‘world-making’ as we engage in the mundane, the banal, the playful? When we consider the long future impact of our knowledge producing practices, or the way that technological experimentation is actualized, the answer is an obvious yes.  As Laura Watts notes in her work on future archeology:

futures are made and fixed in mundane social and material practice: in timetables, in corporate roadmaps, in designers’ drawings, in standards, in advertising, in conversations, in hope and despair, in imaginaries made flesh.

It is one step to notice these social construction processes. The challenge then shifts to one of considering how we might intervene in our own and others’ processes, anticipate future causality, turn a tide that is not yet apparent, and try to impact what we might become.

Acknowledgments and references

Notably, the position I articulate here is not new or unique, but another variation on a long running theme of critical scholarship, which is well represented by members of the Social Media Collective. I am also indebted to a long list of feminist and critical scholarship.  This position statement is based on my recent interests and concerns about social media platform design, the role of self-learning algorithmic logics in digital culture infrastructures, and the ethical gaps emerging from rapid technological development. It derives from my previous work in digital identity, ethnographic inquiry of user interfaces and user perceptions, and recent work training participants to use auto-ethnographic and phenomenological techniques to build reflexive critiques of their lived experience in digital culture. There are, truly, too many sources and references to list here, but as a short list of what I directly mentioned:

Kenneth L. Burke. 1935. Revolutionary symbolism in America. Speech to the American Writers’ Congress, February 1935. Reprinted in The Legacy of Kenneth Burke. Herbert W. Simons and Trevor Melia (eds). Madison: U of Wisconsin Press, 1989. Retrieved 2 June 2015 from: http://parlormultimedia.com/burke/sites/default/files/Burke-Revolutionary.pdf

Annette N. Markham. Forthcoming. From using to sharing: A story of shifting fault lines in privacy and data protection narratives. In Digital Ethics (2nd ed). Baastian Vanaker, Donald Heider (eds). Peter Lang Press, New York. Final draft available in PDF here

Annette N. Markham. 2014. Undermining data: A critical examination of a core term in scientific inquiry. First Monday, 18(10).

Gareth Morgan. 1986. Images of Organization. Sage Publications, Thousand Oaks, CA.

Daniel Rosenberg. 2013. Data before the fact. In ‘Raw data’ is an oxymoron. Lisa Gitelman (ed). Cambridge, Mass.: MIT Press, pp. 15–40.

Laura Watts. 2015. Future archeology: Re-animating innovation in the mobile telecoms industry. In Theories of the mobile internet: Materialities and imaginaries. Andrew Herman, Jan Hadlaw, Thom Swiss (Eds). Routledge Press,

A Research Agenda for Accountable Algorithms

What should people who are interested in accountability and algorithms be thinking about? Here is one answer: My eleven-minute remarks are now online from a recent event at NYU. I’ve edited them to intersperse my slides.

This talk was partly motivated by the ethics work being done in the machine learning community. That is very exciting and interesting work and I love, love, love it. My remarks are an attempt to think through the other things we might also need to do. Let me know how to replace the “??” in my slides with something more meaningful!

Preview: My remarks contain a minor attempt at a Michael Jackson joke.

 

 

A number of fantastic Social Media Collective people were at this conference — you can hear Kate Crawford in the opening remarks.  For more videos from the conference, see:

Algorithms and Accountability
http://www.law.nyu.edu/centers/ili/algorithmsconference

Thanks to Joris van Hoboken, Helen Nissenbaum and Elana Zeide for organizing such a fab event.

If you bought this 11-minute presentation you might also buy: Auditing Algorithms, a forthcoming workshop at Oxford.

http://auditingalgorithms.wordpress.com

 

 

(This was cross-posted to multicast.)

The Facebook “It’s Not Our Fault” Study

Today in Science, members of the Facebook data science team released a provocative study about adult Facebook users in the US “who volunteer their ideological affiliation in their profile.” The study “quantified the extent to which individuals encounter comparatively more or less diverse” hard news “while interacting via Facebook’s algorithmically ranked News Feed.”*

  • The research found that the user’s click rate on hard news is affected by the positioning of the content on the page by the filtering algorithm. The same link placed at the top of the feed is about 10-15% more likely to get a click than a link at position #40 (figure S5).
  • The Facebook news feed curation algorithm, “based on many factors,” removes hard news from diverse sources that you are less likely to agree with but it does not remove the hard news that you are likely to agree with (S7). They call news from a source you are less likely to agree with “cross-cutting.”*
  • The study then found that the algorithm filters out 1 in 20 cross-cutting hard news stories that a self-identified conservative sees (or 5%) and 1 in 13 cross-cutting hard news stories that a self-identified liberal sees (8%).
  • Finally, the research then showed that “individuals’ choices about what to consume” further limits their “exposure to cross-cutting content.” Conservatives will click on a little less than 30% of cross-cutting hard news (corrected from 17%), while liberals will click on a little more than 20% (corrected from 7%) (figure 3).
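
As a quick check on the filtering figures in the third bullet, this short Python sketch converts the “1 in N” rates to percentages. The helper name is made up for illustration and is not from the study.

```python
def one_in_n_to_percent(n):
    """Convert a '1 in n' rate to a percentage."""
    return 100.0 / n

# Cross-cutting hard news stories filtered out, as summarized above.
conservative_pct = one_in_n_to_percent(20)  # self-identified conservatives
liberal_pct = one_in_n_to_percent(13)       # self-identified liberals

print(f"1 in 20 = {conservative_pct:.1f}%")
print(f"1 in 13 = {liberal_pct:.1f}%")
```

The second figure comes out a little under 8%, which is why it is reported as 8% after rounding.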

My interpretation in three sentences:

  1. We would expect that people who are given the choice of what news they want to read will select sources they tend to agree with–more choice leads to more selectivity and polarization in news sources.
  2. Increasing political polarization is normatively a bad thing.
  3. Selectivity and polarization are happening on Facebook, and the news feed curation algorithm acts to modestly accelerate selectivity and polarization.

I think this should not be hugely surprising. For example, what else would a good filter algorithm be doing other than filtering for what it thinks you will like?

But what’s really provocative about this research is the unusual framing. This may go down in history as the “it’s not our fault” study.

Facebook: It’s not our fault.

I carefully wrote the above based on my interpretation of the results. Now that I’ve got that off my chest, let me tell you about how the Facebook data science team interprets these results. To start, my assumption was that news polarization is bad.  But the end of the Facebook study says:

“we do not pass judgment on the normative value of cross-cutting exposure”

This is strange, because there is a wide consensus that exposure to diverse news sources is foundational to democracy. Scholarly research about social media has–almost universally–expressed concern about the dangers of increasing selectivity and polarization. But it may be that you do not want to say that polarization is bad when you have just found that your own product increases it. (Modestly.)

And the sources cited just after this quote sure do say that exposure to diverse news sources is important. But the Facebook authors write:

“though normative scholars often argue that exposure to a diverse ‘marketplace of ideas’ is key to a healthy democracy (25), a number of studies find that exposure to cross-cutting viewpoints is associated with lower levels of political participation (22, 26, 27).”

So the authors present reduced exposure to diverse news as a “could be good, could be bad” but that’s just not fair. It’s just “bad.” There is no gang of political scientists arguing against exposure to diverse news sources.**

The Facebook study says it is important because:

“our work suggests that individuals are exposed to more cross-cutting discourse in social media than they would be under the digital reality envisioned by some”

Why so defensive? If you look at what is cited here, this quote is saying that this study showed that Facebook is better than a speculative dystopian future.*** Yet the people referred to by this word “some” didn’t provide any sort of point estimates that were meant to allow specific comparisons. On the subject of comparisons, the study goes on to say that:

“we conclusively establish that…individual choices more than algorithms limit exposure to attitude-challenging content.”

“compared to algorithmic ranking, individuals’ choices about what to consume had a stronger effect”

Alarm bells are ringing for me. The tobacco industry might once have funded a study that says that smoking is less dangerous than coal mining, but here we have a study about coal miners smoking. Probably while they are in the coal mine. What I mean to say is that there is no scenario in which “user choices” vs. “the algorithm” can be traded off, because they happen together (Fig. 3 [top]). Users select from what the algorithm already filtered for them. It is a sequence.**** I think the proper statement about these two things is that they’re both bad — they both increase polarization and selectivity. As I said above, the algorithm appears to modestly increase the selectivity of users.
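
To make the “it is a sequence” point concrete, here is a minimal Python sketch. Every number in it is a hypothetical round figure chosen for illustration, not a value from the study, and the function name is invented. The only thing it is meant to show is that the choice stage operates on the output of the filter stage, never on the original pool.

```python
from fractions import Fraction

def feed_pipeline(candidate_stories, algorithm_keep_rate, user_click_rate):
    """Two stages in sequence: the algorithm filters first, then the
    user chooses from what is left. Returns (shown, clicked) counts."""
    shown = candidate_stories * algorithm_keep_rate  # stage 1: algorithmic filtering
    clicked = shown * user_click_rate                # stage 2: user choice, applied to stage-1 output only
    return shown, clicked

# Hypothetical inputs (assumptions, not the study's data).
candidates = 1000
keep_rate = Fraction(19, 20)  # the filter removes 1 story in 20
click_rate = Fraction(1, 4)   # the user clicks a quarter of what is shown

shown, clicked = feed_pipeline(candidates, keep_rate, click_rate)
never_offered = candidates - shown

print(f"shown: {shown}, clicked: {float(clicked)}, never offered: {never_offered}")
```

In this toy setup the user's quarter is a quarter of 950, not of 1,000. The 50 filtered stories were never available to be chosen or rejected, which is why the two stages cannot be weighed against each other as if they were independent alternatives.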

The only reason I can think of that the study is framed this way is as a kind of alibi. Facebook is saying: It’s not our fault! You do it too!

Are we the 4%?

In my summary at the top of this post, I wrote that the study was about people “who volunteer their ideological affiliation in their profile.” But the study also describes itself by saying:

“we utilize a large, comprehensive dataset from Facebook.”

“we examined how 10.1 million U.S. Facebook users interact”

These statements may be factually correct but I found them to be misleading. At first, I read this quickly and I took this to mean that out of the at least 200 million Americans who have used Facebook, the researchers selected a “large” sample that was representative of Facebook users, although this would not be representative of the US population. The “limitations” section discusses the demographics of “Facebook’s users,” as would be the normal thing to do if they were sampled. There is no information about the selection procedure in the article itself.

Instead, after reading down in the appendices, I realized that “comprehensive” refers to the survey research concept: “complete,” meaning that this was a non-probability, non-representative sample that included everyone on the Facebook platform. But out of hundreds of millions, we ended up with a study of 10.1m because users were excluded unless they met these four criteria:

  1. “18 or older”
  2. “log in at least 4/7 days per week”
  3. “have interacted with at least one link shared on Facebook that we classified as hard news”
  4. “self-report their ideological affiliation” in a way that was “interpretable”

That #4 is very significant. Who reports their ideological affiliation on their profile?

add your political views

It turns out that only 9% of Facebook users do that. Of those that report an affiliation, only 46% reported an affiliation in a way that was “interpretable.” That means this is a study about the 4% of Facebook users unusual enough to want to tell people their political affiliation on the profile page. That is a rare behavior.
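
The 4% figure is simply the product of the two percentages above. A small Python sketch of that arithmetic, with variable names invented for illustration and the two figures taken at face value:

```python
share_reporting_affiliation = 0.09  # users who list any political views on their profile
share_interpretable = 0.46          # of those, affiliations the researchers could interpret

share_in_study_pool = share_reporting_affiliation * share_interpretable
print(f"{share_in_study_pool:.1%}")
```

This considers only criterion #4. The other three criteria can only shrink the pool further.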

More important than the frequency, though, is the fact that this selection procedure confounds the findings. We would expect a small minority who publicly identify an interpretable political orientation to behave quite differently from the average person with respect to consuming ideological political news.  The research claims just don’t stand up against the selection procedure.

But the study is at pains to argue that (emphasis mine, marked with underscores):

“we conclusively establish that _on average in the context of Facebook_, individual choices more than algorithms limit exposure to attitude-challenging content.”

The emphasized portion is incorrect because the appendices explain that this is actually a study of a specific, unusual group of Facebook users. The study is designed in such a way that the selection for inclusion in the study is related to the results. (“Conclusively” therefore also feels out of place.)

Algorithmium: A Natural Element?

Last year there was a tremendous controversy about Facebook’s manipulation of the news feed for research. In the fracas it was revealed by one of the controversial study’s co-authors that based on the feedback received after the event, many people didn’t realize that the Facebook news feed was filtered at all. We also recently presented research with similar findings.

I mention this because when the study states it is about selection of content, who does the selection is important. There is no sense in this study that a user who chooses something is fundamentally different from the algorithm hiding something from them. While in fact the filtering algorithm is driven by user choices (among other things), users don’t understand the relationship that their choices have to the outcome.

not sure if i hate facebook or everyone i know

In other words, the article’s strange comparison between “individual’s choices” and “the algorithm” should be read as “things I choose to do” vs. the effect of “a process Facebook has designed without my knowledge or understanding.” Again, they can’t be compared in the way the article proposes because they aren’t equivalent.

I struggled with the framing of the article because the research talks about “the algorithm” as though it were an element of nature, or a naturally occurring process like convection or mitosis. There is also no sense that it changes over time or that it could be changed intentionally to support a different scenario.*****

Facebook is a private corporation with a terrible public relations problem. It is periodically rated one of the least popular companies in existence. It is currently facing serious government investigations into illegal practices in many countries, some of which stem from the manipulation of its news feed algorithm. In this context, I have to say that it doesn’t seem wise for these Facebook researchers to have spun these data so hard in this direction, which I would summarize as: the algorithm is less selective and less polarizing. Particularly when the research finding in their own study is actually that the Facebook algorithm is modestly more selective and more polarizing than living your life without it.

Update: (6pm Eastern)

Wow, if you think I was critical have a look at these. It turns out I am the moderate one.

Eszter Hargittai from Northwestern posted on Crooked Timber that we should “stop being mesmerized by large numbers and go back to taking the fundamentals of social science seriously.” And (my favorite): “I thought Science was a serious peer-reviewed publication.”

Nathan Jurgenson from Maryland and Snapchat wrote on Cyborgology (“in a fury”) that Facebook is intentionally “evading” its own role in the production of the news feed. “Facebook cannot take its own role in news seriously.” He accuses the authors of using the “Big-N trick” to intentionally distract from methodological shortcomings. He tweeted that “we need to discuss how very poor corporate big data research gets fast tracked into being published.”

Zeynep Tufekci from UNC wrote on Medium that “I cannot remember a worse apples to oranges comparison” and that the key take-away from the study is actually the ordering effects of the algorithm (which I did not address in this post). “Newsfeed placement is a profoundly powerful gatekeeper for click-through rates.”

Update: (5/10)

A comment helpfully pointed out that I used the wrong percentages in my fourth point when summarizing the piece. Fixed it, with changes marked.

Update: (5/15)

It’s now one week since the Science study. This post has now been cited/linked in The New York Times, Fortune, Time, Wired, Ars Technica, Fast Company, Engadget, and maybe even a few more. I am still getting emails. The conversation has fixated on the <4% sample, often saying something like: “So, Facebook said this was a study about cars, but it was actually only about blue cars.” That’s fine, but the other point in my post is about what is being claimed at all, no matter the sample.

I thought my “coal mine” metaphor about the algorithm would work but it has not always worked. So I’ve clamped my Webcam to my desk lamp and recorded a four-minute video to explain it again, this time with a drawing.******

If the coal mine metaphor failed me, what would be a better metaphor? I’m not sure. Suggestions?

 

 

Notes:

* Diversity in hard news, in their study, would be a self-identified liberal who receives a story from FoxNews.com, or a self-identified conservative who receives one from the HuffingtonPost.com, where the stories are about “national news, politics, [or] world affairs.” In more precise terms, for each user “cross-cutting content” was defined as stories that are more likely to be shared by partisans who do not have the same self-identified ideological affiliation that you do.

** I don’t want to make this even more nitpicky, so I’ll put this in a footnote. The paper’s use of Mutz and Huckfeldt et al. to support the claim that “exposure to cross-cutting viewpoints is associated with lower levels of political participation” is just bizarre. I hope it is a typo. These authors don’t advocate against exposure to cross-cutting viewpoints.

*** Perhaps this could be a new Facebook motto used in advertising: “Facebook: Better than one speculative dystopian future!”

**** In fact, algorithm and user form a coupled system of at least two feedback loops. But that’s not helpful to measure “amount” in the way the study wants to, so I’ll just tuck it away down here.

***** Facebook is behind the algorithm but they are trying to peer-review research about it without disclosing how it works — which is a key part of the study. There is also no way to reproduce the research (or do a second study on a primary phenomenon under study, the algorithm) without access to the Facebook platform.

****** In this video, I intentionally conflate (1) the number of posts filtered and (2) the magnitude of the bias of the filtering. I did so because the difficulty with the comparison works the same way for both, and I was trying to make the example simpler. Thanks to Cedric Langbort for pointing out that “baseline error” is the clearest way of explaining this.

(This was cross-posted to multicast and Wired.)

Using Off-the-shelf Software for basic Twitter Analysis

Mary Gray, Mike Ananny and I are writing a paper on queer youth and “Glee” for the American Anthropological Association’s annual meeting (yes, I have the greatest job in the world). This is a multi-methodological study by design, because traditional television viewing practices have become so complex. Besides traditional audience ethnography like interviews and participant observation, we are using textual analysis to analyze episode themes, and we have collected a large corpus of tweets with Glee-related hashtags. This summer, I worked with my high school intern, Jazmin Gonzales-Rivero, to go through this corpus of tweets and pull out useful information for the paper.

We’ve written and published a basic report on using off-the-shelf tools to see patterns and themes in a large Twitter data set quickly and easily.

Abstract:

With the increasing popularity of large social software applications like Facebook and Twitter, social scientists and computer scientists have begun developing innovative approaches to dealing with the vast amounts of data produced and collected in such environments. For qualitative researchers, the methods involved can be daunting and unfamiliar. In this report, we outline some basic procedures for working with a large-scale Twitter data set to answer qualitative inquiries. We use Python, MySQL, and the word-cloud generator Wordle to identify patterns in re-tweets, tweet authors, dates and times of tweets, frequency of hashtags, and frequency of word use. Such data can provide valuable augmentation to qualitative inquiry. This paper is aimed at social scientists and humanities scholars with limited experience with big data and a lack of computing resources to do extensive quantitative research.

Citation:
Marwick, A. and Gonzales-Rivero, J. (2011). Learning to Work with Large-Scale Twitter Data Sets: Using Off-The-Shelf Tools to Quickly and Easily See Tweet Patterns. Microsoft Research Social Media Collective Report, MSR-SMC-11-01, Cambridge, MA. [Download as PDF]

If you’re a seasoned computer scientist or a Big Data aficionado, the information in this paper will seem quite simplistic. But for those of us without programming backgrounds who study Twitter or other forms of social media, the idea of tackling a set of 450,000 tweets can seem quite daunting. In this paper, Jazmin and I walk step-by-step through the methods she used to parse a set of Tweets, using free and easily accessible tools like MySQL, Python, and Wordle. We hope this will be helpful for other legal, humanities, and social science scholars who might want to dip their foot into Big Data to augment more qualitative research findings.
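
To give a flavor of the kind of counting involved, here is a minimal Python sketch that tallies hashtags and re-tweets. It is not the code from the report: the sample tweets are made up, the function names are invented for illustration, and treating any tweet that starts with “RT @” as a re-tweet is a simplifying assumption.

```python
import re
from collections import Counter

# Made-up sample tweets for illustration; a real corpus would be
# loaded from a file or a database instead.
tweets = [
    "Can't wait for tonight! #glee #gleek",
    "RT @someuser: Best episode yet #Glee",
    "Rewatching the finale again #glee",
]

HASHTAG_PATTERN = re.compile(r"#(\w+)")

def count_hashtags(texts):
    """Count hashtags across tweets, ignoring case."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG_PATTERN.findall(text))
    return counts

def count_retweets(texts):
    """Count tweets that begin with the conventional 'RT @' prefix."""
    return sum(1 for text in texts if text.startswith("RT @"))

hashtag_counts = count_hashtags(tweets)
print(hashtag_counts.most_common(2))
print(count_retweets(tweets))
```

Lower-casing the tags before counting means “#Glee” and “#glee” are tallied together, which is usually what you want when looking for themes rather than exact strings.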

Citation: