The OKCupid data release fiasco: It’s time to rethink ethics education

In mid-2016, we confront another ethical crisis related to personal data, social media, the public internet, and social research. This time, it’s a release of some 70,000 OKCupid users’ data, including some very intimate details about individuals. Responses from several communities of practice highlight the complications of applying outdated modes of thinking about ethics and human subjects to new opportunities for research through publicly accessible or otherwise easily obtained datasets (e.g., Michael Zimmer produced a thoughtful response in Wired, and Kate Crawford pointed us to her recent work with Jacob Metcalf on this topic). There are many things to talk about in this case, but here, I’d like to weigh in on conversations about how we might respond to this issue as university educators.

The OKCupid case is just the most recent in a long list of moments revealing that doing something because it is legal is no guarantee that it is ethical. To invoke Kate Crawford’s apt tweet from March 3, 2016:

This is a key point of confusion, apparently. Michael Zimmer, reviewing multiple cases of ethical problems that emerged when researchers released large datasets, emphasizes the flaw in this response, noting:

This logic of “but the data is already public” is an all-too-familiar refrain used to gloss over thorny ethical concerns (in Wired).

In the most recent case, the researcher in question, Emil Kirkegaard, uses this defense in response to questions asking whether he anonymized the data: “No. Data is already public.” I’d therefore like to add a line to Crawford’s simple advice:

Data comes from people. Displaying it for the world to see can cause harm.

A few days after this data was released, it was removed from the Open Science Framework following a DMCA claim by OKCupid. Further legal action could follow. All of this is a good step toward protecting the personal data of users, but in the meantime, many people have already downloaded the dataset and are now sharing it in other forms. As Scott Weingart, digital humanities specialist at Carnegie Mellon, warns:

As a longtime university educator, a faculty member at the same university where Kirkegaard is pursuing his master’s degree, and a researcher of digital ethics, I find this OKCupid affair frustrating: How is it possible that we continue to reproduce this logic, despite the many times “it’s publicly accessible, therefore I can do whatever I want with it” has proved harmful? We must attribute some responsibility to existing education systems. Of course, the problem doesn’t start there, and an “education system” can be a formal institution or simply the way we learn as everyday knowledge is passed around in various forms. So there are plenty of arenas where we learn (or fail to learn) to make good choices in situations fraught with ethical complexity. Let me offer a few trajectories of thought:

What data means to regulators

The myth that “data is already public, therefore ethically fine to use for whatever” persists because traditional as well as contemporary legal and regulatory statements still make a strong distinction between public and private. This is no longer a viable distinction, if it ever was. When we define actions or information as belonging to either the private or the public realm, we set up a false binary that holds in neither practice nor perception. Information is not a stable object that emerges in and remains located in a particular realm or sphere. Data becomes informative, or is noticed, only when it becomes salient for some reason. On OKCupid or elsewhere, people publish their picture, religious affiliation, or sexual preference in a dating profile as part of a performance of their identity for someone else to see. This placement of information is intended to be part of an expected pattern of interaction: someone is supposed to see and respond to this information, which might then spark a conversation or a relationship. This information is not chopped up into discrete units residing in either a public or a private realm. Rather, it is performative and relational. When we rely only on regulatory language, the subtleties of context are rendered invisible.

What data means to people who produce it

Whether information or data is experienced or felt as something public or private is quite different from the information itself. A violation of privacy can occur at any point; it stems not from the data itself but from the ways in which the data is used. From this standpoint, data can only logically exist as part of continual flows of time-space contexts; therefore, to extract data as points from one or the other static sphere is illogical. Put more simply, the expectation of privacy about one’s profile information comes into play when certain information is registered and becomes meaningful for others. Otherwise, the information would never enter into a context where ‘public’, ‘private’, ‘intimate’, ‘secret’, or any other adjective operates as a relevant descriptor.

This may not be the easiest idea to grasp, since we generally conceptualize data as static and discrete informational units that can be observed, collected, and analyzed. In experience, this is simply not true. How we treat personal data matters: it requires sensitivity to context as well as an understanding of the tools we can use to grapple with this complexity.

What good researchers know about data and ethics

Reflexive researchers know that regulations may be necessary, but they are insufficient guides for ethics. While many lessons from previous ethical breaches in scientific research find their way into regulatory guidelines or law, unique ethical dilemmas arise as a natural part of research into any phenomenon. According to the ancient Greeks, doing the right thing is a matter of phronesis, or practical wisdom, whereby one can discern what would constitute the most ethical choice in any situation, an ability that grows stronger with time, experience, and reflection.

This involves much more than simply following the rules or obeying the letter of the law. Phronesis is a very difficult thing to teach, since it is a skill that emerges from a deep understanding of the possible intimacy others have with what we outsiders might label ‘data.’ This reflection requires that we ask different questions than regulatory prescriptions might require. In addition to the default questions such as “Is the data public or private?” or “Does this research involve a ‘human subject’?”, we should be asking “What is the relationship between a person and her data?” or “How does the person feel about his relationship with his data?” These latter questions don’t generally appear in regulatory discussions about data or ethics. They represent contemporary issues that have emerged from digitization plus the internet, an equation that illustrates how information can be duplicated without limit and is swiftly and easily separated from its human origins once it disseminates or moves through the network. In a broader sense, this line of inquiry highlights the extent to which ‘data’ can be mischaracterized.

Where do we learn the ethic of accountability?

While many scholars concerned with data ethics discuss complex questions, the complexity doesn’t often end up in traditional classrooms or regulatory documents. We learn to ask the tough questions when complicated situations emerge, or when a problem or ethical dilemma arises. At this point, we may question and adjust our mindset. This is a process of continual reflexive interrogation of the choices we’re making as researchers. And we get better at it with time and practice.

We might be disappointed, but we shouldn’t be surprised, that so many people end up relying on the outdated logic that says ‘if data is publicly accessible, it is fair game for whatever we want to do with it’. This thinking is so much easier and quicker than the alternative, which involves not only judgment, responsibility, and accountability, but also speculation about the potential future impact of one’s research.

Learning contemporary ethics in a digitally saturated and globally networked epoch involves considering the potential impact of one’s decisions and then making the best choice possible. Regulators are well aware of this, which is why they (mostly) include exceptions and specific case guidance in statements about how researchers should treat data and conduct research involving human subjects.

Teaching ethics as ‘levels of impact’

So, how might we change the ways we talk and teach about ethics to better prepare researchers to take the extra step of reflecting on how their research choices matter in the bigger picture? First, we can make this an easier topic to broach by addressing ethics as being about choices we make at critical junctures, choices that will invariably have impact.

We make choices, consciously or unconsciously, throughout the research process. Simply stated, these choices matter. If we do not grapple with natural and necessary change in research practices our research will not reflect the complexities we strive to understand. — Annette Markham, 2003.

Ethics can thus be considered a matter of methods. “Doing the right thing” is an everyday activity, as we make multiple choices about how we might act. Our decisions and actions transform into habits, norms, and rules through time and repetition. Our choices carry consequences. As researchers, we carry more responsibility than users of social media platforms. Why? Because we hold more cards when we present the findings of studies and make knowledge statements intended to present some truth, big or little T, about the world to others.

To dismiss our everyday choices as guided only by extant guidelines is a naïve approach to how ethics are actually produced. Beyond our reactions to this specific situation, as Michael Zimmer emphasizes in his recent Wired article, we must address the conceptual muddles present in big data research.

This is quite a challenge when the terms are as muddled as the concepts. Take the word ‘ethics.’ Although it operates as an important foundation in our work as researchers, it is also abstract, vague, and daunting, because it can feel like you ought to have philosophy training to talk about it. As educators, we can lower the barrier to entry into ethical concepts by taking a ‘what if’ impact approach, or by discussing how we might assess the ‘creepy’ factor in our research design, data use, or technology development.

At the most basic level of an impact approach, we might ask how our methods of data collection directly impact humans. If one is interviewing, or the data is visibly connected to a person, this is easy to see. But a distance principle might help us recognize that when data is very distant from where it originated, it can seem disconnected from persons, or from what some regulators call ‘human subjects.’ At another level, we can ask how our methods of organizing data, our analytical interpretations, or our findings as shared datasets are being used — or might be used — to build definitional categories or to profile particular groups in ways that could impact livelihoods or lives. Are we contributing to positive or negative categorizations? At a third level of impact, we can consider the social, economic, or political changes caused by our research processes or products, in both the short and long term. These three levels raise different questions than those typically raised by ethics guidelines and regulations, because an impact approach is targeted toward possible or probable impact, rather than toward preventing impact in the first place. It acknowledges that we change the world as we conduct even the smallest of scientific studies, and that we must therefore take some personal responsibility for our methods.

Teaching questions rather than answers

Over the six years I spent writing guidelines for the updated ‘Ethics and decision making in internet research’ document for the Association of Internet Researchers (AoIR), I realized we had shifted significantly from statements to questions. This shift was driven in part by the fact that we came from many different traditions and countries and couldn’t reach consensus about what researchers should do. Yet we quickly found that posing questions provided the only stable anchor point as technologies, platforms, and uses of digital media continually changed. As situations and contexts shifted, different ethical problems would arise. This seemingly endless variation required us to reconsider how we think about ethics and how we might guide researchers seeking advice. While some general ethical principles could be considered in advance, best practices emerged through rigorous self-questioning throughout the course of a study, from the outset to well after the research was completed. Questions were also a form that allowed us to emphasize the importance of active and conscious decision-making, rather than more passive adherence to legal, regulatory, or disciplinary norms.

A question-based approach emphasizes that ethical research is a continual and iterative process of both direct and tacit decision making that must be brought to the surface and consciously accounted for throughout a project. This process of questioning is most obvious when the situation or direction is unclear and decisions must be made directly. But when the questions as well as answers are embedded in and produced as part of our habits, these must be recognized for what they once were — choices at critical junctures. Then, rather than simply adopting tools as predefined options, or taking analytical paths dictated by norm or convention, we can choose anew.

This recent case of the OKCupid data release provides an opportunity for educators to revisit our pedagogical approaches and to confront this confusion head-on. It’s a call to think about options that reach into the heart of the matter, which means adding something to our discussions with junior researchers to counteract the depersonalizing effects of generalized top-down requirements, forms with checklists, and standardized (and therefore seemingly irrelevant) online training modules.

  • This involves questioning as well as presenting extant ethical guidelines, so that students understand more about the controversies and ongoing debates behind the scenes as laws and regulations are developed.
  • It demands that we stop treating IRB or ethics board requirements as bureaucratic hoops to jump through, so that students can appreciate that in most studies, ethical questions require revisiting.
  • It means examining the assumptions underlying ethical conventions and reviewing debates about concepts like informed consent, anonymizing data, or human subjects, so that students better appreciate these as negotiable and context-dependent, rather than settled and universal concepts.
  • It involves linking ethics to everyday logistic choices made throughout a study, including how questions are framed, how studies are designed, and how data is managed and organized. In this way students can build a practice of reflection on and engagement around their research decisions as meaningful choices rather than externally prescribed procedures.
  • It asks that we understand ethics as they are embedded in broader methodological processes — perhaps by discussing how analytical categories can construct cultural definitions, how findings can impact livelihoods, or how writing choices and styles can invoke particular versions of stories. In this way, students can understand that their decisions carry over into other spheres and can have unintended or unanticipated results.
  • It requires adding positive examples to the typically negative cases, which tend to describe what we should not do, or how we can get in trouble. In this way, students can consider the (good and important) ethics of conducting research that is designed to make actual and positive transformations in the broader world.

This list is intended to spark imagination and conversation more than to explain what’s currently happening (for that, I would point to Metcalf’s 2015 review of various pedagogical approaches to ethics in the U.S.). There are obviously many ways to address or respond to this most recent case, or any of the dozens of cases that pose ethical problems.

I, for one, will continue talking more in my classrooms about how, as researchers, our work can be perceived as creepy, stalking, or harassing; exploring how our research could cause harm in the short or long term; and considering what sort of futures we are facilitating as a result of our contributions in the here and now.

For more about data and ethics, I recommend the annual Digital Ethics Symposium at Loyola University Chicago; the growing body of work emerging from the Council for Big Data, Ethics, & Society; and the Association of Internet Researchers (AoIR) ethics documents and the work of their longstanding ethics committee members. For current discussions around how we conceptualize data in social research, one might look at special issues devoted to the topic, like the 2013 issue on Making Data: Big Data and Beyond in First Monday, or the 2014 issue on Critiquing Big Data in the International Journal of Communication. These are just the first works off the top of my head that have inspired my own thinking and research on these topics.

Discourse Matters: Designing better digital futures

A very similar version of this blog post originally appeared in Culture Digitally on June 5, 2015.

Words Matter. As I write this in June 2015, a United Nations committee in Bonn is occupied with the massive task of editing a document providing an overview of global climate change. The effort to reduce 90 pages to a short(er), sensible, and readable set of facts and positions is not just a matter of editing but a battle among thousands of stakeholders and political interests, dozens of languages, and competing ideas about what is real and, therefore, what should or should not be done in response to this reality.


I think about this as I complete a visiting fellowship at Microsoft Research, where over a thousand researchers worldwide study complex world problems and focus on advancing state-of-the-art computing. In such research environments, the distance between one’s work and the design of the future can feel quite small. Here, I feel like our everyday conversations and playful interactions on whiteboards have the potential to actually impact what counts as the cutting edge and what might get designed at some future point.

But in less overtly “future making” contexts, our everyday talk still matters, in that words construct meanings, which over time and usage become taken-for-granted ways of thinking about how the world works. These habits of thought, writ large, shape and delimit social action, organizations, and institutional structures.

In an era of Web 2.0, networked sociality, constant connectivity, smart devices, and the internet of things (IoT), how does everyday talk shape our relationship to technology, or our relationships with each other? If the theory of social construction is really a thing, are we constructing the world we really want? Who gets to decide the shape of our future? More importantly, how does everyday talk construct, feed, or resist larger discourses?

Rhetoric as world-making

From a discourse-centered perspective, rhetoric is not a label for politically loaded or bombastic communication practices but, rather, a consideration of how persuasion works. Reaching back to the most classic notions of rhetoric from the ancient Greek philosopher Aristotle, persuasion involves a mix of logical, emotional, and ethical appeals, which have no necessary connection to anything that might be sensible, desirable, or good to anyone, much less a majority. Persuasion works whether or not we pay attention. Rhetoric can be a product of deliberation or effort, but it can also function without either.

When we represent the techno-human or socio-technical relation through words and images, these representations function rhetorically. World-making is inherently discursive at some level. And if making is about changing, this process inevitably involves some effort to influence how people describe, define, respond to, or interact with/in actual contexts of lived experience.

I have three sisters, each involved as I am in world-making, if such a descriptive phrase can be applied to the everyday acts of inquiry that prompt change in socio-technical contexts. Cathy is an organic gardener who spends considerable time improving techniques for increasing her yield each year.  Louise is a project manager who designs new employee orientation programs for a large IT company. Julie is a biochemist who studies fish in high elevation waterways.

Perhaps they would not describe themselves as researchers, designers, or even makers. They’re busy carrying out their job or avocation. But if I think about what they’re doing from the perspective of world-making, they are all three, plus more. They are researchers, analyzing current phenomena. They are designers, building and testing prototypes for altering future behaviors. They are activists, putting time and energy into making changes that will influence future practices.

Their work is alternately physical and cognitive, applied for distinct purposes, targeted to very different types of stakeholders.  As they go about their everyday work and lives, they are engaged in larger conversations about what matters, what is real, or what should be changed.

Everyday talk is powerful not just because it has remarkable potential to persuade others to think and act differently, but also because it operates in such unremarkable ways. Most of us don’t recognize that we’re shaping social structures as we go about the business of everyday life. Sure, a single person’s actions can become globally notable, but most of the time, a small action such as a butterfly flapping its wings in Michigan is difficult to link to a tsunami halfway around the world. But whether or not direct causality can be identified, there is a tipping point where individual choices become generalized categories. Where a playful word choice becomes a standard term in the OED. Where habitual ways of talking become structured ways of thinking.

The power of discourse: Two examples

I offer two examples that illustrate the power of discourse to shape how we think about social media, our relationship to data, and our role in the larger political economies of internet-related activities. These cases are selected because they cut across different domains of digital technology design and development. I develop these cases in more depth here and here.

‘Sharing’ versus ‘surfing’

The case of ‘sharing’ illustrates how the term we use to describe our activity with technology (using, surfing, or sharing) can influence the way we think about the relationship between humans and their data, and about the rights and responsibilities of the various stakeholders involved in these activities. In this case, regulatory and policy frameworks have shifted the burden of responsibility from governmental or corporate entities to individuals. This shift may not be directly caused by the rise of ‘sharing’ as the primary description of what happens in social media contexts, but the term certainly reinforces a particular framework that defines what happens online. When this term is adopted on a broad scale and taken for granted, it functions invisibly, at deep structures of meaning. It can seem natural to believe that when we decide to share information, we should accept responsibility for the action of sharing it in the first place.

It is easy to accept the burden of protecting our own privacy when we accept the idea that we are ‘sharing’ rather than doing something else. The following comment seems sensible within this structure of meaning: “If you didn’t want your information to be public, you shouldn’t have shared it in the first place.” This explanation is naturalized, but it is not the only way of seeing and describing this event. We could alternately say we place our personal information online the way we might place a wallet on a table. When someone steals it, we’d likely accuse the thief of wrongdoing rather than blame the innocent victim who trusted that their personal belongings would be safe.

A still different frame might characterize personal information as an extension of the body, or even a body part, rather than an object or possession. Within this definition, disconnecting information from the person would be tantamount to cutting off an arm. As with the wallet example above, accountability for the action would likely be placed on the shoulders of the ‘attacker’ rather than the individual who lost a finger or an ear.

‘Data’ and quantification of human experience

With the rise of big data, we have entered (or, some would say, returned to) an era of quantification. Here, the trend is to describe and conceptualize all human activity as data—discrete units of information that can be collected and analyzed. Such discourse collapses and reduces human experience. Dreams are equated with body weight; personality becomes something that can be categorized with the same statistical clarity as diabetes.

The trouble with using data as the baseline unit of information is that it presents an imaginary of experience that is both impoverished and oversimplified. This conceptualization coincides, of course, with the focus on computation as the preferred mode of analysis, which is predicated on the ability to collect massive quantities of digital information from multiple sources, information that can only be measured through certain tools.

“Data” is a word choice, not an inevitable nomenclature. This choice has consequences from the micro to the macro, from the cultural to the ontological. It matters because we’ve transformed life into arbitrarily defined pieces, which replace the flow of lived experience with information bits, and computational analytics makes its calculations on these bits. Such datafication focuses attention on that which exists as data and ignores whatever falls outside this configuration. Indeed, data has become a frame for that which is beyond argument, because it always exists, no matter how it might be interpreted (a point well developed by many, including Daniel Rosenberg in his essay “Data before the fact”).

We can see a possible outcome of such framing in the emerging science and practice of “predictive policing.” This rapidly growing strategy in large metropolitan cities is a powerful example of how computation of tiny variables in huge datasets can link individuals to illegal behaviors. The example grows somewhat terrifying when we realize these algorithms are used to predict what is likely to occur, rather than simply to calculate what has occurred. Such predictions are based on data compiled from local and national databases, focusing attention on only those elements of human behavior that have been captured in these datasets (for more on this, see the work of Sarah Brayne).

We could alternately conceptualize human experience as a river that we can only step in once, because it continually changes as it flows through time-space. In such a Heraclitean characterization, we might then focus more attention on the larger shape and ecology of the river than on trying to capture the specificities of the moment when we stepped into it.

Likewise, describing behavior in terms of the chemical processes in the brain, or in terms of the encompassing political situation within which it occurs, will focus our attention on different aspects of an individual’s behavior or of the larger situation to which this behavior responds. Each alternative discourse provokes different ways of seeing and making sense of a situation.

When we stop to think about it, we know these symbolic interactions matter. Gareth Morgan’s classic work on metaphors of organization emphasizes how the frames we use generate distinctive perspectives and, more importantly, distinctive structures for organizing social and workplace activities. We might reverse-engineer these structures to find a clash of rival symbols, only some of which survive to define the moment and create future history. Rhetorical theorist Kenneth Burke would call these symbolic frames myths. In a 1935 speech to the American Writers’ Congress, he noted that:

“myth” is the social tool for welding the sense of interrelationship by which [we] can work together for common social ends. In this sense, a myth that works well is as real as food, tools, and shelter are.

These myths do not just function ideologically in the present tense. As they are embedded in our everyday ways of thinking, they can become naturalized principles upon which we base models, prototypes, designs, and interfaces.

Designing better discourses

How might we design discourse to intervene in the shape of our future worlds? Of course, we can address this question as critical and engaged citizens. We are all researchers and designers involved in the everyday processes of world-making. Each of us, in our own way, is produsing the ethics that will shape our future.

This is a critical question for interaction and platform designers, software developers, and data scientists. In our academic endeavors, the impact of our efforts may or may not seem consequential on any grand scale. The outcome of our actions may have nothing to do with what we thought or desired from the outset. Surely, the butterfly neither intends nor desires to cause a tsunami.

[Butterfly effect comic. Image by J. L. Westover]

Still, it’s worth thinking about. What impact do we have on the larger world? And should we be paying closer attention to how we’re ‘world-making’ as we engage in the mundane, the banal, the playful? When we consider the long-term impact of our knowledge-producing practices, or the way technological experimentation is actualized, the answer is an obvious yes. As Laura Watts notes in her work on future archeology:

futures are made and fixed in mundane social and material practice: in timetables, in corporate roadmaps, in designers’ drawings, in standards, in advertising, in conversations, in hope and despair, in imaginaries made flesh.

It is one step to notice these social construction processes. The challenge then shifts to one of considering how we might intervene in our own and others’ processes, anticipate future causality, turn a tide that is not yet apparent, and try to impact what we might become.

Acknowledgments and references

Notably, the position I articulate here is not new or unique, but another variation on a long-running theme of critical scholarship, which is well represented by members of the Social Media Collective. I am also indebted to a long list of feminist and critical scholars. This position statement is based on my recent interests and concerns about social media platform design, the role of self-learning algorithmic logics in digital culture infrastructures, and the ethical gaps emerging from rapid technological development. It derives from my previous work on digital identity, ethnographic inquiry into user interfaces and user perceptions, and recent work training participants to use autoethnographic and phenomenological techniques to build reflexive critiques of their lived experience in digital culture. There are, truly, too many sources and references to list here, but as a short list of what I directly mentioned:

Kenneth Burke. 1935. Revolutionary symbolism in America. Speech to the American Writers’ Congress, February 1935. Reprinted in The Legacy of Kenneth Burke. Herbert W. Simons and Trevor Melia (eds). University of Wisconsin Press, Madison, 1989. Retrieved 2 June 2015 from: http://parlormultimedia.com/burke/sites/default/files/Burke-Revolutionary.pdf

Annette N. Markham. Forthcoming. From using to sharing: A story of shifting fault lines in privacy and data protection narratives. In Digital Ethics (2nd ed). Bastiaan Vanacker and Don Heider (eds). Peter Lang Press, New York. Final draft available in PDF here.

Annette N. Markham. 2013. Undermining ‘data’: A critical examination of a core term in scientific inquiry. First Monday, 18(10).

Gareth Morgan. 1986. Images of Organization. Sage Publications, Thousand Oaks, CA.

Daniel Rosenberg. 2013. Data before the fact. In ‘Raw Data’ Is an Oxymoron. Lisa Gitelman (ed). MIT Press, Cambridge, MA, pp. 15–40.

Laura Watts. 2015. Future archeology: Re-animating innovation in the mobile telecoms industry. In Theories of the Mobile Internet: Materialities and Imaginaries. Andrew Herman, Jan Hadlaw, and Thom Swiss (eds). Routledge, New York.