Plotting Data: Acts of Collection and Omission

Although governments and service-providing companies rely more and more on algorithms for automated decision making,1 it is only when these technologies fail that we see them highlighted in the media. In The Ethnography of Infrastructure, Susan Leigh Star argues that “breakdown becomes the basis of a much more detailed understanding of the relational nature of infrastructure”2. In the moments in which machine learning algorithms show their weaknesses and biases, this infrastructural quality becomes visible.

A dataset is the material that a machine learning algorithm uses to learn. In order to critique the use of machine learning algorithms, we look to the education machines undertake in order to emulate certain human behaviours. Just as a human is shaped by what they learn, so can the actions of an algorithm be read through how it was trained. On matters of artificial intelligence (AI), we see the focus of critique shift from the algorithm to the dataset, as the inevitability of bias becomes ever more apparent3. This publication takes part in that shift.

Plotting Data discusses the role datasets play in contemporary automated decision making by examining datasets that are used as standards for algorithmic training. Alongside this, we have designed three interfaces that allow for unusual explorations of the datasets we critique. Our interfaces are intended to let you look closely at, and feel intimate with, vast amalgamations of data.

The Dataset

A dataset is a collection of data; it is a way of structuring and ordering that partly defines a machine’s horizon of possibilities. This structure is called an ‘information model’. The model determines what gets to count as data and what doesn’t; the act of determining thus defines the subject it creates. To make a model, a cut or choice has to be made when considering how to define the boundaries of the thing being observed.

Datasets and their structures are seen as a given. The act of forming information - collecting, editing, arranging - is hidden from an outside eye. Making visible this usually hidden system fits with the shift towards an infrastructural understanding of machine learning, or what artist Francis Hunger calls the ‘trinity of the information model–data–algorithm’.

In her text The Point of Collection, the New York-based artist Mimi Onuoha mentions a brilliant example of the difference it makes which entity collects data and what tactics it uses. In 2012, the FBI’s Uniform Crime Reporting Program registered 5,796 hate crimes, whereas the Department of Justice’s Bureau of Justice Statistics counted 293,800 such offences: roughly fifty times as many. In the first case, law enforcement agencies across the country voluntarily reported incidents; in the second, the Bureau distributed the National Crime Victimization Survey directly to the victims of hate crimes4. The difference in numbers can be explained by how each entity collected its data. When perceiving what has been collected, one must also consider that which is omitted.

The Design of Datasets

More often than not, datasets are represented as tabular data: a collection of text, numbers, and sometimes images, brought together by the light gray lines of the familiar spreadsheet interface. Each row represents an individual or object; each column represents a property the individual or object holds. Some datasets consist of multiple tables, across which objects are linked together by identification codes - be it an incremental number or a string of random characters. Even when the dataset’s source material is structured in another way, using the dataset requires parsing the data into a key-value format. It is the data’s structure, often a grid, that operationalises the data: it turns arbitrary content into a dataset.
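
As a minimal illustration of this structure, the sketch below reads a small invented table (its column names and values are not taken from any of the datasets discussed here) into exactly this kind of key-value record:

    import csv
    import io

    # A toy, invented table: each row is an object, each column a property it holds.
    raw = io.StringIO(
        "id,category,width,height\n"
        "000001,tie,320,240\n"
        "000002,baseball glove,640,480\n"
    )

    # csv.DictReader parses every row into a key-value mapping,
    # keyed by the column names given in the header row.
    for record in csv.DictReader(raw):
        print(record["id"], "->", dict(record))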

In their likeness to spreadsheets, datasets and their interfaces can be understood as ‘gray media’5. This term was devised by media theorist Matthew Fuller to describe the phenomenon whereby pervasive administrative tools become invisible. Considering that any form of visualisation, even at the stage of a spreadsheet, is inherently aesthetic, it can be wielded as a rhetorical device. The processes by which a dataset is created are often obfuscated through the very way in which it is presented, carefully curated and designed. This persuasive presentation means the intentions behind the acts of collection and the acts of omission fall out of sight. Even once we become aware of this, their infrastructural qualities, as well as their seamlessness, make them seem neutral. Their power lies precisely in their capability not only to make themselves gray, but to do the same with whatever information is presented through them. Datasets are data-in-formation, yet grayness makes the act of formatting disappear.

A Perspective of Care

In the aggregation and summarisation of data, it is not only the individual story that is lost. Aggregation also creates a closed world with its own interpretative system. This occurs particularly when data is compiled visually. Book artist, visual theorist and cultural critic Johanna Drucker explains this further:

the rendering of statistical information into graphical form gives it a simplicity and legibility that hides every aspect of the original interpretative framework on which the statistical data were constructed. The graphical force conceals what the statistician knows very well — that no “data” pre-exist their parameterization. Data are capta, taken not given, constructed as an interpretation of the phenomenal world, not inherent in it.6

The data’s dense mass hides its processes of collection and interpretation. It functions solely within its own domain. When a machine learning model based on it is applied elsewhere, assumptions originating in the dataset are unintentionally transferred. This is why Drucker, like us, calls for alternative approaches to data aggregation and visualisation, and for exploring forms that account for individual stories.

Individual entries are lost as researchers and developers view the dataset primarily through mathematical operations applied to the whole. We use averages or standard deviations because the sheer vastness of contemporary datasets makes it incredibly difficult to approach individual content. Artist Nicolas Malevé’s work Exhibiting ImageNet explores this vastness by showing all 14,197,122 images from the ImageNet dataset. Even though each image is shown for a mere 90 milliseconds, it takes two months to go through all of them, despite being on display 24/7. It shows the impossibility of seeing, let alone carefully considering, each individual entry of the collection.

To disassemble a dataset, a perspective of care towards what is being observed needs to be adopted. As you deconstruct the dataset’s logic, assumptions and intentions foundational to its creation are revealed. This could provide what Lecturer in Critical Infrastructure Studies Jonathan Gray calls Data Infrastructure Literacy: an understanding of data collection practices that moves away from the misleading emphasis on a technical form of data literacy and towards an infrastructural and political understanding of datasets.

The Dramaset

Working through Plotting Data, we were gridlocked between our desire to consider the individual and the impossibility of doing so. To overcome this, we turned to the concept of ‘data dramatisation’ put forward by artist Memo Akten and architect Liam Young7. ‘Data dramatisation’ implies a familiarity with data-driven processes that emerges through storytelling:

“To dramatize data, you must first understand it. You analyse it, play with it, try to find relationships, try to infer the events that took place, and extract stories and meaning.

And then you throw it all away. Chuck it in the bin, and wipe your hands clean. All that’s left is your understanding of the processes that gave rise to the data and the events and relationships within.”

It is through working with datasets, and the stories that lie embedded within them, that we generate a greater understanding of their politics. However, expanding on their original concept, we believe these dramatisations have the potential to generate a greater understanding not only for us, as artists and designers, but also for those who engage with this work. ‘Dramatisation’ does not imply a fixed array of strategies; we use it to refer to aesthetic and embodied experiences that allow for unusual interactions with datasets. While all data representation practices involve some form of aesthetic experience, we took the act of dramatising as an invitation to engage with datasets through various experiential practices.

“One of the small experiments we made during Plotting Data with saturating the KTH (Royal Institute of Technology in Sweden) action dataset was to play with the dataset’s six classifications: walking, jogging, running, boxing, waving, clapping, while building on its own aesthetics.”

We created interfaces that do not present datasets as ‘matter of fact’, which would imply a singular reading, nor are they optimised towards evoking a specific emotional outcome8. Rather, we designed for a more open form of affective engagement. We focused on the dataset’s individual entries, so that the presence of what is left out becomes inescapable. New forms of engagement let an audience perceive the dataset’s characteristics and conflicts: its abstract qualities become tangible, and we can consider the placement of individual entries in the vast dataset superstructure.

This publication follows a series of workshops held in 2019 that discussed various datasets. Building on three canonical datasets - COCO, Enron and 20BN SOMETHING-SOMETHING - we made our own interfaces to explore each dataset and its particularities. Using tactics from storytelling and dramaturgy, we engage with the datasets through identification and amplification of specific narratives already present within the collections. To consider the datum of the data, to distinguish the subject from the casus.

Each interface encourages interaction with the dataset’s individual entries through a different tactic of dramatisation: borrowing theatre script reading techniques for the Enron corpus, using scenographic techniques for the COCO dataset and exploring the agency of the annotator through the form of a game for the 20BN SOMETHING-SOMETHING dataset. Like a small performance, each interaction with the interface counters the dataset’s strict notation and proposes serendipitous encounters with its content.

Aside from these experiments in interfacing datasets and our reflections on them, this publication includes our conversations with four fellow artists who engage with datasets through their work: Mimi Onuoha, Francis Hunger, Caroline Sinders and Nicolas Malevé. It also contains further interface artworks and references that we have encountered along the way.

We would like to end this section by bringing into conversation two references we encountered while researching that do not immediately relate to the practice of machine learning, but which highlight the position and perspective of whoever collects the data, what they choose to include and what they choose to exclude.

The first reference is taken from scholar, artist/designer and hacker mama Catherine D’Ignazio’s article What would feminist data visualization look like?, in which she gives a poignant example of how different perspectives on the same data can lead to new interpretations. She talks about the 1960s Detroit Geographical Expedition and Institute project, where academic geographers and some of the inner-city youth of Detroit were brought together by Gwendolyn Warren, a then 18-year-old black female community activist. The goal of this research was to bring together academic geographers with ‘folk geographers’ in order to create ‘oughtness maps’: maps of how things are and maps of how things ought to be. Warren created the following map, entitled Where Commuters Run Over Black Children on the Pointes-Downtown Track.

A map drawn by Gwendolyn Warren as part of the Detroit Geographical Expedition and Institute

Warren’s map addresses the issue of children from Detroit’s black community being run over by cars driving from predominantly white, affluent suburbs to the Downtown area. The explicit title casts a different light on the data visualisation. It reveals the political in what is generally considered neutral and objective.

The second reference is an example from political scientist and anthropologist James C. Scott’s book Seeing Like a State (1998). Scott describes the American city of Chicago as seen from a bird’s eye view: lines upon lines arranged neatly into the rectangles that make up the street blocks. According to Scott, this flattening superimposition of spatial order and rational control has no necessary relationship to the order of life as it is experienced by the city’s residents. Rather, the tidy geometry of the city plan makes the city legible to municipal and state authorities, but not necessarily to the residents themselves. In fact, he argues, the standardisation benefits the surveyor and the planner, but importantly also the real estate speculator, because with such geometry the cost of land can be easily calculated at a glance.

“Map of downtown Chicago, circa 1893” - from Seeing Like A State (Scott, 1998)

When one puts these maps side by side, however, what stands out is that Detroit and Chicago have similar urban plans. Moreover, it is a layout they share with many modern cities. The same streets that were optimised for revenue generation and state control also generate the many dangerous intersections that Warren’s map exposes.

Juxtaposing these two stories emphasises the importance of considering who holds the bird’s eye view and how lives are affected by order imposed through infrastructures.

The intention of this publication is to counter the grayness of datasets, to instead saturate them. Just as city planning determines some of the infrastructures that later become so embedded in daily life that they are invisible, the dataset too is encoded with a purpose, an intention. Just like the grids of Detroit and Chicago, it is between rigid lines of labelled information that people live their (messy) lives. And while aggregates and statistics provide the means to operate at a large scale, one should not lose sight of the origins of the data. The dataset’s goal, its mode of production, the stories it encodes: all contribute to the ways of seeing that the dataset imposes. We intend these saturated interfaces to function as invitations for further exploration of a shared practice that makes engaging with datasets all the more common.

Encoding culture: Enron Corpus

The Enron corpus holds a historical position in the machine learning field. Enron was an American energy company that went bankrupt in 2001 while undergoing a fraud investigation. The corpus consists of about half a million emails exchanged between 1995 and 2001 by about 150 Enron employees, mostly senior management, organised into folders. It was prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes)9 for use in computer science, after the Federal Energy Regulatory Commission released nearly 1.5 million Enron email messages on the web10.

The dataset is in itself a time capsule. The corporate culture of Enron was very specific: US-centric, predominantly male, highly competitive11. It encompasses records from important moments in history, such as the first election of George W. Bush or the emails sent on September 11th, 2001. Part of the appeal of this corpus was the insight it offered into one of the largest fraudulent companies of its time.

“[T]he Enron e-mail library posted on FERC’s Web site also contains a remarkable glimpse into the culture of Enron – how the family of Ken Lay lived large in the glory days, how Tom DeLay and other members of Congress used the company as a veritable ATM for campaign contributions, how Enron plotted to place employees in the Bush-Cheney administration, how company executives almost obsessively followed the investigation into price gouging during California’s energy crisis, and ultimately how Enron employees suffered when the company collapsed.” (Tim Grieve)

However, despite its outdatedness, the corpus holds a unique position in computer science as a vast source of ‘natural’ text: text that was written without the intent of being recorded. Even after so many years, it is still a go-to collection for benchmarking algorithms12 or prototyping hypotheses. In this way, the linguistic expressions of these corporate employees become the resource against which algorithms are assessed for what counts as ‘naturalised language’.

The dataset has been used for multiple purposes: training algorithms for spam detection, identifying the idiosyncratic language of American businesspeople (it seems ball-related idioms are a safe bet), experimenting with network analysis, studying e-mail foldering structures, developing ‘compliance bots’ that detect sensitive elements in a text and alert writers if it would get them in trouble, and other experiments that assessed psychological traits from the corpus.13 Further use cases are mentioned at the bottom of the dataset’s home page14.

It does not require a stretch of the imagination to see how data scientists or programmers could extrapolate relations from datasets like Enron and use them as a standard when, say, assessing prospective employees. In fact, there are already services that automate managerial tasks by analysing employees’ email exchanges. One such example is the Dutch company KeenCorp, which uses the emails to create ‘mood indexes’ of a company’s various departments. Email is no longer merely a way of communicating; it becomes a quantifiable social indicator.

There are interfaces available that make it possible to go through the individual emails of the Enron corpus15. However, doing so would be extremely time consuming, as illustrated by artists Tega Brain and Sam Lavigne’s piece The Good Life, which gives subscribers the option to receive the Enron emails over a period of 7, 14 or 28 years.

Tega Brain and Sam Lavigne’s The Good Life

Inspired by activist platform Xnet’s Become a Banker, a theatre play based on 8,000 leaked emails that provides insight into the Spanish credit crisis, we set out to explore methods of reading to engage with the corpus’ emails. Theatre has a long history of political engagement through the embodiment of text. In Theatre of the Oppressed, pedagogue and theatre maker Augusto Boal proposes multiple tactics for turning non-dramatic material into theatrical performances, which he called Newspaper Theatre.

The physical and emotional conditions under which the emails were written are not described in the corpus. Indirectly, though, they are present. It is the reading of the emails that forces the reader to fill in these gaps. By reading, you can guess or speculate on the writer’s intentions and mental state. This deliberate interpretation of the emails adds an extra layer of information, one of affective experience.

But which emails should one read out loud? For us, the reading of the emails had to be focused on the specific company culture of Enron. Therefore, instead of dumping the whole corpus, we made preselections that we intuitively felt might give insight into particular narratives. What kind of emails were sent after 3am? What about emails sent on the 1st of January of every year? What emails contained the word ‘tired’? We queried the corpus for specific dates, times or words, creating opinionated subsets that might draw out possible narratives.
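
For readers who want to attempt something similar, here is a minimal sketch of such a query, assuming the corpus is available locally as the CMU ‘maildir’ release; the path, the keyword and the exact reading of ‘after 3am’ are illustrative choices, not the selections used for the scripts:

    import os
    from email import policy
    from email.parser import BytesParser
    from email.utils import parsedate_to_datetime

    # Hypothetical local path to the CMU "maildir" release of the corpus:
    # maildir/<employee>/<folder>/<message files>, one RFC 822 message per file.
    CORPUS_ROOT = "maildir"
    KEYWORD = "tired"

    parser = BytesParser(policy=policy.default)

    for dirpath, _dirs, filenames in os.walk(CORPUS_ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                msg = parser.parse(fh)
            try:
                sent = parsedate_to_datetime(msg["Date"])
                body = msg.get_body(preferencelist=("plain",)).get_content()
            except Exception:
                continue  # skip messages with missing or malformed headers/bodies
            # Two of the opinionated subsets described above: emails sent in the
            # small hours (here: 03:00-03:59), or emails mentioning a keyword.
            if sent.hour == 3 or KEYWORD in body.lower():
                print(sent.isoformat(), path)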

We invited artist and educator Amy Pickles to write a script based on the outcomes of these searches. In her work, Amy Pickles uses references as material, making scripts through cut and paste, putting sources in conversation. She also uses references for action, using pedagogical methods in workshops. For these scripts, references included composer Pauline Oliveros’ Sonic Meditations, Raymond Murray Schafer’s sound education exercises, playwright debbie tucker green and the writing of Georges Perec and Brandon LaBelle. She chose the subset of emails sent after 3am, where the employee at the top of the list was Philip_Allen:

“He seemed to send a variety of emails all at 3am, so I imagined him in his various guises (or disguises) in the office. I wanted to try to recreate the tension in the email conversations, some seemed to fit well in the story of Enron’s bankruptcy. I wanted to play with the thought of email as a stunted conversation - like how I can be ‘talking’ to you when you are asleep right now, but I still hold you in mind when writing and it has a sense of immediacy.”16

This script was later performed together by the participants of the workshops.

Photographs from a workshop at DAS Theatre

Amy went on to adapt these emails for two workshops with educator Viki Zioga, which they called Sound Out.

“Sounding out is a way we learn how to connect the written word to the spoken word. In conversation, sounding out is an action to discover someone’s intentions or opinions. Sound Out is a moment to rethink how we pronounce ourselves through our information technologies. In this workshop we will reflect on (mis)communication within our everyday technologies. How do we listen to one another in the online space? What are the rhythmic patterns of technology-mediated collaboration? Sound Out is a structured discussion in combination with pedagogical reading, writing and listening practices that enter the discussion as performative group exercises”.17

The practice of reading the Enron corpus out loud, turning it into a performance, is not tailored towards an audience. Rather, it becomes an exercise in engaging with the affective qualities of the dataset by exploring its personalities, its tonalities, and the circumstances of its creation and publication. These are exercises in interpretation, discovering the specificities of a culture encoded into a collection.

Dataset for 4-year-olds: COCO

The COCO dataset is a large-scale object detection, segmentation and captioning dataset built by Microsoft in collaboration with Facebook and Google. COCO stands for Common Objects in Context. It has around 2,500,000 labelled instances in 328,000 images, divided into 91 categories.

Icons for the categories of the COCO dataset.

A quick glance at the COCO description paper18 informs us that the categories were chosen with the purpose of being recognisable by a 4-year-old. The selection of the vocabulary was made by combining categories from the already existing image dataset PASCAL VOC with a subset of the 1,200 most frequently used words that denote visually identifiable objects, and with a list of categories that children aged 4 to 8 put together by naming every object they saw in their indoor and outdoor environments. 272 categories resulted from this process, which the researchers then ranked by “how commonly they occur, their usefulness for practical applications, and their diversity relative to other categories”19.

The particularities of categorisation are easily observed in COCO. The only clothing item among the categories is a tie, yet there are two categories that refer to baseball: baseball glove and baseball bat. While there are categories for knives, spoons and forks, there is none for chopsticks. Hot dogs are considered as a single whole, while sausages and buns have no category.
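
The category list can also be inspected directly with the COCO Python API (pycocotools). The sketch below assumes a local copy of one of the released instances annotation files; the file path is illustrative:

    from pycocotools.coco import COCO

    # Illustrative path; any of the released instances_*.json annotation files would do.
    coco = COCO("annotations/instances_val2017.json")

    # Every category carries a name and a supercategory
    # (for instance, 'tie' falls under 'accessory').
    categories = coco.loadCats(coco.getCatIds())
    for cat in sorted(categories, key=lambda c: c["supercategory"]):
        print(f'{cat["supercategory"]:>12}  {cat["name"]}')

    print(len(categories), "categories in this annotation file")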

What constitutes an object is an age-old philosophical discussion, now resurfacing in machine learning practices. What constitutes a common object, and under which taxonomy to place it, is clearly a cultural question.

COCO’s categorisation system lays out certain relations between words that are amplified when the dataset becomes operationalised to make predictions. Multiple biases are at play in any dataset’s construction, including COCO’s: from the keywords fed to the image and video hosting site Flickr, to the images that the platform perceives as most closely associated with those keywords, to the biases of the Amazon Mechanical Turk tagging labourers, all of these steps lead to a particular worldview. The way in which the world is ordered as online images becomes the way in which algorithms trained on COCO decode and re-encode the world.

The COCO dataset contains pictures of representations of the object-categories.

COCO is considered a benchmark dataset and is used by both academia and industry to compare technological developments and to see which algorithm is the most ‘performant’. Algorithm programmers gain exposure and prestige the more accurately, and the faster, their algorithms can mimic the dataset’s labelling. This system aligns with the dominance of Silicon Valley’s meritocratic worldviews. The fact that developers optimise their models to perform well on these public, static datasets gives the datasets influence over the output of decision-making systems, even when they are not the sole input.

While the datasets and algorithms developed in-house by companies and governments often remain closed to examination from the outside, benchmark datasets provide a glimpse into an industry. Examining these datasets is an act of what communication theorists Star and Bowker call Infrastructural Inversion20: by looking at the visible ends of a complex, largely invisible system, we can get an insight into how it functions.

Unlike the ever-evolving and accumulating in-house datasets of the large tech companies, benchmark datasets are static entities. As they are created with a specific idea, and from a specific cultural viewpoint, these static collections become capsules of space and time. For example, media artist Hannah Davis points out21 that the Labeled Faces in the Wild22 (LFW) dataset from 2007, which describes itself as “a public benchmark for face verification”, is a collection of more than 13,000 images gathered from online news articles; over 7% of the images are of just four people who were heavily involved in the war in Iraq23. Furthermore, if we map out the birthplaces of around 3,000 of the 5,761 people in the dataset, we can see that it leans heavily towards America and Europe.

“Map showing the birthplaces of people in the Labeled Faces in the Wild dataset, as found through WikiData. For 3132 out of 5761 people the birthplace was found; of course using a source like WikiData might provide a further skew in the bias of the set. More information can be found here.”
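
Such per-person counts can be reproduced with a few lines of code. The sketch below is not Davis’s method; it assumes the standard LFW folder layout, with one directory of JPEG images per identity:

    import os
    from collections import Counter

    # Hypothetical local path to the extracted dataset, laid out as
    # lfw/<Person_Name>/<Person_Name>_0001.jpg, one folder per identity.
    LFW_ROOT = "lfw"

    counts = Counter()
    for person in os.listdir(LFW_ROOT):
        person_dir = os.path.join(LFW_ROOT, person)
        if os.path.isdir(person_dir):
            counts[person] = sum(
                1 for f in os.listdir(person_dir) if f.lower().endswith(".jpg")
            )

    total = sum(counts.values())
    for name, n in counts.most_common(5):
        print(f"{name}: {n} images ({n / total:.1%} of the set)")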

If sets like this provide the norm for machine learning models, that is, if machine learning models are optimised towards these datasets, it should come as no surprise that the outcomes of systems built on sets like LFW are biased. When access to certain services is determined by whether one’s face is detected or not, the stakes become much higher. It is only since October 2019 that LFW provides a disclaimer on its website, mentioning some of its limitations with regard to representation: “we would like to emphasize that LFW was published to help the research community make advances in face verification, not to provide a thorough vetting of commercial algorithms before deployment”24. This comment shows that even within the industry, the crucial position these datasets occupy is becoming ever more apparent.

Growing awareness of the bias within datasets largely follows the pushback the industry experienced against face datasets in the summer of 2019, when the Financial Times began publishing articles discussing the use of images scraped from the internet, in which many people’s faces were included without their knowledge. The newspaper drew largely on the MegaPixels research project by artists Adam Harvey and Jules LaPlace, who traced the application of various publicly available face datasets to places including Chinese universities affiliated with the Chinese military (just as universities such as the Massachusetts Institute of Technology are affiliated with the American Department of Defense). As such, these datasets could be used to train algorithms that aid the monitoring of the Uighur Muslim minority in Xinjiang25. All of a sudden your party photos may become weaponised without your knowledge or consent.26

The fact that this dataset is comfortably used by public and private research institutions to benchmark their algorithms falls in line with the history of the sciences: the subject is dehumanised in the name of objectivity. When philosopher and art historian Georges Didi-Huberman wrote about medical research into hysteria at the close of the 19th century, he described how the technology of photography enabled the documenter to turn the case into a tableau - a (visual) statement. This generalises the case, binding it to one camera frame, and one frame of reference.

One of the main goals of the COCO dataset is to aid visual scene recognition. In machine learning environments, scene understanding “involves numerous tasks including recognizing what objects are present, localizing the objects in 2D and 3D, determining the objects’ and scene’s attributes, characterizing relationships between objects and providing a semantic description of the scene.”27

The scenes specified in the COCO dataset.

In theatre, the word ‘scene’ refers to a unit of action, often delineated by time and space. In a scene, actors and objects are put on equal footing. Through carefully directed placement, they both function as carriers of meaning and intent in relation to the artistic goal of the production. An amalgamation of scenographic elements can orientate multiple acts of worlding through methods of sound, light, costume and scenery. Anthropologist Kathleen Stewart says, in relation to scenography, that “scenes becoming worlds are singularities of rhythm and attachment. They require and initiate the kind of attention that both thinks through matter and accords it a life of its own”28.

Through scenography, one can explore potentials to “irritate, highlight or reveal”29 the orders of the world. This provides the starting point of our interface to the COCO dataset. We juxtaposed the order of a 4-year-old’s world with computer vision applications that could be deployed by any of the thirteen companies who collaboratively developed COCO. Together with illustrator Merijn van Moll, we designed three scenes: one depicts a view from a drone, another the view from a self-driving car, while the third is the perspective of an algorithm that assesses product brands. All three of these technologies are currently being developed at either Microsoft30, Google31 or Facebook32.

The framing of the algorithmic point of view as that of a four-year-old is amplified by presenting the scenes as a children’s picture book. On these backdrops, the user sets out to create their own scenography, using the categorised shapes from COCO’s taxonomy. You explore the scenes and feel the naivety of the set’s categories in stark contrast to their intended application.

The choice of the annotator: Something-Something

When discussing the authorship of datasets for machine learning, the emphasis usually lies on the academic and/or industrial researchers who initiate and manage the set’s development.

In his research on databases, Francis Hunger identifies authorship as inclusive of administrators, data scientists, managers, programmers, engineers, user interface designers and politicians33. All of them influence the making of the database. In this sense, datasets, too, have many authors: in addition to those listed by Hunger, there are the annotators, the cleaners and the data collectors.

In the case of 20BN Something-Something, annotators both record short clips and describe their contents according to the requirements set by the dataset’s curators. In doing so, the annotator structures and enriches the information in the dataset.

Early machine learning datasets were relatively small, sometimes consisting solely of photographs of a few research group members, so the task of annotation was something the researchers would do themselves. As datasets grew larger, the task of annotation grew with them, and finding the people to do this labour became more complex. In 2005, Amazon was among the first to introduce a solution to this: its platform Mechanical Turk (AMT), on which ‘requesters’ can outsource their tasks to human ‘workers’. Each ‘Human Intelligence Task’ (HIT) is a small bundle of work requests, which is put up on the digital marketplace, and workers from all over the world can opt to do it. For instance, for the COCO dataset, which contains more than 300,000 images, over 70,000 hours of work were done by workers on the AMT platform.

Amazon Mechanical Turk is named after the Mechanical Turk from 1770, invented by Wolfgang von Kempelen. It was a ‘chess automaton’ depicting a man in Ottoman-style clothing playing chess. Within this contraption an actual human was hidden, who played the chess game and controlled the puppet. The subtitle of AMT is therefore fitting: Artificial Artificial Intelligence.

Many of AMT’s workers come from India, Kenya, Vietnam, Venezuela: from what is often called the Global South, where labour is much cheaper than in the Global North. While Amazon asks requesters to follow the wage norms of the places they work from, a survey of the platform shows that this is often not the case:

“A 2017 paper found that Turkers earn a median wage of approximately US $2 per hour, and only 4 percent earned more than $7.25 per hour”. –IEEE Spectrum

This imbalance perpetuates racist dynamics instilled since colonisation. The Global North remains reliant on outsourced labour; it continues to develop a technocapitalist economy at the expense of other nations and peoples. A company similar to AMT, KolotiBablo, has a satellite project named Kolostories: a collection of testimonials by workers about the opportunities that working for KolotiBablo has given them. Note, however, that the same company owns Anti-Captcha, a service that promotes itself through the interchangeability of its workers: with the push of a button on their website you can ‘shoot down’ a worker.

The websites Kolostories and Anti-Captcha are owned by the same company. However, they address their workers in a completely different manner.

While these platforms for ‘clickworkers’ pride themselves on providing opportunities for workers who demand flexibility or have no access to a healthy labour market, their median wage in 2017 was merely $2 per hour. Furthermore, these platforms limit communication among their workers. In this light, migration researchers Manuela Bojadžijev and Mira Wallis wonder whether this work should be seen as labour migration, or as ‘virtual migration’: migration without physical migration. They see new frictions appearing in “how we struggle across space and borders for better working conditions, or how we organise against different wages for the same type of work performed in different places”34.

Platforms such as Turkopticon and Turkerview try to fill this gap by providing browser plugins that allow workers to aggregate and share information about requesters. The mturk subreddit also provides a place for workers to discuss their assignments and their methods for maximising their wage. The project Data Workers Union addresses this issue from a slightly different angle by regarding all humans as data collectors for ‘Big Tech’, which demands unionisation. Similarly, the Pervasive Labour Union addresses these issues by offering a publishing platform for discussions around data collection on commercial social media.

To gather data for the COCO dataset, a series of interfaces for AMT was developed, each further refining the annotations for an image.

Nicolas Malevé has conducted research into the optimisation of the interfaces through which these clickworkers work. Developers try to balance the accuracy of the work (which should be high) against the workers’ pay (which they want to be as low as possible). More on this can be found in our interview with him.

This became all the more relevant when we started looking into the Something-Something dataset curated by 20 Billion Neurons. The authors of this dataset take an interesting philosophical approach to object recognition, and want to define objects not by their ‘class’ but by their affordances: “Closely related to material properties is the notion of affordance: An object can, by virtue of its properties, be used to perform certain actions, and often this usability is more important than its “true” inherent object class (which may itself be considered a questionable concept)”.

In an attempt to detect the affordances of objects, they created a video dataset based on a set of action templates for interactions between objects: for example, “holding [something] behind [something]” or “pushing [something] with [something]”. They asked workers on Amazon Mechanical Turk to record videos in first-person view while enacting the specified movements with objects of their own choice. In total, they collected 220,847 videos of 2 to 6 seconds each.
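
To illustrate how such a templated label is reassembled from a worker’s chosen objects, here is a small sketch; the JSON field names (‘template’, ‘placeholders’) follow our reading of the dataset’s label files and should be treated as an assumption rather than a verified schema:

    import json

    # A single, invented annotation entry in the assumed format:
    # the worker picked the objects, the template supplies the action.
    sample = json.loads("""
    {
      "id": "00001",
      "template": "Pushing [something] with [something]",
      "placeholders": ["a coffee mug", "a pencil"]
    }
    """)

    def fill_template(template, placeholders):
        """Replace each [something] slot with the worker's chosen object, in order."""
        label = template
        for obj in placeholders:
            label = label.replace("[something]", obj, 1)
        return label

    print(fill_template(sample["template"], sample["placeholders"]))
    # -> Pushing a coffee mug with a pencil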

However, looking at their Turkerview page, one can question the agency that these workers have over their work. For the task of reviewing 50 of these videos, to see if they fit the category labelled by their submitter, one gets 10 cents. One of the reviewers puts it clearly:

“Slave wage garbage that shouldn’t be tolerated by anyone. The requester is fully aware that they’re paying garbage, as they have not updated the pay in months and have left their HITs without any quals [worker qualifications], letting slave wage workers complete them.

50 videos to rate for 10 cents, and there’s a 3-4 second timer on each one, so there’s no way to complete this in a manner that would even raise the hourly out of the slave wages”.35
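
The back-of-the-envelope arithmetic supports this: at the timer’s stated minimum of 3 to 4 seconds per video, 50 videos take at least 150 to 200 seconds, so the 10-cent reward works out to roughly $1.80 to $2.40 per hour, before any loading or reading time.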

When considering an explorative interface for this dataset, this conflict in agency provided our starting point. The worker has the freedom to pick any object they’d like to create the videos with, yet they are still forced, by economic circumstances, to accept a fee for this work that would be considered marginal in the economy where the company operates.

The first-person perspective of (most of) the videos made by the Amazon Mechanical Turk workers quickly made us think of the ‘game’ as a fitting device to interweave narrative explanation and contextualisation with the excruciatingly slow way of making money. The videos already bear an inherently iconic game aesthetic. Inspired by choose-your-own-adventure games and dungeon-explorer-style games, the player sets out to explore the dataset.

Our interface to the Something-Something dataset addresses the conditions under which clickworkers have to work in order to produce the dataset. At the same time, the individual videos within this collection are brief encounters with the workers. Some show a lot of care for their recording, while others are made as quickly as possible. They provide a glimpse of the human in the dataset: the ‘Ghost Workers’, or ‘Guest Workers’, the hidden underclass behind machine learning models.

Colophon

Text & Interfaces: Cristina Cochior & Ruben van de Ven

Editing: Amy Pickles

Background illustrations for the COCO interface: Merijn van Moll

Enron Script: Amy Pickles


Thanks to:

Creative Coding Utrecht, Annet Dekker, Marloes de Valk, all the participants to the Plotting Data workshops

This project is supported by the Creative Industries Fund NL.

Notes


  1. A list, compiled by the NOS, of organisations in the Dutch government that use algorithms

  2. The Ethnography of Infrastructure, 1999. Susan Leigh Star

  3. For more writing on this subject, see the work of Ruha Benjamin, Safiya Umoja Noble, Taina Bucher, Cathy O’Neil, Mutale Nkonde, Sasha Costanza-Chock or Yeshimabeit Milner.

  4. https://points.datasociety.net/the-point-of-collection-8ee44ad7c2fa

  5. Evil Media, Matthew Fuller.

  6. http://www.digitalhumanities.org/dhq/vol/5/1/000091/000091.html

  7. The term data dramatisation was first conceptualised by Memo Akten in an article on Medium; however, it originated with the architect and artist Liam Young. https://medium.com/@memoakten/data-dramatization-fe04a57530e4

  8. See ‘The Disenchantment of Affect’ by Sengers et al. for a call to a more open form of affective appeal.

  9. CALO was a DARPA-sponsored project by SRI International, an American nonprofit scientific research institute based in Menlo Park, California. The name is inspired by the Latin word “calonis”, meaning “soldier’s servant”. One of the spin-offs from CALO was Apple’s digital assistant Siri.

  10. Releasing such a large amount of private conversations into the public domain raises significant issues of consent. Although there is a possibility to request the deletion of certain emails, or at least to strip them of social security and credit card numbers, this creates further complications, as the maintainer must rely on the goodwill of those who already downloaded the dataset to replace it with a new version. This in turn paradoxically generates even more attention and visibility around the employee who wishes to delete their record. Within the interface, we have tried to obscure the recipient and sender in the email header.

  11. https://www.youtube.com/watch?v=-w6duQhWuVk

  12. Further discussion on benchmark datasets is in the COCO dataset text of this publication.

  13. https://www.newyorker.com/magazine/2017/07/24/what-the-enron-e-mails-say-about-us

  14. https://www.cs.cmu.edu/~enron/

  15. http://www.enron-mail.com/

  16. Quote from an email from Amy addressed to us and the workshop participants who we would be performing with.

  17. http://amypickles.co.uk/research-ing/sound-out?i=1

  18. https://arxiv.org/pdf/1405.0312.pdf

  19. https://arxiv.org/pdf/1405.0312.pdf

  20. in Sorting Things Out.

  21. https://towardsdatascience.com/a-dataset-is-a-worldview-5328216dd44d

  22. http://vis-www.cs.umass.edu/lfw/

  23. 530 images of George W. Bush, 236 of Colin Powell, 144 of Tony Blair and 121 of Donald Rumsfeld. These 1,031 images make up 7.78% of the 13,244 images. And that is without counting, for example, the 109 images of Gerhard Schröder.

  24. Following the Internet Archive’s Wayback Machine this disclaimer has been added between September 3 and October 6 2019.

  25. For this, see the great research by Adam Harvey and Jules LaPlace on the Brainwash dataset: https://megapixels.cc/brainwash/

  26. It is hard to know whether your face is in one of these facial recognition datasets. For example, the American multinational technology company International Business Machines Corporation (IBM) has its own dataset for facial recognition, consisting of many images from the public domain, called Diversity in Faces. However, in order to find out if you are in the set, one needs to email the exact URLs of the photos one wants to inquire about. Without knowing who or what is part of this dataset, it becomes virtually impossible to request removal from such a collection. The politics of inclusion become even more violent when we consider an example provided by researcher Os Keyes. In The Gardener’s Vision of Data, they describe how the National Institute of Standards and Technology (NIST) curates a series of datasets with which they benchmark commercial systems for facial recognition. However, the contents of these datasets consist of “immigrant visa photos, photos of abused children, and myriad other troubling datasets, without the consent of those within them”. Keyes highlights a specific subset which is available for public download, the ‘Multiple Encounter Dataset’. The MEDS is a collection of mugshots of 380 people, collected and circulated without their consent. An examination of their photos shows many of these people in distress; sometimes they are even wounded.

  27. From the COCO paper.

  28. Beyond Scenography, Rachel Hann

  29. Beyond Scenography, Rachel Hann

  30. https://www.techtimes.com/articles/245918/20191106/microsoft.htm

  31. https://waymo.com/

  32. Creepy Facebook Patent Uses Image Recognition to Scan Your Personal Photos for Brands (Fast Company): https://www.fastcompany.com/90333067/creepy-facebook-patent-uses-image-recognition-to-scan-your-personal-photos-for-brands

  33. https://aprja.net/epistemic-harvest-the-electronic-database-as-discourse-and-means-of-data-production/

  34. https://www.law.ox.ac.uk/research-subject-groups/centre-criminology/centreborder-criminologies/blog/2018/05/researching

  35. https://turkerview.com/requesters/A12M8Y27IW05FA-20bn