Artificial intelligence (AI) is not as automatic as it is made out to be. The technology works thanks to powerful computers that run calculations on vast databases. But those databases must be cleaned and checked, a manual job that companies outsource to legions of workers who are generally paid very little, sometimes just a few cents per completed task. This reality was described in the book Ghost Work, published in 2019 by Mary Gray, an anthropologist and researcher at Microsoft, and her colleague Siddharth Suri.
When that book was published, Milagros Miceli (Buenos Aires, 41), a sociologist with a doctorate in computer science, had already been researching the topic for two years. When she was hired at the newly created German Internet Institute, named after AI pioneer Joseph Weizenbaum, the social consequences of algorithms were being addressed from a very theoretical point of view. Miceli wanted to go further. “I wondered if anyone was talking to the people behind those algorithms. That’s how I got to the data annotators, who label images of chairs with the word ‘chair’ so that the machine learns to distinguish them, and then to the data workers, a concept that we developed,” she explains.
Miceli has stayed with the subject ever since. Today she is one of the leading experts in the little-known field of data workers. She is also a senior researcher at the DAIR Institute, the center founded by Timnit Gebru, the Google AI ethics lead who was fired shortly after signing a report in which the company did not fare well. This December she took part in the III Conference on thinking about global digital justice, held in Barcelona, to talk about all this.
Q. What is a data worker?
A. A data worker is a person who produces data to train AI systems. That can mean recording their own voice, uploading selfies, labeling and classifying data, interpreting it… What many people do not realize is that this work is continuous; it does not end at some point. The systems require ongoing manual work, such as maintaining data sets, checking them, or fine-tuning them.
Q. What qualifications are needed to carry out these tasks?
A. There is a myth that the people who do this are unskilled workers. In practice, these are workers who have completed at least tertiary or higher education. I have even met people with doctorates who do this type of work.
Q. Where are data workers located?
A. They are concentrated in vulnerable or poor populations that have very high unemployment but also higher education. The work itself is actually very difficult. I have tried it. It requires not only formal knowledge but also craft skill.
Q. Could you give me an example?
A. Satellite image labeling and segmentation are very common tasks, and they are very difficult. First, the work is tiring on the eyes and on the hand that moves the mouse non-stop. On top of that, you have to be very careful to separate a tree from a person, or a house from a car, in images that are often blurry. That requires some knowledge of the specific architecture and vegetation of a country. And if you do it wrong, you don’t get paid anything.
Q. What is the situation for data workers?
A. It has not changed since this started. They work in total precariousness and without protection. What was once called the uberization of work prevails here: they are paid per completed task, not for the time the task takes. For data workers, that means the time spent logging in, finding an available task (there is not always one), or understanding the instructions, which can be very complex and are almost always in English, is not counted. All of this comes with the risk that the client will later say the work was not done as requested and refuse to pay. This happens in many cases, and on top of that the client has the right to keep the data that was delivered.
Some data workers are blocked from platforms for asking questions, for example, about pay. Some platforms, including the largest of all, Amazon Mechanical Turk, do not pay in money but in vouchers, in this case to spend on Amazon. That is how a perfect monopoly is built. And of course, when something happens to the worker, such as suffering the consequences of working with psychologically disturbing content, no one helps them. In many cases they have signed a confidentiality agreement that prohibits them from revealing the nature of what they do. Some workers have told us that, for this reason, and despite suffering from post-traumatic stress, they have avoided seeing therapists. Nor can they state on their resume that they worked as content moderators for such-and-such a major platform.
Q. So, are there content moderators who are not on the payroll, but who come in through this microwork route?
A. There are content moderators who are not employees, and most content moderators have precarious contracts through third-party companies, some in Europe but many in countries of the global south. Those companies, by the way, are the same ones that used to do image tagging. In fact, the same people often rotate from one team to the other. Moreover, content moderation is often done at the same time as data labeling: moderators decide whether or not what they are reviewing is hate speech, and that information is then used to train the algorithms.
Q. Do you know how many data workers there are?
A. It is very difficult to give a number. The World Bank, a conservative institution, says there are between 150 and 420 million in the world. What we do know is that the numbers have grown exponentially in recent years. The idea that work is going to be automated is a lie. AI requires a lot of manual work.
Q. The uberization narrative holds that microworkers want to work at specific times, to supplement their income. Glovo makes the same argument. Is that so? Do data workers work full-time or only part-time?
A. It’s another myth, yes. There are hardly any occasional data workers, and that has to do with the complexity and high level of professionalization that, as we mentioned before, these tasks require. The more sophisticated AI models become, the more qualified the workers who handle their databases have to be. It is no longer like ten years ago, when people were asked to identify kittens in a series of photos. That no longer exists. To make any money at this, you need to work every day.
Q. The classic example of digital microwork, as you say, was image tagging. What is most in demand now?
A. Seven years ago, when I started in this field, the trend was tagging photographs. What mattered was quantity, not quality. In 2019 we did a study in which we analyzed the instructions given to workers, and the majority were along those lines. But recently there has been a very marked shift towards tasks that have more to do with linguistics and generative AI: producing data from scratch for a specific purpose. For example, unemployed artists are hired and asked to create images according to certain basic instructions, the so-called prompts. That output is then fed to the Midjourney algorithm to refine how it works. Or journalists and writers are hired to write articles or short stories so that the machine can extract patterns. People are also recorded reading texts in dialects or minority languages to enrich the databases.
Q. Can AI work without this manual labor? Does it need human support 24 hours a day?
A. The system is designed to have workers available 24 hours a day, seven days a week, while paying them the bare minimum. And if they don’t like the conditions, companies can move on to the next country or town. Scale is what rules, and that only works with millions of workers. Of course, there is another way to do things. Models perform better if they are trained on smaller but better curated data sets. For that you don’t need millions of workers; you need good professionals and communication with them. It is the opposite of anonymity and algorithmic mediation.
Q. The latest generative AI models have already been trained on all the data available on the Internet, so the next generation must include all of that plus new synthetic, or artificially produced, data. Do you think generative AI will send demand for data workers soaring?
A. If I had to make a prediction, it is that the number of data workers is going to keep growing. Even those who bet that the future lies in synthetic data, that is, data generated by machines, know that this is technically difficult. Without going into too much detail, training an AI on data generated by an AI produces a loop: it ends up repeating the same thing, like an endless hall of mirrors. So writers, artists, journalists, and translators will continue to be needed to generate data that enriches the databases on which the algorithms are trained.
But even if you could train models on synthetic data, you would still need data workers for algorithmic verification tasks, which consist of sitting down with, for example, ChatGPT, asking it questions, and saying whether its answers are right or wrong, whether there is a better option, and so on. As for language, it is dynamic; it changes. Chatbots must be constantly refined, and only humans can do that, because we know and understand the contexts.
Q. Why do you think this manual dimension of AI is so opaque?
A. It’s totally intentional. We are sold the myth of a technology that is miraculous and incredibly powerful, and that we should fear because it could kill us all. I would add that this technology rests on off-the-books labor, on precarious work, on the exploitation of millions of workers. But to sell the myth of an ultra-powerful and fearsome technology, every trace of humanity has to be erased. Yet AI would not work without legions of manual workers. Why keep hiding them and keeping them in precarious conditions?