From generating melodies and transcribing voices to assisting the visually impaired, generative audio artificial intelligence (AI) has advanced in leaps and bounds, to the point where it can now create high-quality audio. Despite this, the data used to train these systems contains bias, offensive language and copyrighted content that has gone largely unexamined, a study claims. A team of researchers has carried out an exhaustive review of 175 speech, music and sound data sets, and in preliminary work they warn that the material contains biases similar to those already found in text and image databases.
For a year, scientists led by William Agnew of Carnegie Mellon University (USA) studied 680,000 hours of audio from seven platforms, reviewing 600 research papers in total to analyze the content, biases and origin of the data. The material ranged from speech transcriptions to song lyrics, most of it in English. The files included voice recordings (sentences read aloud by people) and pieces of music from platforms such as AudioSet and Free Music Archive, as well as two million 10-second YouTube clips.
The analysis found, for example, that the word "man" was associated with concepts such as war and history, while the terms linked to the word "woman" included store and mom, associated with care and family; in other cases the researchers detected insults such as "bitch." Free Music Archive and LibriVox in particular contained thousands of racist slurs (such as "nigger") and terms discriminatory toward sexual minorities. "Queer voices are ignored by researchers, and that is partly due to how these data sets were constructed," says Robin Netzorg, a speech researcher at the University of California and co-author of the study.
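The study's own tooling is not reproduced here, but the kind of association analysis described above can be sketched with a toy co-occurrence count over transcripts. The mini-corpus, the `co_occurrences` helper and the five-token window below are hypothetical choices for illustration, not the paper's actual method.

```python
from collections import Counter
import re

# Hypothetical mini-corpus standing in for data set transcripts.
transcripts = [
    "the man fought in the war and studied history",
    "the woman went to the store with her mom",
    "a man of war and a man of history",
    "mom and the woman at the store",
]

def co_occurrences(docs, target, window=5):
    """Count words appearing within `window` tokens of `target`."""
    counts = Counter()
    for doc in docs:
        tokens = re.findall(r"[a-z']+", doc.lower())
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for j, t in enumerate(tokens[lo:hi])
                              if j + lo != i)
    return counts

for target in ("man", "woman"):
    print(target, co_occurrences(transcripts, target).most_common(5))
```

Run over a real corpus, skewed counts of this kind are what surface the man/war and woman/store associations the researchers describe.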
Researchers believe that if stereotypes are not adequately addressed, audio data sets can generate patterns that "perpetuate or even accelerate" prejudices and distorted conceptions of reality. Julia Barnett, a doctor in computer science at Northwestern University (USA) and a collaborator on the study, says people are often unaware of these biases. "As a consequence, viewing a data set as a reflection of humanity without understanding its true composition will lead to numerous negative effects later on," she says.
For Andrés Masegosa, an expert in artificial intelligence and associate professor at Aalborg University in Denmark, there is nothing surprising about the biases: "This technology extracts patterns from a set of data and simply tries to replicate what already exists." AI learns much the way humans do, he suggests. "If you expose a child to sexist behavior, they will reproduce that bias quite unconsciously," says the academic, who did not participate in the research.
"There are many attempts to avoid biases, and what is clear is that the models lose capability. There is a debate in the field of AI that reflects the different visions each society has," adds Masegosa. The expert acknowledges that the study is a large-scale audit, and notes that examining data sets of this size is an expensive undertaking.
Audio data requires far more storage than text, says Sauvik Das, an academic at the Human-Computer Interaction Institute at Carnegie Mellon University, who participated in the research. This means that much more processing power is needed to audit it. "We need more data to have higher-quality models," he argues.
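A back-of-the-envelope comparison shows the scale involved. The constants below (16 kHz, 16-bit mono PCM audio; roughly 150 spoken words per minute) are illustrative assumptions, not figures from the study.

```python
# Back-of-the-envelope: one hour of raw audio vs. one hour of transcript text.
# All constants are illustrative assumptions, not figures from the study.

SAMPLE_RATE = 16_000      # samples per second (16 kHz speech-quality audio)
BYTES_PER_SAMPLE = 2      # 16-bit PCM
SECONDS_PER_HOUR = 3_600
WORDS_PER_MINUTE = 150    # typical speaking rate
BYTES_PER_WORD = 6        # average UTF-8 word plus a space

audio_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * SECONDS_PER_HOUR
text_bytes = WORDS_PER_MINUTE * 60 * BYTES_PER_WORD

print(f"1 h audio: {audio_bytes / 1e6:.0f} MB")            # ~115 MB
print(f"1 h text:  {text_bytes / 1e3:.0f} kB")             # ~54 kB
print(f"audio/text ratio: ~{audio_bytes // text_bytes}x")  # ~2133x

# Scaled to the 680,000 hours the study reviewed:
print(f"corpus: ~{audio_bytes * 680_000 / 1e12:.0f} TB of raw audio")
```

Under these assumptions, the 680,000 hours reviewed correspond to tens of terabytes of raw audio, against only tens of gigabytes for the equivalent transcripts.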
The voice is biometric data
The potential harms of generative audio technologies are not yet fully known. The scientists argue that this type of content will have social and legal implications ranging from people's right of publicity to misinformation and intellectual property, especially when these systems are trained on data used without authorization. The study indicates that at least 35% of the audio analyzed contained copyrighted material.
The voice is tied to the right to one's own image, since it is part of a person's physical characteristics. Borja Adsuara, a lawyer specializing in digital law, points out that voice raises the same problems as AI-generated text and images with regard to data protection and intellectual property. "The voice is biometric data and is specially protected, like a fingerprint or the iris of the eye. That protection can be violated if its use is not authorized," the specialist explains.
Adsuara recalls the well-known controversy involving actress Scarlett Johansson, when in May 2024 Sky, an OpenAI chatbot voice, sounded strikingly similar to her own. AI has also used the voices of musicians to make them appear to sing songs they never performed, as happened to the Puerto Rican artist Bad Bunny and the Spanish artist Bad Gyal. "Not only does it infringe the image rights over one's own voice, but also the intellectual property rights over the performance. The problems are the same; what generative artificial intelligence does is make it much easier to commit a crime or an infringement," he explains.