How far does our digital footprint reach? We are aware of the trail left by our activity on social media and by any content we post somewhere accessible (or not so accessible) on the internet. The contributions we make in these forums are there for everyone to see, so we tailor our content to a rough idea of who will read it and the image we want to convey. Instant messaging platforms such as WhatsApp, on the other hand, are a different matter entirely: “In private messages you reveal more of yourself, not only in the content, but also in the way you use language”, explains Timo Koch, a researcher in the Psychology department of the University of Munich (Germany).
After analyzing a set of more than 300,000 WhatsApp messages and training an algorithm capable of recognizing the age and gender of their authors, Koch and his team warn that the experiment makes clear how important it is to preserve privacy in these spaces. “End-to-end encryption is an important first step. But beyond that, we need to be well informed, and the platforms need to be transparent and add labels when information is not encrypted”, the expert proposes.
The concerns of Koch and his team were heightened by the growing tendency of social networks to favor private messaging spaces. “Facebook is shifting the focus to these conversations, and it’s probably going to want to use the data, so we need to have a conversation about how we want to protect these messages and ensure that if they’re marked private, they really are.”
How many messages are needed to identify us? It depends on which part of the process we are considering. Koch and his team based their algorithm on the contents of What’s up, Deutschland?, a corpus of 451,938 WhatsApp conversations provided by 495 German volunteers. After excluding very brief interactions and cases with no information about the age and gender of the participants, 226 individuals remained, accounting for 309,229 messages and 1,949,518 words. To make the predictions themselves, they used even less.
Similar studies that used social media as a source of content based their analysis on large text samples, with tens of millions of words provided by tens of thousands of volunteers. What the new study lacks in scale, it makes up for in data quality and in the more intimate way users express themselves in these environments. “The fact that we have such a small dataset and our predictions work gives us a clue as to how much more could be done. Our results should be considered a minimum”, the authors state.
Once the algorithm is trained, a sample of 1,000 words is enough to classify gender and age with reasonable accuracy. To put this figure in perspective, we counted the words in a moderately active conversation between two people: three days of dialogue leave behind a little more than 1,000 words. Nevertheless, the researchers acknowledge that with a larger database, the potential for analysis would be much greater. “If we think about personality analysis or other characteristics, we would need more information, because the differences are more subtle,” notes Koch. “When you have a good model, making a prediction is a matter of seconds.”
Tell me who you are, and I’ll tell you how you use WhatsApp
This identification is possible because the way we express ourselves on WhatsApp follows demographic patterns. According to What’s up, Deutschland?, younger users use more emoticons and express themselves in the first person more often. This trait, already observed in studies of content published on other platforms, seems to confirm that we become less individualistic with age.
As far as gender is concerned, Koch and his team found that women use a wider and more varied range of emojis and also resort more often to first person singular pronouns. Among men, what stands out is more colloquial language and more frequent references to alcohol consumption.
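To give a rough idea of what such patterns look like as numbers, here is a minimal sketch in Python that counts emojis and first person singular pronouns per 1,000 words in a set of messages. It is only an illustration and not the method used by Koch and his team: the function name `style_features`, the short list of German pronoun forms and the simplified emoji ranges are all assumptions made for this example.

```python
import re

# Illustrative German first person singular forms.
# This list is an assumption for the example, not the one used in the study.
FIRST_PERSON = {
    "ich", "mich", "mir",
    "mein", "meine", "meinen", "meinem", "meiner", "meines",
}

# Rough emoji matcher covering a few common Unicode blocks.
# Real emoji detection is more involved (sequences, skin tones, flags).
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# A "word" here is any run of letters (digits, punctuation and emojis are skipped).
WORD = re.compile(r"[^\W\d_]+")


def style_features(messages):
    """Return emoji and first person pronoun rates per 1,000 words."""
    words = []
    emoji_count = 0
    for text in messages:
        words.extend(w.lower() for w in WORD.findall(text))
        emoji_count += len(EMOJI.findall(text))

    total = len(words)
    if total == 0:
        # Avoid dividing by zero when there is no text to analyze.
        return {"words": 0, "emoji_per_1000": 0.0, "first_person_per_1000": 0.0}

    first_person = sum(1 for w in words if w in FIRST_PERSON)
    return {
        "words": total,
        "emoji_per_1000": 1000 * emoji_count / total,
        "first_person_per_1000": 1000 * first_person / total,
    }
```

Rates like these, computed over a sample of around 1,000 words, are the kind of input a classifier could be trained on. The actual study may rely on different or far richer features.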
Koch does not rule out that the way we express ourselves in these environments has shifted slightly since then. After all, the contents of the dataset used in his study were compiled between November 2014 and January 2015. Formats such as stickers, incorporated in 2018 (although they already existed in other applications, such as Line), or direct access to GIFs could have introduced certain variations.
But accessing a broader and more up-to-date corpus is not easy, at least from within academia. “A large technology company has access to much more data”, he points out. Richer and more recent sources of information would make it possible, for example, to carry out more complex analyses of users’ personalities or to study how the way we open up in private messages, in contrast to what we share on social networks, varies across cultures and national contexts.
Another limitation outside English-speaking countries is language. The predominance of English in the development of language processing systems means that most of the available tools are built for that language. “We had to train our own models. Each language is different and has its own signals,” says Koch.
With the danger now in plain sight, should we be more guarded about how candid we are in private messaging apps? For Koch, for now it depends on how much weight we give to privacy versus convenience. “There are some good alternatives, such as Signal, which is also encrypted and does not have a corporation behind it that is interested in profiting from the information”, he comments.