Large language models (LLMs) are tools applicable to a wide range of tasks, from writing emails to assisting with medical diagnoses. MIT researchers have tried to turn the tables, arguing that because humans decide when to use language models, it is essential to understand how people form their beliefs about the capabilities of these models. Their recent study, supported by the Harvard Data Science Initiative and the Center for Applied AI at the University of Chicago Booth School of Business, could help improve the performance of language models in real-world situations.
“These tools are exciting because they are general purpose, but for that very reason they will collaborate with people, so we have to take into account the human role,” says Ashesh Rambachan, co-author of the study and assistant professor of economics at MIT.
To explore this concept, the researchers created a framework for evaluating an LLM based on its alignment with human beliefs about its performance on certain tasks. They introduced a human generalization function, a model of how people update their beliefs about an LLM’s capabilities after interacting with it. The results show that when models are misaligned with the human generalization function, users may be overconfident or underconfident about when to use them, leading to unexpected failures.
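The intuition behind the framework can be sketched in a few lines of code. The snippet below is a minimal, hypothetical illustration, not the authors’ actual formulation: the simple belief-update rule, the function names, and the numbers are all assumptions. It treats the human generalization function as a mapping from an observed success or failure on one question to a predicted chance of success on a related question, and measures misalignment as the gap between that prediction and the model’s true accuracy.

```python
# Minimal, hypothetical sketch of the alignment idea described above.
# The belief-update rule and the numbers are illustrative assumptions,
# not the formulation used in the study.

def human_generalization(prior: float, observed_correct: bool) -> float:
    """Toy model of how a person updates their belief that an LLM will
    answer a *related* question correctly after seeing one answer."""
    # Assumption: people weight incorrect answers more heavily than correct ones,
    # so a failure drags the belief down more than a success lifts it.
    return min(prior + 0.10, 1.0) if observed_correct else max(prior - 0.30, 0.0)

def misalignment(prior: float, observed_correct: bool, true_accuracy: float) -> float:
    """Gap between the person's generalized belief and the model's
    actual accuracy on the related question."""
    return abs(human_generalization(prior, observed_correct) - true_accuracy)

# Example: a user sees the model miss one simple question, then decides
# whether to trust it on a harder, related task where it is actually strong.
print(misalignment(prior=0.7, observed_correct=False, true_accuracy=0.85))  # large gap: underconfidence
print(misalignment(prior=0.7, observed_correct=True, true_accuracy=0.20))   # large gap: overconfidence
```

In this toy setup, a large gap in either direction corresponds to the failure modes described above: users either trust the model where it is weak or abandon it where it is strong.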
“Language patterns often seem so human. We wanted to illustrate that this human power of generalization is also present in how people form beliefs about language patterns,” says Rambachan.
The researchers ran a survey to measure how people generalize when interacting with LLMs and other people. They showed participants questions that a person or an LLM had answered correctly or incorrectly, and then asked whether they thought that person or LLM would answer a related question correctly. Through the survey, they generated a dataset of nearly 19,000 examples of how humans generalize about LLMs’ performance on 79 different tasks.
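To make the shape of such a dataset concrete, the sketch below shows one plausible way to represent survey responses and summarize, per task, how often participants expected success on the related question after seeing a correct versus an incorrect answer. The field names and the aggregation are illustrative assumptions, not the released data format.

```python
# Hypothetical record format for survey responses; field names are assumptions.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class SurveyResponse:
    task: str                        # one of the 79 tasks
    observed_correct: bool           # was the shown answer correct?
    predicted_related_correct: bool  # did the participant expect success on a related question?

def generalization_rates(responses):
    """For each task, estimate how often participants predicted success on the
    related question, split by whether the observed answer was correct."""
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})  # [predicted_yes, total]
    for r in responses:
        cell = counts[r.task][r.observed_correct]
        cell[0] += int(r.predicted_related_correct)
        cell[1] += 1
    return {
        task: {obs: (yes / total if total else None) for obs, (yes, total) in by_obs.items()}
        for task, by_obs in counts.items()
    }

# Example with two toy responses:
print(generalization_rates([
    SurveyResponse("grade-school math", observed_correct=True, predicted_related_correct=True),
    SurveyResponse("grade-school math", observed_correct=False, predicted_related_correct=False),
]))
```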
The survey showed that people tend to be more influenced by LLMs’ incorrect answers than by correct ones, and that they believe an LLM’s performance on simple questions is not indicative of its ability on more complex questions. In situations where people put more weight on incorrect answers, simpler models outperformed very large models like GPT-4.
“Language models that improve and learn can fool people into thinking they will perform well on related questions when, in fact, they do not,” adds Rambachan. “When we train these algorithms or try to update them with human feedback, we need to take human generalization into account in how we think about measuring performance.”