artificial general intelligence

AI’s understanding and reasoning skills can’t be assessed by current tests

July 12, 2024

“Sparks of artificial general intelligence,” “near-human levels of comprehension,” “top-tier reasoning capacities.” All of these phrases have been used to describe large language models, which drive generative AI chatbots like ChatGPT. Since that bot arrived on the scene in late 2022, it seems as if each new generative AI model has been hailed as the next, best iteration — not just producing humanlike content but also approaching near-human cognition (SN: 12/11/23). But what can we really say about any LLM’s ability to reason and understand?

In the AI community, there is no consensus on the definition of machine “intelligence,” nor on how to define the various cognitive capabilities often attributed to LLMs. Such high-level claims about understanding are often based on benchmark datasets, which use many instances of a specific task (say, answering questions) to assess aggregate performance (usually based on a metric like accuracy).

Consider, for example, Massive Multitask Language Understanding, or MMLU, a popular benchmark for assessing the knowledge acquired by LLMs. MMLU includes some 16,000 multiple-choice questions covering 57 topics, including anatomy, geography, world history and law. Benchmarks such as BIG-bench (the BIG stands for Beyond the Imitation Game) consist of a more varied collection of tasks. Discrete Reasoning Over Paragraphs, or DROP, claims to test reading comprehension and reasoning. WinoGrande and HellaSwag purport to test commonsense reasoning. Models are pitted against each other on these benchmarks, as well as against humans, and models sometimes perform better than humans.

But “AI surpassing humans on a benchmark that is named after a general ability is not the same as AI surpassing humans on that general ability,” computer scientist Melanie Mitchell pointed out in a May edition of her Substack newsletter.

These evaluations don’t necessarily deliver all that they claim, and they might not be a good match for today’s AI. One study posted earlier this year at arXiv.org tested 11 LLMs and found that just changing the order of the multiple-choice answers in a benchmark like MMLU can affect performance.
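To get a feel for what that kind of sensitivity test involves, here is a minimal sketch. The `ask_model` function is a hypothetical placeholder standing in for whatever LLM is being evaluated; the check simply reorders the answer options of a multiple-choice item and reports how often the model still lands on the correct answer text rather than on a favored position.

```python
import itertools
import random

def ask_model(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call; returns the letter the model picks."""
    return random.choice("ABCD")  # a real check would query the model under test

def order_sensitivity(stem: str, options: list[str], correct: str, n_orders: int = 6) -> float:
    """Fraction of option orderings for which the model still picks the correct answer text."""
    orderings = list(itertools.permutations(options))[:n_orders]
    hits = 0
    for ordering in orderings:
        letters = "ABCD"[:len(ordering)]
        prompt = stem + "\n" + "\n".join(f"{l}. {o}" for l, o in zip(letters, ordering))
        picked = ask_model(prompt).strip()[:1].upper()
        hits += dict(zip(letters, ordering)).get(picked, "") == correct
    return hits / len(orderings)

score = order_sensitivity(
    "Which organ produces insulin?",
    ["Pancreas", "Liver", "Kidney", "Spleen"],
    correct="Pancreas",
)
# A score well below 1.0 means the model's answer depends on where the option appears,
# not only on what the option says.
```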

Still, industry leaders tend to conflate impressive performance on the tasks LLMs are trained to do, like engaging in conversation or summarizing text, with higher-level cognitive capabilities like understanding, knowledge and reasoning, which are hard to define and harder to evaluate. But for LLMs, generating content is not dependent on understanding it, researchers reported in a study presented in May in Vienna at the International Conference on Learning Representations. When the researchers asked GPT-4 and other AI models to answer questions based on AI-generated text or images, the models frequently couldn’t answer correctly.

Nouha Dziri, a research scientist studying language models at the Allen Institute for AI in Seattle and coauthor on that study, calls that “a paradox compared to how humans actually operate.” For humans, she says, “understanding is a prerequisite for the ability to generate the correct text.”

What’s more, as Mitchell and colleagues note in a paper in Science last year, benchmark performance is often reported with aggregate metrics that “obfuscate key information about where systems tend to succeed or fail.” Any desire to look deeper is thwarted because specific details of performance aren’t made publicly available.

Researchers are now imagining how better assessments might be designed. “In practice, it’s hard to do good evaluations,” says Yanai Elazar, who also studies language models at the Allen Institute. “It’s an active research area that many people are working on and making better.”

Why cognitive benchmarks don’t always work

Aside from transparency and inflated claims, there are underlying issues with benchmark evaluations.

One of the challenges is that benchmarks are good for only a certain amount of time. There’s a concern that today’s LLMs have been trained on the testing data from the very benchmarks intended to evaluate them. The benchmark datasets are available online, and the training data for LLMs are typically scraped from the entire Web. For instance, a technical report from OpenAI, which developed ChatGPT, acknowledged that portions of benchmark datasets including BIG-bench and DROP were part of GPT-4’s training data. There’s some evidence that GPT-3.5, which powers the free version of ChatGPT, has encountered the MMLU benchmark dataset.

But much of the training data is not disclosed. “There’s no way to prove or disprove it, outside of the company just purely releasing the training datasets,” says Erik Arakelyan of the University of Copenhagen, who studies natural language understanding.

Today’s LLMs might also rely on shortcuts to arrive at the correct answers without performing the cognitive task being evaluated. “The problem often comes when there are things in the data that you haven’t thought about necessarily, and basically the model can cheat,” Elazar says. For instance, a study reported in 2019 found evidence of such statistical associations in the Winograd Schema Challenge dataset, a commonsense reasoning benchmark that predates WinoGrande.

The Winograd Schema Challenge, or WSC, was proposed in 2011 as a test for intelligent behavior of a system. Though many people are familiar with the Turing test as a way to evaluate intelligence, researchers had begun to propose modifications and alternatives that weren’t as subjective and didn’t require the AI to engage in deception to pass the test (SN: 6/15/12).

Instead of a free-form conversation, WSC features pairs of sentences that mention two entities and use a pronoun to refer to one of the entities. Here’s an example pair:

Sentence 1: In the storm, the tree fell down and crashed through the roof of my house. Now, I have to get it removed.

Sentence 2: In the storm, the tree fell down and crashed through the roof of my house. Now, I have to get it repaired.

A language model scores correctly if it can successfully match the pronoun (“it”) to the right entity (“the roof” or “the tree”). The sentences usually differ by a special word (“removed” or “repaired”) that, when exchanged, changes the answer. Presumably, only a model that relies on commonsense world knowledge, rather than linguistic clues, could provide the correct answers.
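One simple way to score such an item, sketched below, is to substitute each candidate entity for the pronoun and pick whichever substitution the language model finds more plausible. The `sentence_logprob` function is a hypothetical placeholder for any interface that returns a model's log probability for a sentence; the toy stand-in here exists only so the sketch runs.

```python
def sentence_logprob(sentence: str) -> float:
    """Hypothetical placeholder: return the model's log probability for the sentence.
    Any language-model scoring interface could be plugged in here."""
    return -float(len(sentence))  # toy stand-in so the sketch runs end to end

def resolve_winograd(sentence: str, pronoun: str, candidates: list[str]) -> str:
    """Pick the candidate whose substitution for the pronoun the model finds most plausible."""
    scores = {c: sentence_logprob(sentence.replace(f" {pronoun} ", f" {c} ", 1)) for c in candidates}
    return max(scores, key=scores.get)

pair = [
    "The tree fell and crashed through the roof of my house. Now, I have to get it removed.",
    "The tree fell and crashed through the roof of my house. Now, I have to get it repaired.",
]
for sentence in pair:
    print(resolve_winograd(sentence, "it", ["the tree", "the roof"]))
# With a real language model, the first sentence should resolve to "the tree"
# and the second to "the roof".
```

Scoring by plausibility is exactly where statistical shortcuts can creep in, as the next example shows.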

But it turns out that in WSC, there are statistical associations that offer clues. Consider the example above. Large language models, trained on huge amounts of text, would have encountered many more examples of a roof being repaired than a tree being repaired. A model might select the statistically more likely word among the two options rather than rely on any kind of commonsense reasoning.

In a study reported in 2021, Elazar and colleagues gave nonsensical modifications of WSC sentences to RoBERTa, an LLM that has scored more than 80 percent on the WSC benchmark in some cases. The model got it right at least 60 percent of the time even though humans wouldn’t be expected to answer correctly. Since random guessing would be expected to yield only about 50 percent, spurious associations must have been giving away the answers.
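The gist of that style of probe can be sketched as follows, with a hypothetical `resolve` placeholder standing in for the model being tested: swap the disambiguating word for a nonsense token, then check whether accuracy stays above the roughly 50 percent that blind guessing would yield.

```python
import random

NONSENSE_WORDS = ["flibbered", "quazzled", "snorped"]

def resolve(sentence: str, candidates: list[str]) -> str:
    """Hypothetical placeholder for the model's pronoun resolution; a real probe queries the LLM."""
    return random.choice(candidates)

def nonsense_probe(items: list[dict]) -> float:
    """Mangle each item by swapping its special word for nonsense, then measure accuracy.
    Accuracy well above 0.5 on mangled items points to spurious cues, since the
    disambiguating information humans would rely on has been destroyed."""
    correct = 0
    for item in items:
        mangled = item["sentence"].replace(item["special_word"], random.choice(NONSENSE_WORDS))
        correct += resolve(mangled, item["candidates"]) == item["answer"]
    return correct / len(items)

items = [{
    "sentence": "The tree crashed through the roof. Now I have to get it repaired.",
    "special_word": "repaired",
    "candidates": ["the tree", "the roof"],
    "answer": "the roof",
}]
print(nonsense_probe(items))
```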

To be good measures of progress, benchmark datasets cannot be static. They must be adapted alongside state-of-the-art models and rid of any specious shortcuts, Elazar and other evaluation researchers say. In 2019, after the WSC shortcuts had come to light, another group of researchers released the now commonly used WinoGrande as a harder commonsense benchmark. The benchmark dataset has more than 43,000 sentences with an accompanying algorithm that can filter out sentences with spurious associations.

For some researchers, the fact that LLMs are passing benchmarks so easily simply means that more comprehensive benchmarks need developing. For instance, researchers might turn to a collection of varied benchmark tasks that tackle different facets of common sense such as conceptual understanding or the ability to plan future scenarios. “The challenge is how do we come up with a more adversarial, more challenging task that will tell us the true capabilities of these language models,” Dziri says. “If the model is scoring 100 percent on them, it might give us a false illusion about their capabilities.”

But others are more skeptical that models that perform well on the benchmarks necessarily possess the cognitive abilities in question. If a model tests well on a dataset, it just tells us that it performs well on that particular dataset and nothing more, Elazar says. Even though WSC and WinoGrande are considered tests for common sense, they just test for pronoun identification. HellaSwag, another commonsense benchmark, tests how well a model can pick the most probable ending for a given scenario.

While these individual tasks might require common sense or understanding if constructed correctly, they still don’t make up the entirety of what it means to have common sense or to understand. Other forms of commonsense reasoning, involving social interactions or comparing quantities, have been poorly explored.

Taking a different approach to testing

Systematically digging into the mechanisms required for understanding may offer more insight than benchmark tests, Arakelyan says. That might mean testing AI’s underlying grasp of concepts using what are called counterfactual tasks. In these cases, the model is presented with a twist on a commonplace rule that it is unlikely to have encountered in training, say an alphabet with some of the letters mixed up, and asked to solve problems using the new rule.
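As a concrete illustration, here is a minimal sketch of one such counterfactual task. It is my own construction rather than a published benchmark: shuffle the alphabet, state the new ordering in the prompt, and ask a question whose answer depends on the new rule rather than the familiar one.

```python
import random
import string

def make_counterfactual_alphabet(seed: int = 0) -> str:
    """Shuffle the familiar alphabet so the rule is unlikely to appear in training data."""
    letters = list(string.ascii_lowercase)
    random.Random(seed).shuffle(letters)
    return "".join(letters)

def build_prompt(alphabet: str, letter: str, offset: int) -> tuple[str, str]:
    """Return the question and the ground-truth answer under the new alphabet."""
    question = (
        f"Use this alphabet instead of the usual one: {alphabet}\n"
        f"What letter comes {offset} positions after '{letter}' in this alphabet?"
    )
    answer = alphabet[(alphabet.index(letter) + offset) % len(alphabet)]
    return question, answer

prompt, truth = build_prompt(make_counterfactual_alphabet(), "c", 3)
# Feed `prompt` to the model under test and compare its reply with `truth`.
# A model that truly follows the stated rule, rather than recalling the standard
# alphabet it has memorized, should handle the swapped ordering too.
```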

Other approaches include analyzing the AI’s ability to generalize from simple to more complex problems or directly probing under what circumstances AI fails. There might also be ways to test for commonsense reasoning, for example, by ruling out unrelated mechanisms like memorization, pattern-matching and shortcuts.

In a study reported in March, Arakelyan and colleagues tested whether six LLMs that score highly on language understanding benchmarks, and thus are said to understand the overall meaning of a sentence, can also understand a slightly paraphrased but logically equivalent version of the same sentence.

Language understanding is typically evaluated using a task called natural language inference. The LLM is presented with a premise and a hypothesis and asked to decide whether the premise implies, contradicts or is neutral toward the hypothesis. But as the models become bigger, trained on more and more data, more carefully crafted evaluations are required to determine whether the models are relying on shortcuts that, say, focus on single words or sets of words, Arakelyan says.

To try to get a better sense of language understanding, the team compared how a model answered the standard test with how it answered when given the same premise sentence but with slightly paraphrased hypothesis sentences. A model with true language understanding, the researchers say, would make the same decisions as long as the slight alteration preserves the original meaning and logical relationships. For instance, the premise sentence “There were beads of perspiration on his brow” implies the hypothesis “Sweat built up upon his face” as well as the slightly altered “The sweat had built up on his face.”

The team used a separate LLM, called flan-t5-xl and released by Google, to come up with variations of hypothesis sentences from three popular English natural language inference datasets. The LLMs under testing had encountered one of the datasets during training but not the other two. First, the team tested the models on the original datasets and picked for paraphrasing only those sentences that the models classified correctly. This ensured that any performance difference could be attributed to the sentence variations. On top of that, the researchers fed the original hypothesis sentences and their variations to language models identical to the ones being tested, which judged whether each pair was equivalent in meaning. Only pairs deemed equivalent by both the model and human evaluators were used to test language understanding.
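The core of such a consistency check can be sketched in a few lines. The `nli_label` function is a hypothetical placeholder for the model under test; in the study itself, flan-t5-xl plus human checks were used to generate and vet the paraphrases, which are taken as given here.

```python
def nli_label(premise: str, hypothesis: str) -> str:
    """Hypothetical placeholder: a real check would query the model under test and
    return one of 'entailment', 'contradiction' or 'neutral'."""
    return "entailment"  # toy stand-in so the sketch runs

def consistency_rate(examples: list[dict]) -> float:
    """Fraction of items on which the model keeps its decision when the hypothesis
    is swapped for a meaning-preserving paraphrase."""
    stable = sum(
        nli_label(ex["premise"], ex["hypothesis"]) == nli_label(ex["premise"], ex["paraphrase"])
        for ex in examples
    )
    return stable / len(examples)

examples = [{
    "premise": "There were beads of perspiration on his brow.",
    "hypothesis": "Sweat built up upon his face.",
    "paraphrase": "The sweat had built up on his face.",
}]
print(consistency_rate(examples))  # a model with stable understanding should score near 1.0
```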

But for a sizable number of sentences, the models tested changed their decision, sometimes even switching from “implies” to “contradicts.” When the researchers used sentences that did not appear in the training data, the LLMs changed as many as 58 percent of their decisions.

“This essentially means that models are very finicky when understanding meaning,” Arakelyan says. This type of framework, unlike benchmark datasets, can better reveal whether a model has true understanding or whether it is relying on clues like the distribution of the words.

How to evaluate step by step

Tracking an LLM’s step-by-step process is another way to systematically assess whether it uses reasoning and understanding to arrive at an answer. In one approach, Dziri’s team tested the ability of LLMs including GPT-4, GPT-3.5 and GPT-3 (a predecessor of both) to carry out multidigit multiplication. A model has to break down such a task into sub-steps that researchers can examine individually.

After giving the LLMs a problem, like 7 x 29, the researchers checked the answers at each sub-step — after single-digit multiplication, after carrying over and after summation. While the models were perfect at multiplying single- and two-digit numbers, accuracy deteriorated as the number of digits increased. For multiplication problems with four- and five-digit numbers, the models hardly got any answers right. Lower-digit problems “can be easily memorized,” Dziri says, but the LLMs’ performance “starts degrading when we increase the complexity.”
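For reference, the sub-steps of grade-school long multiplication can be generated programmatically, giving ground truth to compare against a model's worked solution at each stage. This sketch is a generic reconstruction of the algorithm, not the team's actual evaluation harness.

```python
def multiplication_steps(a: int, b: int) -> dict:
    """Break a * b into grade-school sub-steps: one partial product per digit of b
    (each shifted by its place value), followed by the final summation."""
    partials = []
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partials.append(a * digit * 10**place)  # single-digit multiply with carries, then shift
    return {"partial_products": partials, "final": sum(partials)}

steps = multiplication_steps(7, 29)
# {'partial_products': [63, 140], 'final': 203}
# An evaluation can compare the model's intermediate lines against 'partial_products'
# and its last line against 'final', catching cases where a right answer hides wrong work.
```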

Perhaps the models hadn’t encountered enough examples in the training data to learn how to solve more complex multiplication problems. With that in mind, Dziri and colleagues further fine-tuned GPT-3 by training it on nearly all the multiplication problems up to four digits by two digits, as well as providing step-by-step instructions on how to solve all the multiplication problems up to three digits by two digits. The team reserved 20 percent of the multiplication problems for testing.

Without access to the models’ original training data and process, the researchers don’t know how the models might be tackling the task, Dziri says. “We have this simple assumption that if we humans follow this algorithm, it should be quite intuitive for the model to follow it, because it’s been trained on human language and human reasoning tasks.”

For humans, carrying out five- or six-digit multiplication is fairly straightforward. The underlying approach is no different from multiplying fewer digits. But though the model performed with near-perfect accuracy on examples it had encountered during training, it stumbled on unseen examples. These results indicate that the model was unable to learn the underlying reasoning needed for multidigit multiplication and apply these steps to new examples.

Surprisingly, when the researchers investigated the models’ answers at each sub-step, they found that even when the final answers were right, the underlying calculations and reasoning — the answers at each sub-step — could be completely wrong. This confirms that the model sometimes relies on memorization, Dziri says. Though the answer might be right, it doesn’t say anything about the LLM’s ability to generalize to harder problems of the same nature — a key part of true understanding or reasoning.

New tests of generative AI will be hard

Even though interest in such nuanced evaluations is gaining steam, it’s challenging to create rigorous tests because of the sheer scale of data and training, plus the proprietary nature of LLMs.

For instance, trying to rule out memorization may require checking millions of data points in huge training datasets to see if the LLM has encountered the example before. It’s harder still when training data aren’t available for scrutiny. “We have to make lots of assumptions, and we have to pick our task very carefully,” Dziri says. Sometimes researchers trying to do an evaluation can’t get access to the training methodology or a version of the model itself (let alone the most updated version).

The cost of computation is another constraint. For instance, Dziri and colleagues found that including five-digit by five-digit multiplication problems in their fine-tuning of GPT-3 would require about 8.1 billion question-and-answer examples, costing a total of over $12 million.

In truth, a perfect AI evaluation might never exist. The more language models improve, the harder tests will have to get to provide any meaningful assessment. The testers will always have to be on their toes. And it’s likely even the latest, greatest tests will uncover only some specific aspects of AI’s capabilities, rather than assessing anything akin to general intelligence.

For now, researchers are hoping at least for more consistency and transparency in evaluations. “Mapping the model’s ability to human understanding of a cognitive capability is already a vague statement,” Arakelyan says. Only evaluation practices that are well thought out and can be critically examined will help us understand what’s actually going on inside AI.
