For some time now, companies like OpenAI and Google have been touting advanced “reasoning” capabilities as the next big step in their latest AI models. Now, however, a new study by six Apple engineers shows that the mathematical “reasoning” demonstrated by advanced large language models can be extremely fragile and unreliable in the face of seemingly trivial changes in common benchmark problems.
The fragility highlighted in these new results helps support previous research suggesting that LLMs' use of probabilistic pattern matching lacks the formal understanding of underlying concepts needed for truly reliable mathematical reasoning. “Current LLMs are not capable of true logical reasoning,” the researchers hypothesized based on these results. “Instead, they try to replicate the reasoning steps seen in their training data.”
Mix it up
In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” currently available as a preprint, the six Apple researchers start with GSM8K's standardized set of more than 8,000 grade-school-level math word problems, which is commonly used as a benchmark for the complex reasoning abilities of modern LLMs. They then take the novel approach of modifying a portion of that test set to dynamically replace certain names and numbers with new values, so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.
This approach helps avoid any potential “data contamination” that can result from static GSM8K questions being fed directly into an AI model's training data. At the same time, these incidental changes do not alter the actual difficulty of the underlying mathematical reasoning at all, meaning the models should theoretically perform just as well when tested on GSM-Symbolic as on GSM8K.
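To give a rough sense of the templating idea the paper describes, here is a minimal Python sketch. The names, relations, value ranges, and question wording below are invented for illustration and are not the paper's actual templates: the point is simply that names and numeric operands become variables that are re-sampled for each test instance while the arithmetic structure, and thus the difficulty, stays fixed.

```python
import random

# Hypothetical illustration of GSM-Symbolic-style templating: the names and
# numbers in a GSM8K-style word problem become variables that are re-sampled
# for every test instance, while the arithmetic itself never changes.
NAMES = ["Sophie", "Bill", "Maria", "Omar"]
RELATIONS = ["nephew", "brother", "sister", "cousin"]

def make_variant(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(NAMES)
    relation = rng.choice(RELATIONS)
    bought = rng.randint(10, 50)        # sampled value replaces a fixed number like "31"
    lost = rng.randint(1, bought - 1)   # second operand, kept smaller than the first
    question = (
        f"{name} bought {bought} building blocks for a {relation}, "
        f"but {lost} of them were lost on the way home. "
        f"How many blocks does the {relation} receive?"
    )
    return question, bought - lost      # ground truth recomputed from the sampled values

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```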
Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops of between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model, and for some reason changing the numbers tended to hurt accuracy more than changing the names.
This kind of variance, both across different GSM-Symbolic runs and compared to GSM8K results, is more than a little surprising because, as the researchers point out, “the overall reasoning steps needed to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any “formal” reasoning but are instead “attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”
Don’t get distracted
Yet the overall variance shown in the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI's ChatGPT-4o, for example, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That's a pretty high success rate on either benchmark, regardless of whether the model itself uses “formal” reasoning behind the scenes (although overall accuracy for many models dropped off sharply when the researchers added just one or two additional logical steps to the problems).
However, the tested LLMs fared much worse when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately irrelevant statements” to the questions. For this set of “GSM-NoOp” (short for “no operation”) benchmarks, a question about how many kiwis someone picks over several days might be modified to include the incidental detail that “five of them [the kiwis] were slightly smaller than average.”
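As a rough sketch of what such a distractor could look like in practice (the question and clause wording below are invented for illustration, not taken from the paper's dataset), the irrelevant statement is simply appended to an otherwise unchanged question, so the correct answer stays the same:

```python
# Hypothetical sketch of a GSM-NoOp-style distractor: an inconsequential clause
# is added to an otherwise unchanged question, so the correct answer is unaffected.
NO_OP_CLAUSES = [
    "Note that five of the kiwis were slightly smaller than average.",
    "Incidentally, the weather was unusually warm that week.",
]

def add_no_op(question: str, clause_index: int = 0) -> str:
    # Append the distractor sentence to the end of the question text.
    return question.rstrip() + " " + NO_OP_CLAUSES[clause_index]

original = (
    "A farmer picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "How many kiwis were picked in total?"
)
print(add_no_op(original))  # the arithmetic (44 + 58) is unchanged by the added clause
```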
Adding these red herrings resulted in what the researchers called “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limitations of using simple “pattern matching” to “convert statements into operations without really understanding their meaning,” the researchers wrote.