Key Points
The fragility highlighted in these new results supports previous research suggesting that LLMs' probabilistic pattern matching lacks the formal understanding of underlying concepts needed for truly reliable mathematical reasoning.
In "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models"currently available as a preprint paperthe six Apple researchers start with GSM8K's standardized set of more than 8,000 grade-school level mathematical word problems, which is often used as a benchmark for modern LLMs' complex reasoning capabilities..
This kind of variance, both within different GSM-Symbolic runs and compared to GSM8K results, is more than a little surprising since, as the researchers point out, "the overall reasoning steps needed to solve a question remain the same."
The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any "formal" reasoning but are instead "attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data."
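To make that methodology concrete, here is a minimal Python sketch of the kind of templated variation the paper describes. The template, names, and value ranges below are invented for illustration (they are not the paper's actual templates); the point is that the surface details change while the arithmetic, and thus the reasoning steps, stay fixed.

```python
import random

# Hypothetical GSM8K-style template: names and numbers are placeholders,
# but the arithmetic structure (and thus the reasoning steps) never changes.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{name} then gives away {z} apples. How many apples are left?"
)

NAMES = ["Sophie", "Liam", "Mia"]  # invented surface details

def make_variant(seed: int) -> tuple[str, int]:
    """Instantiate one GSM-Symbolic-style variant and its ground-truth answer."""
    rng = random.Random(seed)
    x, y = rng.randint(2, 20), rng.randint(2, 20)
    z = rng.randint(1, x + y)          # keep the answer non-negative
    question = TEMPLATE.format(name=rng.choice(NAMES), x=x, y=y, z=z)
    answer = x + y - z                 # the ground truth is fixed algebra
    return question, answer

for seed in range(3):
    q, a = make_variant(seed)
    print(q, "->", a)
```

Because the ground truth comes from the same expression (x + y - z) every time, any accuracy swing across variants can only be attributed to the changed surface details, not to harder math.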
AI expert Gary Marcus, in his analysis of the new GSM-Symbolic paper, argues that the next big leap in AI capability will only come when these neural networks can integrate true "symbol manipulation, in which some knowledge is represented truly abstractly in terms of variables and operations over those variables, much as we see in algebra and traditional computer programming." Until then, we're going to get the kind of brittle "reasoning" that can lead AI models to fail mathematical tests in ways that calculators never do.
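For contrast, here is a hedged sketch of what Marcus means by symbol manipulation. The tiny expression tree below is an illustrative stand-in, not anything from the paper: once a problem is captured as variables and operations over them, evaluation is deterministic and immune to renamed entities or resampled numbers, which is exactly why calculators never show this brittleness.

```python
from dataclasses import dataclass

# A tiny abstract-syntax tree: knowledge held "truly abstractly in terms of
# variables and operations over those variables" (Marcus's phrase).
@dataclass
class Var:
    name: str

@dataclass
class Op:
    fn: str            # "+" or "-"
    left: object
    right: object

def evaluate(node, env: dict[str, int]) -> int:
    """Deterministic evaluation: same structure + same bindings = same answer."""
    if isinstance(node, Var):
        return env[node.name]
    lhs, rhs = evaluate(node.left, env), evaluate(node.right, env)
    return lhs + rhs if node.fn == "+" else lhs - rhs

# (x + y) - z, independent of whose apples they are or how they are described
expr = Op("-", Op("+", Var("x"), Var("y")), Var("z"))
print(evaluate(expr, {"x": 7, "y": 5, "z": 3}))   # 9, every time
```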