
The quest to teach LLMs how to count

IBM researchers propose a variation on a popular LLM architecture to improve their memory and logical reasoning capabilities and make certain kinds of math problems more intuitive.

The transformer architecture underlying today’s large language models has challenged our assumptions about what AI can do. Code-writing, language translation, and synthesizing information from the far corners of the web — these are all tasks that LLMs today can seemingly handle with ease. In a few short years, these models have changed how we work.

But transformer-based models have inherent limitations that researchers are bumping up against as they try to extend their capabilities and lower their computational costs. One of the transformer’s core flaws, at least theoretically, lies in its limited capacity to model sequential data. This can make some kinds of counting problems — like tallying the number of “r”s in “strawberry” — surprisingly difficult.

By breaking the problem into steps through chain-of-thought (CoT) prompting or switching to a “universal” architecture with more sequential processing, a transformer-based model can hit on the right answer. But workarounds like these tend to add a lot of time and cost to an already expensive process.

Transformers revolutionized natural language processing through self-attention, the mechanism that lets them process long blocks of text all at once. Attention gives transformers a richer sense of how words relate to each other and convey meaning. But by crunching data in parallel, out of order, transformers struggle with something called state tracking: incrementally updating their model of the world as circumstances change.
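For readers who want to see that parallelism concretely, here is a minimal sketch of scaled dot-product self-attention in Python with NumPy. The shapes and random weights are illustrative placeholders, not anything taken from a production model:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Every token attends to every other token in one parallel matrix product;
    # no running state is carried from position to position.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) pairwise relations
    return softmax(scores) @ V                # weighted mix of all positions at once

rng = np.random.default_rng(0)
seq_len, d = 8, 16
X = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (8, 16)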

State tracking is how we keep track of details in a long conversation, evaluate a piece of code, or follow an opponent's last move in a chess match. Transformers memorize each new observation for later retrieval, a strategy AI researcher Albert Gu likens to a database. They can recall facts precisely, but they also run out of memory when the history gets too long.

Older language models, based on a recurrent neural network (RNN) architecture, work a little differently. They process text word by word, summarizing past inputs into a compressed state they can reference as new text streams in. Their memories may be hazier, but because they've been compressed, they can stretch farther back in time.
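For contrast, here is an equally minimal sketch of RNN-style processing, again with illustrative shapes: the entire history is folded, one token at a time, into a single fixed-size state vector.

import numpy as np

def rnn_scan(X, Wh, Wx):
    # Strictly sequential: each step compresses the old summary plus the new
    # token into a new summary of the same fixed size.
    h = np.zeros(Wh.shape[0])
    for x in X:
        h = np.tanh(Wh @ h + Wx @ x)
    return h                                  # same size no matter how long X is

rng = np.random.default_rng(1)
seq_len, d, hidden = 1000, 16, 32
X = rng.normal(size=(seq_len, d))
Wh = rng.normal(size=(hidden, hidden)) * 0.1
Wx = rng.normal(size=(hidden, d)) * 0.1
print(rnn_scan(X, Wh, Wx).shape)              # (32,): a hazy but compact memory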

To boost inferencing speeds, IBM and other companies have integrated RNN-like processing into their LLMs. IBM’s Granite family of models interleaves transformer layers with those of a state space model (SSM), which has an RNN-like structure. Granite’s hybrid Bamba architecture, inspired by Nvidia’s hybrid pairing of the Mamba2 SSM with a transformer, dramatically improved efficiency but lacked a universal state tracking capability.

In a spotlight poster at NeurIPS 2025, IBM researchers zero in on the algebra and propose a new way of structuring the SSM’s transition matrices to enable state tracking. In experiments, their improved model, which they call PD-SSM, significantly outperformed other SSM variants at universal state tracking tasks. When researchers integrated their method into a hybrid transformer-SSM model, they found it could handle more complex tasks, including having the model predict what comes next in a series of time-ordered events described in natural language.

The work could be relevant beyond the academic exercises in the study. Evidence suggests that state tracking is especially important for code generation, an AI application that IBM and others have intently pursued. “We’re excited to explore these implications further,” said the study’s lead author, Aleksandar Terzic, an IBM researcher focused on AI’s mathematical underpinnings.

The canary in the coal mine

In the 1950s, language theorist Noam Chomsky proposed a way of organizing all languages by the complexity of their “grammars.” The simplest could be modeled by something called a finite state machine, which has limited memory and, like a traffic light or turnstile, transitions from one state to another in response to defined inputs — a timer for the light, or a token for the turnstile. The highest rung of Chomsky’s hierarchy was reserved for Turing-complete “unrestricted” grammars that could compute anything computable.
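The turnstile makes for a compact worked example. In this minimal Python sketch, the transition table is the entire “grammar,” and the only memory is the current state:

TRANSITIONS = {
    ("locked",   "token"): "unlocked",   # a token unlocks the turnstile
    ("locked",   "push"):  "locked",     # pushing a locked turnstile does nothing
    ("unlocked", "push"):  "locked",     # walking through locks it again
    ("unlocked", "token"): "unlocked",   # extra tokens are wasted
}

def run_turnstile(inputs, state="locked"):
    for symbol in inputs:
        state = TRANSITIONS[(state, symbol)]  # memory is just the current state
    return state

print(run_turnstile(["token", "push", "push", "token"]))   # -> unlocked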

Even before ChatGPT took the world by storm, AI researchers were using Chomsky’s methodology to probe the limitations of deep neural networks. In 2020, Stanford researcher Michael Hahn proved mathematically that a transformer had limited ability to perform state tracking tasks at the lowest level of Chomsky’s hierarchy.

Perhaps the best-known of these tasks, familiar to anyone who has taken computation theory 101, is the parity problem: you’re given a string of ones and zeros and asked to compute whether the number of ones adds up to an odd or even number. Hahn showed that the transformer would theoretically fail the parity test at some bit length.
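Parity itself is just a two-state machine, which is what makes the transformer’s difficulty with it so striking. A minimal sketch of the computation:

def parity(bits):
    state = 0                      # 0 = even number of ones seen so far
    for b in bits:
        state ^= b                 # flip the state on every 1
    return "odd" if state else "even"

print(parity([1, 0, 1, 1, 0]))     # three ones -> odd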

“If transformers cannot compute parity, they also cannot evaluate logical formulas accurately,” he wrote, foreshadowing the internet’s fascination with the “strawberry” problem years later.

IBM researcher Shawn Tan was struck by the parity problem as a graduate student at Quebec’s Mila AI Institute, calling it a “canary in the coal mine” on his blog at the time. Now training IBM Granite models, Tan continues to follow the debate. “If a model can’t track two states sequentially, how can we expect it to solve more challenging problems?” he said recently.

The state tracking problem inspired other researchers to kick the tires of not only the transformer but also of sequential models like RNNs, LSTMs, and modern SSMs. Terzic’s spotlight paper at NeurIPS is the latest in this series.

Restructuring the SSM’s transition matrices

Used for decades to model dynamic systems, SSMs crossed over to deep learning in 2021 when Gu and his colleagues at Stanford introduced the S4 model. With its fixed-memory footprint, S4 could process long sequences more skillfully than RNNs and much faster than transformers.

Diagonal matrices, which simplified computation and reduced memory overhead, were introduced later and helped SSMs scale and compete with transformers on complex tasks. Most SSMs today use diagonal matrices, but the performance and efficiency gains typically come at the expense of state tracking. Diagonal SSMs have been shown to have the same limitations as transformers at parity and similar abstract math problems.
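As a rough illustration of what’s at stake (shapes and values here are arbitrary), a diagonal SSM updates each state component independently, while a full transition matrix can mix, and in particular permute, the components — the kind of operation finite-state tasks like parity rely on:

import numpy as np

def diagonal_ssm_step(h, a_t, B, x_t):
    # h_t = a_t * h_{t-1} + B @ x_t: the transition is purely elementwise,
    # so state components never interact with one another.
    return a_t * h + B @ x_t

def general_ssm_step(h, A_t, B, x_t):
    # h_t = A_t @ h_{t-1} + B @ x_t: a full matrix can mix or permute components.
    return A_t @ h + B @ x_t

rng = np.random.default_rng(2)
n, d = 4, 3                                   # state and input sizes, arbitrary
h, x = rng.normal(size=n), rng.normal(size=d)
a_t, A_t, B = rng.normal(size=n), rng.normal(size=(n, n)), rng.normal(size=(n, d))
print(diagonal_ssm_step(h, a_t, B, x).shape)  # (4,)
print(general_ssm_step(h, A_t, B, x).shape)   # (4,)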


In their new paper, IBM researchers propose moving away from diagonal matrices to restore state tracking in SSMs. They tested a variety of SSMs on parity and related state tracking tasks, using strings up to six times longer than those the models trained on to measure their ability to generalize. Their PD-SSM model outperformed the others by at least 15 percentage points. PD-SSM’s average score of 98.5% was topped only by an LSTM,¹ which is good at state tracking but difficult to scale.
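The paper’s exact construction isn’t reproduced in this article, so the sketch below is only an illustrative assumption: one cheap way to make the transition non-diagonal is to pair an input-dependent permutation P_t with a diagonal scaling D_t, giving A_t = P_t D_t. A transition that can route state components like this is enough to realize small finite-state machines such as parity:

import numpy as np

def pd_transition(perm, diag):
    # Illustrative only: build A_t = P_t @ D_t from a permutation and a diagonal.
    P = np.eye(len(diag))[perm]        # identity rows reordered: a permutation matrix
    return P @ np.diag(diag)

h = np.array([1.0, 0.0])               # one-hot state: "even" so far
A_swap = pd_transition([1, 0], np.ones(2))   # input 1: swap even <-> odd
A_keep = pd_transition([0, 1], np.ones(2))   # input 0: keep the state
for bit in [1, 0, 1, 1]:
    h = (A_swap if bit else A_keep) @ h
print(h)                                # [0. 1.]: an odd number of ones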

IBM’s PD-SSM model also did well on real-life tasks involving long time-series datasets — things like classifying heartbeats and forecasting ethanol demand. They even tested a hybrid model, reminiscent of IBM’s Bamba architecture, on a word version of parity and found that it could compute the correct answer out to 40 characters while other SSMs struggled after 15 characters.

The challenge for researchers now is translating these theoretical insights into practical improvements. “Our team is fully focused on transferring the benefits of PD-SSM into Granite-4,” said Abbas Rahimi, the IBM researcher leading the work.

If LLMs can master the strawberry and parity problems, who knows what more they might be able to achieve?

Notes

  1. LSTM, or Long Short-Term Memory, is a type of recurrent neural network (RNN) designed to recognize patterns in sequential data by learning and remembering long-term dependencies. ↩︎
