🧠 The Illusion of Thinking: What is AI reasoning really?

Apple researchers expose the hidden limitations of reasoning models and what it means for AI's future - Paper



The Mirage of Machine Reasoning

What does it mean to reason? According to Merriam-Webster, reasoning encompasses "the power of comprehending, inferring, or thinking especially in orderly rational ways": essentially, our capacity for intelligence itself. It involves providing rational grounds or motives for our conclusions, offering sufficient explanations, and exercising proper mental faculties in logical defense of our thinking.

When OpenAI's o1 model was released, showing step-by-step reasoning traces that mirrored these very human cognitive processes, it felt like a profound breakthrough. Large Reasoning Models (LRMs) appeared to finally exhibit genuine reasoning: methodically working through problems, providing explanations, and demonstrating what looked like orderly rational thinking. But a new paper from Apple researchers suggests this compelling display might be more sophisticated illusion than authentic intelligence. That's a bold claim, so let's dive in!

The study, "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," systematically tests what happens when these models encounter problems of varying computational complexity. The results are sobering: beyond a certain threshold, even the most advanced reasoning models don't just struggle, they collapse entirely, revealing fundamental gaps between the appearance of reasoning and its substance.

Key Finding: LRMs face complete accuracy collapse beyond certain complexities, and exhibit a counter-intuitive scaling limit where reasoning effort increases with problem complexity up to a point, then declines despite having adequate computational budget.

Figure: reasoning model performance across complexity levels, from Apple's research paper.

TLDR: Key Research Findings

Paper abstract distilled into bullet points

  • Goal – Probe Large Reasoning Models (LRMs) that emit a chain of thought before answering, to see where they shine or stumble.
  • Method – Created "puzzle environments" where the researchers can dial compositional complexity up or down while keeping the logical structure fixed, letting them inspect both internal traces and final answers.
  • Three performance regimes identified:
    1. Low complexity: standard LLMs surprisingly beat LRMs.
    2. Medium complexity: the added reasoning steps in LRMs give an edge.
    3. High complexity: both model types crash (accuracy collapse).
  • Scaling surprise – LRMs put in more reasoning effort as tasks get harder, but beyond a threshold that effort drops even when the token budget is ample.
  • Key limitations uncovered – LRMs often skip explicit algorithms, produce inconsistent chains, and struggle with exact computation, hinting that "thinking-trace length ≠ true reasoning capability."
  • Broader takeaway – Evaluations that look only at final accuracy can miss these nuanced failure modes; inspecting the quality of thought traces under controlled task difficulty gives better insight into model reasoning limits.
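The methodology above can be pictured with a minimal sketch (my own illustration, not the authors' code): a Tower of Hanoi environment where a single knob, the number of disks, dials compositional complexity while the rules stay fixed, and every move in a model's answer can be checked individually.

```python
# Minimal Tower of Hanoi environment in the spirit of the paper's
# controlled puzzle setup (illustrative sketch, not the authors' code).
# Complexity is dialed with one parameter: the number of disks n.

def initial_state(n):
    """Three pegs; all n disks start on peg 0, largest at the bottom."""
    return [list(range(n, 0, -1)), [], []]

def is_legal(state, move):
    """A move (src, dst) is legal if src is non-empty and the moved
    disk is smaller than the disk currently on top of dst."""
    src, dst = move
    if not state[src]:
        return False
    return not state[dst] or state[src][-1] < state[dst][-1]

def apply_move(state, move):
    src, dst = move
    state[dst].append(state[src].pop())

def validate(n, moves):
    """Replay a claimed solution step by step, as a puzzle simulator
    would: return (solved, index_of_first_illegal_move_or_None)."""
    state = initial_state(n)
    for i, move in enumerate(moves):
        if not is_legal(state, move):
            return False, i
        apply_move(state, move)
    solved = state[2] == list(range(n, 0, -1))
    return solved, None

# The optimal 3-disk solution has 2**3 - 1 = 7 moves:
solution = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(validate(3, solution))          # (True, None)
print(validate(3, [(0, 2), (0, 2)]))  # (False, 1): disk 2 can't go on disk 1
```

Because the logic is fixed and only `n` grows, any drop in accuracy can be attributed to complexity rather than to a change in the task itself.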

Personal note

I genuinely enjoy the fact that I learn a ton just by talking to these models. Right now, while setting up this blog post, I'm picking up concepts faster because I'm curious and ask the right questions. It's definitely a blessing and a curse depending on how you use it: great for streamlining tedious stuff I don't love doing (hello, frontend CSS shenanigans 😅), but you need to monitor it carefully. Sometimes these tools overdo it, suggesting unnecessary complexity, setups, extra steps, or redundancies, or breaking things that were working fine. Just the other day I burned 15 bucks in a minute on expensive reasoning tokens (Claude 4 Opus with thinking) and got basically nothing useful in return, which really drives home the importance of "accountability" in token usage and chain of thought for these pricey reasoning models.

The three performance regimes really resonate with my daily experience using AI across different professional levels. For low-complexity tasks (entry-level work), junior folks may lean on it too heavily, and without critical thinking in the loop, the result can be more a painful review process than a learning experience career-wise. The concerning reality is that these routine tasks are increasingly being handled by AI, raising questions about entry points into the job market.

For medium-complexity tasks (senior-level work), people with curiosity and logical thinking who question outputs and pay attention to them, trying to learn along the way, extract the most value. They leverage AI's chain of thought while speeding up productivity by replacing tedious tasks and grasping new concepts at high speed. This is where current reasoning models seem to shine.

But for high-complexity tasks (R&D, protocol design, architectural decisions), even the most advanced roles spend more time giving context, explaining things, and correcting outputs than leveraging AI for groundbreaking algorithms, product ideas, or creative ways to dissect problems and build on the current state of the art. That's where true thinking and intelligence come into play. I don't fully agree that reasoning is the key aspect here, though: reasoning models are still quite useful in their own way, just with clear (and well-known) limitations. AGI remains a mountain to be climbed.


What the Research Revealed

The Exact Computation Problem

Perhaps most revealing was the models' inability to perform exact computation reliably. The researchers found that LRMs:

  • Struggle with precise algorithmic execution
  • Show inconsistent application of learned procedures across similar puzzles
  • Exhibit counter-intuitive scaling behavior, initially allocating more reasoning effort as problems get harder, but then paradoxically reducing effort even when computational budget remains available
  • Display reasoning patterns that look sophisticated but lack mathematical rigor

This suggests that what appears to be "reasoning" might be more akin to sophisticated pattern matching with explanatory narratives attached. The scaling paradox is particularly troubling, as it indicates that LRMs don't just hit a wall at high complexity, they seem to "give up" trying even when they have the resources to continue.

Beyond Benchmarks

The study critiques current evaluation paradigms that focus primarily on final answer accuracy in established mathematical and coding benchmarks. By examining the actual reasoning traces, the researchers uncovered fundamental limitations invisible in traditional metrics:

  • Data contamination concerns: Established benchmarks may not reveal true reasoning capabilities
  • Process vs. outcome: Correct answers can emerge from flawed reasoning processes
  • Scalability insights: Understanding how reasoning quality degrades with complexity
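The "process vs. outcome" point is easy to demonstrate: a grader that checks only the final number will accept a trace whose intermediate steps are wrong. A toy process-level checker (my own sketch, not from the paper) makes the difference visible:

```python
# Toy process-level grader (illustrative sketch, not the paper's code).
# Outcome-only grading checks the last value; process grading verifies
# every intermediate claim in a chain-of-thought style trace.

def grade_trace(steps, expected):
    """steps: list of (expression, claimed_value) pairs.
    Returns (outcome_ok, process_ok).
    eval() is safe here only because the expressions are our own
    arithmetic literals, not untrusted model output."""
    process_ok = all(eval(expr) == claimed for expr, claimed in steps)
    outcome_ok = steps[-1][1] == expected
    return outcome_ok, process_ok

# A trace with a wrong middle step that still lands on the right answer:
trace = [("12 * 7", 84), ("84 + 10", 93), ("93 + 0", 93)]
print(grade_trace(trace, 93))  # (True, False): outcome right, process flawed
```

A benchmark scoring only the first element of that tuple would count this trace as a success, which is exactly the blind spot the researchers are pointing at.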

Study Limitations: What the Authors Acknowledge

Deeper dive, what the authors themselves flag as limits of their study

  • Toy-puzzle scope – The four deterministic puzzles (Tower of Hanoi, Blocks-World, Checker-Jumping, River Crossing) represent only "a narrow slice of reasoning tasks," so findings may not generalise to messy, knowledge-rich real-world problems.
  • Black-box constraint – Experiments use closed-source LRMs through API calls; without access to internals the team can't inspect activations, attention, or training data to pinpoint why failures occur.
  • Simulator assumption – Results rely on step-by-step validation inside perfect puzzle simulators. In open-ended domains such exact verification is impossible, so the same analytical technique may break down.
  • Training-data coverage bias – Some collapse patterns could stem from data scarcity rather than fundamental reasoning limits (e.g., River Crossing instances with more than two people are rare on the web, so models seldom saw them in training).
  • Exact-computation gap – Even when given a full recursive algorithm for Tower of Hanoi, LRMs still hit the same failure threshold, showing that execution/verification, not just search, is brittle.
  • Step-execution brittleness – Providing the algorithm should have reduced compute needs, yet accuracy didn't improve, underscoring limits in following long logical chains faithfully.
  • Open questions left hanging – Authors highlight unsolved issues such as why reasoning tokens abruptly drop near collapse and how to design tasks that probe self-correction and exact arithmetic more realistically.
These caveats show the team is cautious not to over-claim: they position their findings as a controlled stress-test of LRMs, not a sweeping verdict on every form of LLM reasoning.
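The exact-computation gap is striking because the recursive Tower of Hanoi algorithm is only a few lines, and producing its 2^n − 1 moves is pure mechanical execution, which is precisely what the models could not sustain. Here is the standard textbook recursion (my sketch, not the authors' exact prompt):

```python
# Standard recursive Tower of Hanoi: move n disks from src to dst via aux.
# Executing this faithfully yields exactly 2**n - 1 moves; the paper
# reports LRMs still collapse even when handed such an algorithm.

def hanoi(n, src=0, dst=2, aux=1):
    if n == 0:
        return []
    return (hanoi(n - 1, src, aux, dst)   # park n-1 disks on aux
            + [(src, dst)]                # move the largest disk
            + hanoi(n - 1, aux, dst, src))  # bring the n-1 disks over

moves = hanoi(10)
print(len(moves))  # 1023, i.e. 2**10 - 1
```

A few lines of deterministic code produce over a thousand correct moves; the finding is that long, faithful execution of even a known procedure is where the models break down.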


Implications for AI Development

The research reveals several critical considerations for the AI community:

  • Evaluation methodology: Need for benchmarks that test reasoning process quality, not just final accuracy
  • Reliability concerns: LRMs may fail unpredictably on complex real-world problems
  • Algorithm integration: Exploring hybrid approaches that combine neural reasoning with explicit algorithmic execution
  • Capability assessment: Better methods needed to evaluate true reasoning vs. sophisticated mimicry
  • Human oversight: Importance of human collaboration for complex reasoning tasks
  • Overconfidence risks: Models may present confident-sounding reasoning for incorrect conclusions

Community Reactions and Perspectives

The Apple research has sparked significant discussion across the AI community, revealing a spectrum of reactions from skeptical commentary to detailed technical analysis. Two notable perspectives capture the broader conversation:

🎭 The Skeptical Take

henry @arithmoquine · View on X

> be apple
> richest company in the world, every advantage imaginable
> go all in on AI, make countless promises
> get immediately lapped by everyone
> 2 years into the race, nothing to show for it
> give up, write a paper about how it's all fake and gay and doesn't matter anyway

A satirical perspective on the contrast between AI industry hype and subsequent research questioning those same capabilities. The irony isn't lost that this comes from Apple, which has notably lagged behind in the AI race.

🔬 The Technical Analysis

elvis @omarsar0 · View on X

The Illusion of Thinking in LLMs

Apple researchers discuss the strengths and limitations of reasoning models.

Apparently, reasoning models "collapse" beyond certain task complexities.

This measured response focuses on the actual research findings, highlighting the key discovery about complexity-driven performance collapse. It represents the more academic, research-oriented perspective that values systematic investigation over hype.

These contrasting reactions, skeptical commentary versus technical appreciation, reflect the broader tension in AI discourse between hype cycles and rigorous evaluation. The Apple paper serves as a valuable reality check that bridges both perspectives, offering the systematic analysis that technical researchers appreciate while validating the skepticism of those wary of inflated AI claims.


The Bigger Picture

🎭 The Illusion of Understanding

The paper's title, "The Illusion of Thinking," captures something profound about current AI development. The step-by-step reasoning traces that make LRMs so compelling might be creating a false sense of machine understanding. As the authors note:

"LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles."

This raises fundamental questions about the nature of artificial reasoning and whether current approaches are building toward genuine intelligence or sophisticated simulation.

🚀 What Comes Next?

The research suggests several promising directions:

  • Hybrid architectures: Combining neural reasoning with symbolic computation
  • Algorithmic integration: Teaching models to reliably execute explicit algorithms
  • Complexity-aware training: Developing models that understand their own limitations
  • Better evaluation: Creating benchmarks that test reasoning process quality
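The hybrid-architecture direction can be sketched as a division of labor: the model proposes a plan in a structured format, and ordinary code handles exact execution and verification. Everything below is a hypothetical illustration; `propose_plan` is a stub standing in for any LLM call.

```python
# Hypothetical hybrid loop: a model proposes, symbolic code disposes.
# propose_plan is a stand-in for an LLM call returning structured moves.

def propose_plan(n):
    """Stub for an LLM: here it emits the known-correct Hanoi recursion,
    but in practice the output would still need the verification below."""
    def rec(k, src, dst, aux):
        if k == 0:
            return []
        return rec(k - 1, src, aux, dst) + [(src, dst)] + rec(k - 1, aux, dst, src)
    return rec(n, 0, 2, 1)

def execute_verified(n, plan):
    """Symbolic executor: replay the plan under the real rules and
    reject it at the first illegal move, instead of trusting the model."""
    pegs = [list(range(n, 0, -1)), [], []]
    for src, dst in plan:
        if not pegs[src] or (pegs[dst] and pegs[src][-1] > pegs[dst][-1]):
            raise ValueError(f"illegal move {(src, dst)}")
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))

print(execute_verified(4, propose_plan(4)))  # True: plan survives symbolic checking
```

The design choice is that the brittle part the paper identifies, long exact execution, never rests on the model at all: the neural side only has to produce a plan that a cheap deterministic checker can accept or reject.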

Final Thoughts

The Apple researchers have done the AI community a valuable service by systematically examining what happens when reasoning models encounter their limits. As we continue developing AI systems that we increasingly rely on for complex decisions, this kind of rigorous evaluation becomes crucial. The illusion of thinking might be compelling, but understanding when and how it breaks down is essential for building AI we can truly trust.

The full paper is available on Apple's Machine Learning Research website and provides detailed experimental results and analysis for those interested in the technical details.