Artificial Intelligence – Reasoning Debate pt. IV

Apple sticks the knife in AGI, but it has an agenda.

  • Apple has a new paper that once again demonstrates LLMs’ and reasoning models’ inability to truly reason, but it is important to remember the biases and weaknesses that are inherent in this publication.
  • Apple’s new paper (see here) builds on its publication from October 2024 (see here) in that it takes the most recent and advanced “reasoning” models and runs them through a series of reasoning tests.
  • Last time Apple took a simple reasoning benchmark, changed some irrelevant data and observed the models floundering, but this time it is looking at real-world puzzles.
  • These include the Tower of Hanoi, the River Crossing puzzle and so on.
  • These puzzles are all solvable with a single technique that is reused and extrapolated as the task becomes more complicated.
  • This means that if a human can solve the easiest version of the puzzle, then applying the same technique, he or she should be able to solve the most complex version.
  • The number of steps required to solve the puzzle increases exponentially, but crucially, the logic and the process to solve the problem remains unchanged.
  • This is why one can find relatively simple algorithms on the internet that will solve these problems to any level of complexity with 100% accuracy (a short sketch of one such algorithm appears after this list).
  • However, when Claude 3.7 Sonnet, DeepSeek R1, o3-mini (high) and o3-mini (medium) were tested, their ability to solve these problems collapsed after the complexity was increased a few notches.
  • Even when the models were offered the algorithm that could solve the task, they proved incapable of doing so.
  • This is the equivalent of providing a human with a step-by-step guide on how to solve the problem, and the human still failing to get the problem right.
  • To me, this is more evidence that statistically based models are incapable of true reasoning, further supporting my long-held view that these systems have no understanding of causality, meaning that they will never achieve superintelligence or artificial general intelligence (AGI).
  • I suspect that this is happening because the simpler versions of these problems are everywhere and so were very likely to have been included in the datasets upon which these models were initially trained.
  • The more complex versions are much rarer, and so the LLMs had not seen these before and hence, were unable to solve them.
  • If they were able to reason, they should have been able to extrapolate from the simple solution to the complex one, but the data is pretty clear that none of them were able to do this.
  • Hence, I conclude no true reasoning is happening inside these models and that what we are getting is merely a very sophisticated simulation that goes wrong the minute it leaves the zone where it has previously seen data.
  • OpenAI’s own publication on GPT-3 (see here) demonstrated a very similar phenomenon with simple mathematics, leading me to conclude that this flaw is endemic and cannot be designed out of LLMs as it is inherent to their basic design.
  • I continue to view reasoning as a crucial indicator because if I am wrong and somehow these models do start to truly reason, this is a step towards solving the causality problem, which in turn should lead to AGI.
  • As good as the paper is, it has several flaws.
  • First, Apple bias: the main beneficiary, should there be a system-wide crash in LLMs and the AI bubble pop, is Apple.
  • This is because LLM-powered services like ChatGPT and Gemini, combined with Apple’s own inability to come up with something equivalent, are starting to threaten the appeal of iOS to users.
  • If agents become ubiquitous and Apple does not have its own, users will start to question why they are paying a premium for an Apple device, and the whole proposition of premium-priced hardware begins to unravel.
  • Therefore, Apple has everything to gain from poking holes in the AI-takes-over-the-world mantra, which explains why it is Apple that has published this research and not Google, Anthropic or OpenAI.
  • Second, unpublished: this is not a scientific paper, even if it looks like one, and it has little more validity than a press release.
  • Scientific papers are held to a high standard: when they are properly produced, they are peer reviewed and written so that their findings are reproducible.
  • This is why a properly produced scientific paper that has been published in a scientific journal has a high degree of credibility and should be treated with respect.
  • This paper has been subjected to none of this rigour and, as a result, should be treated with the same level of scepticism as everyone else’s claims that their latest model is a step on the road to AGI.
  • Third, human performance: this is the most obvious flaw.
  • Most humans will also see a collapse in performance when they attempt the more complex versions of these puzzles.
  • This is not because they can’t reason, but because they lack the short-term memory required to remember the steps of the puzzle when it gets very complicated.
  • Critics of the paper will rightly point out that if humans also fail the complex versions of the puzzles, then failure alone is not an indication of an inability to reason.
  • It is a fair point, but I think that humans fail because their short-term memory gets lost in the steps, not because they fail to understand the steps themselves.
  • Despite the paper’s flaws, I think that this is further evidence of LLMs’ and “reasoning” models’ inability to reason, strengthening the view that we are not on the way to AGI.
  • Instead, we have discovered a new technology whose usefulness may rival that of the internet in time, offering the opportunity for a lot of value creation as well as new companies and changes in market leadership.
  • I still think that there will be a correction as many valuations remain far higher than fundamentals support, but given the economic opportunity available now, it will not be as bad or as prolonged as the internet crash between 2000 and 2004.
  • While this is bad news for the valuations of the AI companies, it is good news for the human race, as the machines remain way too stupid to decide that humans are a threat and eliminate us all.
  • A loss for the AGI crowd but a win for humans.
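
For context, below is a minimal sketch of the kind of simple, general-purpose algorithm referred to above: the textbook recursive solution to the Tower of Hanoi (an illustration of my own, not code taken from Apple’s paper). The same three-step rule solves the puzzle for any number of discs; only the number of moves, 2^n - 1, grows exponentially.

```python
# Textbook recursive Tower of Hanoi solver (illustrative sketch, not from Apple's paper).
# The logic never changes with puzzle size; only the move count (2**n - 1) grows.

def hanoi(n, source, target, spare, moves):
    """Move n discs from `source` to `target`, using `spare` as a buffer."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the n-1 smaller discs out of the way
    moves.append((source, target))               # move the largest remaining disc
    hanoi(n - 1, spare, target, source, moves)   # restack the smaller discs on top of it

moves = []
hanoi(8, "A", "C", "B", moves)
print(len(moves))  # 255 moves, i.e. 2**8 - 1; the same code handles any disc count
```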

RICHARD WINDSOR

Richard is the founder and owner of the research company Radio Free Mobile. He has 16 years of experience in sell-side equity research. During his 11-year tenure at Nomura Securities, he focused on equity coverage of the Global Technology sector.
