The Meaning of Intelligence and ARC-AGI
Chollet on defining and benchmarking intelligence with ARC-AGI, why AI is really machine reasoning, and colliding views on the path to AGI.
Is AI Actually Intelligent?
With no common definitions of agent, reasoning, copilot, tool use or autonomous, the next year in AI is going to be a lot of people talking past each other!
The discourse on AI often gets tangled up in definitions. We vaguely define terms such as reasoning, intelligence, AI, AGI, and agents, relying on intuition instead of precision. Terms end up overloaded, and disagreement about what they mean ends up driving debates over AI capabilities, whether LLMs can really reason, and when AGI will arrive.
The proliferation of terms and confusion about definitions reflects genuine uncertainty and gaps in our knowledge; it is hard to reach shared understanding amid a muddle of terms.
An example of how definitions drive perspectives comes from two brilliant AI researchers, Francois Chollet and Yann LeCun. Their skepticism that generative AI amounts to human-level intelligence is grounded in doubts about LLMs' ability to truly reason and in their views on what it means to be intelligent.
To LeCun, human-level intelligence requires a ‘world model’ that drives human reasoning; it’s a view of practical grounded intelligence, not merely word generation. LeCun said at a recent talk:
“We need machines that understand the world, that can remember things, that have intuition, have common sense, things that can reason and plan to the same level as humans. … current AI systems are not capable of any of this.”
LeCun sees LLMs' token prediction as a limited skill that is not even “cat-level intelligence” and is incapable of higher-level reasoning.
Chollet similarly views LLMs as limited. LLMs are trained on a lot of data and compress that data into a massive deep-learning-based functional mapping. When they predict the next token, LLMs are engaged in a kind of pattern recognition, not actual reasoning. To Chollet, LLMs are not intelligent because intelligence is about handling novel situations, and LLMs cannot go outside their training data. We’ll describe his perspective below.
Can we clarify our language, establish clearer definitions, and thereby find better common ground? One hopes that is possible, but we must start with the core term, intelligence, and ask the question:
What is intelligence?
Intelligence, a human psychology view
Even human intelligence is an overloaded term, with explanations and definitions still debated. There are competing psychological theories of intelligence to explain a concept we all think we know but can’t quite define with precision. As stated on this psychology website:
Intelligence has been defined in many ways: higher level abilities (such as abstract reasoning, mental representation, problem solving, and decision making), the ability to learn, emotional knowledge, creativity, and adaptation to meet the demands of the environment effectively.
While this captures the general understanding, it is vague, general, and admits multiple definitions.
Ask Gemini to explain human intelligence and you get references to multiple theories of intelligence: Spearman’s g factor (general intelligence), Gardner’s multiple intelligences, the Cattell-Horn-Carroll (CHC) theory that has become the dominant model of human cognitive abilities, and Sternberg’s triarchic theory of intelligence consisting of analytical, creative, and practical intelligence. You’ll also get this summation:
Human intelligence is a multifaceted concept with no single, universally accepted definition. However, it is generally understood to encompass the mental capabilities that allow humans to:
Learn from experience: Acquire, retain, and use knowledge.
Adapt to new situations: Adjust to changing environments and challenges.
Understand and handle abstract concepts: Grasp complex ideas and relationships.
Use knowledge to manipulate one's environment: Solve problems, make decisions, and achieve goals.
AI’s definitional problem is that we cannot precisely define artificial intelligence because human intelligence itself is amorphous and multi-faceted. If we can’t nail down human intelligence, there is even less hope for the artificial kind.
While psychologists cannot clear up all the confusion over intelligence, these definitions at least scope intelligence down to a set of capabilities around learning, adaptation to novel situations, reasoning, and problem-solving.
Chollet’s Measure of Intelligence
Francois Chollet is the author of the Keras deep learning framework and has worked in deep learning at Google for many years, until his recent departure. In 2019, Chollet shared his thoughts about AI and intelligence in the paper “On the Measure of Intelligence,” where he also presented the ARC benchmark, his attempt to measure true intelligence.
In the paper, Chollet formally defined intelligence, drawing on Algorithmic Information Theory, as skill-acquisition efficiency. He restated it in this tweet:
Intelligence is the efficiency with which you operationalize past information in order to deal with the future. You can interpret it as a conversion ratio, and this ratio can be formally expressed using algorithmic information theory.
He developed an equation to describe the concept quantitatively.
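In simplified, paraphrased form (this is not the paper’s exact notation, which is defined over scopes of tasks and curricula), the idea is a skill-weighted ratio:

$$
\text{Intelligence} \;\approx\; \operatorname*{Avg}_{T \,\in\, \text{scope}} \left[ \frac{\text{skill}(T) \cdot \text{generalization difficulty}(T)}{\text{priors}(T) + \text{experience}(T)} \right]
$$

The more skill a system reaches on hard-to-generalize tasks while consuming fewer priors and less experience, the more intelligent it is under this measure.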
Leaving aside the equation, let’s unpack his rather dense written definition: The term “operationalize past information” is another way to say learning. Learning is taking information, incorporating it into your world understanding, so you can apply it in the future, i.e., “in order to deal with the future.”
He restates this as:
Intelligence is the rate at which a learner turns its experience and priors into new skills at valuable tasks that involve uncertainty and adaptation.
We can simplify it further to be:
Intelligence is the efficiency with which you learn.
Intelligence is the Horsepower of Reasoning
While this definition of intelligence mentions learning, it also describes reasoning. Reasoning and learning are similar; both operate on prior information to extend understanding. Reasoning extrapolates from priors, e.g., via logical deduction, while learning incorporates new information into a world model made up of priors.
Thus, more broadly: Intelligence is the efficiency with which you learn and/or reason.
Intelligence is a capacity, one that quantifies the quality, depth, and efficiency of thought: planning, learning, and reasoning. As an analogy, we can compare the intelligence of thought to the horsepower of an engine. Intelligence is the horsepower of reasoning, that is, the functional capacity for learning and reasoning effectively and efficiently.
The ARC Benchmark versus skill-based AI
Chollet critiques the definitions of AI and the benchmarking applied by others. He explains that real intelligence isn't about skill, which can be exhibited via memorizing information or pattern-matching over prior knowledge; it's about being able to reason through novel situations effectively:
… solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to "buy" arbitrary levels of skills for a system, in a way that masks the system's own generalization power.
He believes current LLMs have "near-zero intelligence" despite their impressive skill-based abilities. To him, sophisticated memory and pattern-matching AI systems are not truly intelligent beings.
Chollet created the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) benchmark to test for “true” AGI, benchmarking novelty-based intelligence in AI against human intelligence. While other AI benchmarks, such as MMLU or HumanEval, are skill-based and knowledge-based, ARC-AGI questions require novel spatial reasoning that goes beyond simple pattern-matching or the linear reasoning of an LLM.
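For concreteness, each ARC-AGI task is distributed as a small JSON file: a "train" list of demonstration input/output grid pairs and a "test" list of inputs to solve, where grids are 2-D arrays of integers 0-9 encoding colors. Here is a minimal sketch of checking a candidate rule against a task's demonstrations (the file path and the mirror rule are purely illustrative):

```python
import json

def fits_demonstrations(task_path: str, transform) -> bool:
    """Return True if `transform` reproduces every demonstration output."""
    with open(task_path) as f:
        task = json.load(f)  # keys: "train" and "test"
    return all(transform(pair["input"]) == pair["output"]
               for pair in task["train"])

# Illustrative candidate rule: mirror each row of the grid.
def mirror(grid):
    return [list(reversed(row)) for row in grid]

# fits_demonstrations("data/training/some_task.json", mirror)
```

A rule that fits the few demonstrations must then also produce the correct output for the held-out test input, which is exactly where memorized patterns tend to break down.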
The results are stunning. Even though most humans can solve ARC-AGI problems easily, the benchmark resisted being conquered by AI models for five years, even as other benchmarks became saturated. Leading AI models like GPT-4o and Gemini 1.5 Pro score in the single digits. Even the reasoning-based o1 model scores only 21%, well below human level.
However, the recently published Test-Time Training method achieved a near human-level score of 61% by using inference-time compute to train on each task's examples and variations of them on the fly, helping the AI solve challenging problems. It could be that passing these IQ-like visual puzzles ends up being just another skill to learn, like the games of Chess and Go before it.
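A rough sketch of the test-time training idea follows (this is not the authors' code; the base_model interface with train_step and predict is a hypothetical stand-in for a fine-tunable LLM with a lightweight adapter): at inference time, a throwaway copy of the model is briefly fine-tuned on the task's own demonstration pairs plus simple variations of them, and only then asked to predict the test outputs.

```python
from copy import deepcopy

def augment(pair):
    # Cheap geometric variation (horizontal flip); preserves the underlying
    # rule for many, though not all, ARC tasks (an assumption of this sketch).
    flip = lambda g: [list(reversed(row)) for row in g]
    return [pair, {"input": flip(pair["input"]), "output": flip(pair["output"])}]

def solve_with_test_time_training(base_model, task, steps=8):
    # Fine-tune a copy so the shared base weights are never modified.
    model = deepcopy(base_model)
    examples = [ex for pair in task["train"] for ex in augment(pair)]
    for _ in range(steps):
        for ex in examples:
            model.train_step(ex["input"], ex["output"])   # hypothetical API
    return [model.predict(t["input"]) for t in task["test"]]  # hypothetical API
```

The key design choice is spending extra inference-time compute per task, rather than relying solely on what was baked into the weights during pre-training.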
ARC-AGI has thus far resisted being solved outright by LLMs, and conquering it will likely require a different kind of AI, which has been the benchmark's goal all along.
Does Useful AI Require human-level Intelligence?
“Intelligence is a separate concept from skill and behavior.” – F. Chollet
Reflecting on Chollet’s definition of intelligence yields many questions. One is whether intelligence is the right target to aim at when evaluating AI. Consider this different view of AI:
“AI is the science of making machines capable of performing tasks that would require intelligence if done by humans” – Marvin Minsky
What do you need in an economically useful AI? Many economically valuable, human-level reasoning tasks don't require dealing with novelty. Most require simple reasoning or can be accomplished via acquired skills that amount to a sophisticated form of pattern recognition plus following standard processes. Think of customer service bots or repair automation.
Machine Reasoning
Perhaps asking “What is AI?” and the hunt for “real” intelligence is the wrong question. Perhaps Chollet is right about what intelligence really means, but misses the real question, which is “What does AI actually do?”
What LLMs and similar generative AI models do is map complex inputs into potentially creative, knowledgeable, and reasoning-based outputs in different modalities. A deep learning model is a general structural architecture for abstraction mappings. Transformer-based LLMs are machine reasoning abstraction mapping engines that engage in word manipulation, knowledge compression, and pattern-based reasoning.
Generative AI is about creative generation and reasoning. What AI models do is Machine Reasoning.
Once you define what we call AI by its activity - machine reasoning - it fits into place: Just as machine learning is computer-based learning from data, machine reasoning is computer-based analogs of human reasoning applied to solve complex problems.
Just as machine learning is quite different from how humans learn, so too can machine reasoning be different from human reasoning. Machine reasoning can be developed by training on large datasets and can employ search methods and approaches different from how a human would think.
The Road to AGI
What we have seen is that competing definitions of intelligence lead to competing understandings of AI capabilities, which in turn lead to competing views on what AGI is and when we'll get there. The ARC benchmark webpage expresses the contention:
Defining AGI
Consensus but wrong: AGI is a system that can automate the majority of economically valuable work.
Correct: AGI is a system that can efficiently acquire new skills and solve open-ended problems.
Definitions are important. We turn them into benchmarks to measure progress toward AGI. Without AGI, we will never have systems that can invent and discover alongside humans.
Sam Altman and others have defined AGI in terms of “AI that can do a broad range of human-level useful work.” This skill-based AGI definition is dubious but operationally useful. Definitionally dubious because a bundle of skills can achieve generality without being itself general. Operationally useful because it’s based on capabilities, which is what we care about in AI.
However, the definitional concern around real intelligence is not merely theoretical. Chollet showed with the ARC benchmark that there are problems that really do require novel spatial reasoning, and LLMs largely fail to solve them, falling well short of what humans can do. GPT-4 level LLMs do many things but are not general reasoning engines.
This is a challenge to the “scale is all you need” perspective on AI progress. These problems may not impact how an AI agent for customer service works, but if LLMs cannot actually reason with real intelligence, then LLMs will soon hit a ceiling of capability. The latest news of a ‘plateau’ in frontier AI models could be a sign of such limits.
Unlocking general high-level reasoning to get to AGI will take more than scaling. We'll need algorithmic breakthroughs to get to AGI. The o1 and DeepSeek R1 models are examples of such innovations, and the Test-Time Training result shows how such approaches can achieve higher ARC scores. AI progress towards AGI will continue as we achieve those breakthroughs.
As one pointed critique of the scaling narrative puts it:
People have been rewriting history and saying that "everyone has always believed that LLMs alone wouldn't be AGI and that extensive scaffolding around them would be necessary". No, throughout most of 2023 (the "sparks of AGI" era) the mainstream bay area belief was that LLMs were *already* AGI, and that merely scaling their parameter count and training data size by ~2 OOM without changing anything else would lead to super-intelligence.
Postscript
Francois Chollet shared his views on these topics, discussing Pattern Recognition vs True Intelligence in a recent interview:
Perhaps asking “What is AI?” and the hunt for “real” intelligence is the wrong question. Perhaps Chollet is right about what intelligence really means, but misses the real question, which is “What does AI actually do?”
I can’t say I agree. Chollet is concerned with the claims of many prominent researchers in the field that what we have today can be called intelligent. He pushes for a more intellectually honest debate, and with ARC he wants to stimulate open research at a time when almost everything has become closed-source.
Great article! Although the claim that “Reasoning is a type of learning, and learning is a type of reasoning” seems a bit suspect to me, as I can’t see how something that is a subset (a “type”) of something can also be a superset of it. But then I am neither a mathematician nor a logician, so maybe I’m missing something. Like, a Cadillac is a type of car, but a car is clearly not a type of Cadillac.