Getting to System 2: LLM Reasoning
The gpt2-chatbot mystery, the two paths to LLM reasoning, Think-and-Execute, and AI agents for AI reasoning
The mystery of GPT2-chatbot
A new AI model called “gpt2-chatbot” landed on Lmsys Arena recently, and it’s gotten a lot of buzz for its mysterious origin and high-quality results. Many AI enthusiasts tried it out and found it impressive at math, “better at complex coding” than Opus, and noted that it “Produces CoT-like answers without explicit prompting for such.”
It’s overall quite impressive, albeit slow. Pietro Schirano says of gpt2-chatbot:
Not only does it seem to show incredible reasoning, but it also gets notoriously challenging AI questions right with a much more impressive tone.
So what to make of this AI model? Where did it come from? It was released without documentation or official sponsorship, but there are hints, even beyond the obvious “gpt” name, that it’s an OpenAI model. Its self-declared origin is OpenAI, and Sam Altman himself dropped a trolling hint on X: ‘i do have a soft spot for gpt2.’
The most popular explanation is that the gpt2-chatbot model on lmsys is GPT4.5 getting benchmarked. This fits the evidence:
Improved math, coding and reasoning performance over current LLMs
Consistently claims to have been made by OpenAI and not by others, which is what models trained on ChatGPT outputs would say
Very slow, as slow as GPT-4 at release one year ago
Just as quickly as it arrived, gpt2-chatbot was removed after two days of being played with. Gone. If OpenAI did indeed let their next GPT out into the wild to see how it performed, that would explain the final mystery: it was a little test run.
What further ties this to a possible future OpenAI model is its improved reasoning skill, which Sam Altman has repeatedly said would be a feature of the next GPT iteration.
Getting LLMs to Reason
OpenAI efforts to improve reasoning made the news last November when OpenAI’s Q* leaked, with hints it was a breakthrough. My article “LLM Reasoning and the Rise of Q*” covered Q* and the state of LLM reasoning at that time. Key takeaways were:
LLM reasoning is a hard problem; there are many challenges with LLM cognition.
Since reasoning and complex problem-solving are search-and-exploration problems, algorithms based on search and iterative refinement are the most promising.
These include RL (reinforcement-learning) methods that have been applied to solve other AI problems (like playing Go and path navigation for robots).
Further, we find that “process supervision significantly outperforms outcome supervision” in iterative refinement. Rewarding correct steps in finding a solution is more efficient than just rewarding correct solutions.
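As a toy illustration of the difference (the judge function and reward values here are invented for the example, not taken from the research):

```python
# Toy sketch: outcome supervision scores only the final answer, while process
# supervision scores every intermediate reasoning step. The step judge and
# reward values below are hypothetical placeholders.

def outcome_reward(final_answer: str, correct_answer: str) -> float:
    """Reward based only on whether the final answer is right."""
    return 1.0 if final_answer.strip() == correct_answer.strip() else 0.0

def process_reward(steps: list[str], step_judge) -> float:
    """Reward that scores each reasoning step with a judge function
    (e.g. a trained process-reward model) and averages the scores."""
    scores = [step_judge(step) for step in steps]
    return sum(scores) / len(scores) if scores else 0.0

# Stand-in judge that simply likes steps containing an equation.
toy_judge = lambda step: 1.0 if "=" in step else 0.0
steps = ["Let x be the unknown.", "2x + 3 = 11", "x = 4"]
print(outcome_reward("4", "4"))          # 1.0 -- only the end result matters
print(process_reward(steps, toy_judge))  # ~0.67 -- partial credit per step
```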
Getting LLMs to reason better can be broken into two types of approaches:
Improving the LLM prompt itself to direct the LLM towards better reasoning.
Approaches outside the LLM prompt - reviewing, critiquing, iterating, refining, extending, or otherwise manipulating responses - to generate the full final result from the LLM.
Prompting For Reasoning
In the first category, the classic example prompt has been to ask the LLM to “think step by step,” which is known as chain-of-thought. Simply breaking a problem down into logical steps improves reasoning by reducing the space of possible logical gaps and nudging the LLM to write out its thought process.
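As a minimal sketch of zero-shot chain-of-thought prompting (the `call_llm` helper is a placeholder for whatever completion API you use, not a real client):

```python
# Minimal zero-shot chain-of-thought sketch. Only the prompt construction
# matters here; `call_llm` is a stand-in for an actual LLM client.

def build_cot_prompt(question: str) -> str:
    # The classic zero-shot CoT trigger: ask the model to reason step by step.
    return f"{question}\n\nLet's think step by step."

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
prompt = build_cot_prompt(question)
# answer = call_llm(prompt)  # the response now spells out intermediate steps
```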
The paper “Reasoning with Language Model Prompting: A Survey” looks broadly at the techniques for prompting for reasoning. The authors present many methods, useful in various settings or for different types of requests (see Figure).
For knowledge-based reasoning, incorporating knowledge can enhance reasoning; for math or logic, there are domain-specific structures.
The authors observe:
… high quality reasoning rationales contained in the input context are the keys for reasoning with LM prompting.
Think And Execute
Outside the prompt, there are various mechanisms to assist an LLM’s reasoning performance. One way is to convert a word problem into a more structured form and then solve that. For pure math problems, this can be done by having the LLM write a small program that solves the problem and then executing it.
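A rough sketch of that pattern, sometimes called program-aided prompting (the prompt wording and the `call_llm` helper are placeholders, and a real system would sandbox the execution step):

```python
# Sketch of program-aided solving: instead of asking for the answer directly,
# ask the model to emit a small Python program, then run it locally.

import contextlib
import io

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def solve_with_program(word_problem: str) -> str:
    prompt = (
        "Write a short Python program that computes the answer to the problem "
        "below and prints only the final result.\n\n" + word_problem
    )
    program = call_llm(prompt)
    # Execute the generated program in a scratch namespace and capture stdout.
    # (A real system would sandbox this step.)
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(program, {})
    return buffer.getvalue().strip()
```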
In “Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models”, the authors develop Think-and-Execute, a novel framework that decomposes the reasoning process of language models into two steps: first, in Think, they discover the task-level logic defining the problem and write it as pseudocode; second, in Execute, they simulate the execution of that code.
They found that Think-and-Execute, even though the code is simulated rather than actually executed, results in better reasoning on problems:
Our approach better improves LMs' reasoning compared to several strong baselines performing instance-specific reasoning (e.g., CoT and PoT), suggesting the helpfulness of discovering task-level logic. Also, we show that compared to natural language, pseudocode can better guide the reasoning of LMs, even though they are trained to follow natural language instructions.
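A rough sketch of the two phases as two prompts follows; the prompt wording and helper names are illustrative, not the paper’s actual templates:

```python
# Two-phase sketch in the spirit of Think-and-Execute: the model first writes
# task-level pseudocode (Think), then simulates running it on a specific
# instance (Execute). `call_llm` is a placeholder for an LLM client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def think(task_description: str) -> str:
    """Phase 1: discover the task-level logic and express it as pseudocode."""
    return call_llm(
        "Write language-agnostic pseudocode that solves the following task "
        "for any input instance:\n\n" + task_description
    )

def execute(pseudocode: str, instance: str) -> str:
    """Phase 2: have the model simulate the pseudocode on one instance,
    tracking intermediate values, and report the final output."""
    return call_llm(
        "Simulate the execution of this pseudocode on the given input, "
        "showing intermediate values, then state the final output.\n\n"
        f"Pseudocode:\n{pseudocode}\n\nInput:\n{instance}"
    )
```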
Outside the Prompt
The chart below, from a survey on AI reasoning, presents a broader high-level view of the types of reasoning that can be considered: not just traditional logic and math reasoning, but different modes (visual, multi-modal, and embodied) and other types of reasoning (causal, common-sense, decision-making).
One direct approach to improving LLM reasoning is to train or fine-tune it into the model, using data specifically geared toward better reasoning. The strongest way to do that is reward model-based fine-tuning: train a “judge” model to assess the quality of LLM reasoning steps, then use that assessment as a reward signal to fine-tune the original LLM, directly optimizing it for reasoning processes.
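A highly simplified sketch of that loop, with every component reduced to a placeholder callable (a real setup would use an RL algorithm such as PPO and a trained process-reward model rather than these stubs):

```python
# Highly simplified reward model-based fine-tuning step. Every component is a
# placeholder callable supplied by the caller; nothing here names a real
# library API.

from typing import Callable, List

def reward_finetune_step(
    llm_generate: Callable[[str], List[str]],            # prompt -> reasoning steps
    judge: Callable[[str], float],                        # scores one reasoning step
    update_llm: Callable[[str, List[str], float], None],  # applies the reward signal
    prompt: str,
) -> float:
    steps = llm_generate(prompt)                  # sample a reasoning trace
    step_scores = [judge(s) for s in steps]       # judge each step's quality
    reward = sum(step_scores) / max(len(step_scores), 1)
    update_llm(prompt, steps, reward)             # optimize toward better traces
    return reward
```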
Besides fine-tuning, newer LLM models have improved on reasoning benchmarks thanks to more refined and better data in their pre-training.
With longer context windows, you can also incorporate examples, turning a zero-shot request into a few-shot prompt. AI researchers have found that including relevant logical process steps in the context improves reasoning, just as relevant knowledge improves recall and factuality.
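For example, a zero-shot question becomes a few-shot prompt simply by prepending worked examples that spell out their reasoning (the example content below is made up for illustration):

```python
# Turning a zero-shot request into a few-shot one by prepending worked
# examples that include their reasoning steps.

worked_examples = [
    {
        "question": "A pen costs $2 and a notebook costs 3 times as much. What do both cost together?",
        "reasoning": "The notebook costs 3 * $2 = $6. Together: $2 + $6 = $8.",
        "answer": "$8",
    },
]

def build_few_shot_prompt(question: str) -> str:
    parts = []
    for ex in worked_examples:
        parts.append(f"Q: {ex['question']}\nReasoning: {ex['reasoning']}\nA: {ex['answer']}")
    parts.append(f"Q: {question}\nReasoning:")
    return "\n\n".join(parts)

print(build_few_shot_prompt("A car uses 6 liters per 100 km. How many liters for 250 km?"))
```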
Reasoning Outside the LLM Box - Agents
The two types of approaches to improving LLM reasoning are both about improving how LLMs perform: improving the LLM prompt, which you can think of as the inner loop of the LLM process; and improving things outside the LLM prompt and response, which you can think of as the outer loop of the LLM process.
LLMs have strengths as language processors and calculators that can embed a lot of knowledge, word play, and expressiveness. However, LLMs are inherently sequential, generating one token (or word) at a time, and that makes them natural “System 1” language processors, but not natural System 2 thinkers.
So why not refactor the challenge and “think outside the LLM”? We can scope out to consider getting an AI system (with LLMs as one component) to achieve System 2. AI systems that can flexibly use LLMs in a framework have a name: AI agents.
AI agents (and agent swarms) are a general framework for AI problem-solving - the goal is to produce a final text output, and a range of iterative methods for improving AI results through reasoning are freely available: search, iteration, review, self-critique, tool use, code execution, and more.
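A minimal sketch of such an outer loop - draft, self-critique, refine - assuming only a generic `call_llm` placeholder and illustrative prompts:

```python
# Minimal agent-style outer loop: draft an answer, have the model critique it,
# and revise until the critic is satisfied or a round limit is hit.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def agent_loop(task: str, max_rounds: int = 3) -> str:
    draft = call_llm(f"Solve the following task:\n{task}")
    for _ in range(max_rounds):
        critique = call_llm(
            f"Task:\n{task}\n\nDraft answer:\n{draft}\n\n"
            "List any errors or gaps. Reply 'OK' if the answer is sound."
        )
        if critique.strip().upper() == "OK":
            break  # the critic found nothing left to fix
        draft = call_llm(
            f"Task:\n{task}\n\nDraft answer:\n{draft}\n\n"
            f"Critique:\n{critique}\n\nRevise the answer to address the critique."
        )
    return draft
```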
The paper “The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey” presents the current state of AI Agent architectures and capabilities. The AI agent ecosystem and capabilities are evolving and improving rapidly, as startups, hackers, and big tech offer various proprietary and open source solutions.
Getting to real System 2 systems with high-level reasoning will require agentic AI. A next-generation foundation AI model may incorporate some of these capabilities. However, whether it takes the form of an AI agent swarm or a foundation AI model that absorbs that capability, those systems, and the path to AGI, will be agentic.