LLM Reasoning and the Rise of Q*
How Well Can LLMs Reason? Can Q* solve the reasoning challenge?
The Q* Conspiracy
I wanted to write an article solely on the progress in LLM reasoning, asking the question: How Well Can LLMs Reason? But there’s a conspiracy afoot to make everything in AI about OpenAI lately.
Gossip and rumors have been flying about something brewing at OpenAI, leading to a Reuters article titled "OpenAI researchers warned board of AI breakthrough ahead of CEO ouster." OpenAI staff researchers warned the OpenAI board of a powerful AI discovery, which has internally been called Q* (Q-star).
Reuters reported that Q* is viewed internally in OpenAI as a breakthrough:
Some at OpenAI believe Q* (pronounced Q-Star) could be a breakthrough in the startup's search for what's known as artificial general intelligence (AGI), … the new model was able to solve certain mathematical problems … Though only performing math on the level of grade-school students, acing such tests made researchers very optimistic about Q*’s future success …
So according to some (it's still unclear whether this is true, because the Board's reasoning has been opaque), Altman's ouster was originally precipitated by the discovery of Q* (Q-star), which was supposedly an AGI. The Board (including Ilya) was sufficiently alarmed by both the discovery and the lack of communication from Altman himself (hence the "not consistently candid") that they called the meeting to fire him.
Needless to say, social media is abuzz with what Q* could be and what kind of breakthrough it might be to cause this alarm. What is Q* and has it unlocked the door to AGI?
Before we get into what Q* is, let’s revisit the question motivating Q* in the first place: How well do LLMs reason, and can you get LLMs to develop better reasoning?
How Well Can Today’s LLMs Reason? Not that well
Many AI researchers have assessed the reasoning capabilities of current LLMs and found a number of shortcomings:
LLMs utilize subgraph-matching, not systematic reasoning, to ‘solve’ tasks
"Faith and Fate: Limits of Transformers on Compositionality" investigated complex reasoning in LLMs by testing them on compositional problems, specifically "multi-digit multiplication, logic grid puzzles, and a classic dynamic programming problem. These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer." They found the following:
Our empirical findings suggest that Transformers solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching, without necessarily developing systematic problem-solving skills.
GPT-4 Vision Lacks Spatial Reasoning
A recent paper by researchers from the Santa Fe Institute, "Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks," evaluated GPT-4 and GPT-4V (the vision-enabled version) to see if there were signs of emergent problem-solving capabilities. They tested the models using visual puzzle tasks from the ConceptARC benchmark, akin to the visual and spatial reasoning tasks you might find on an IQ test.
What they found was:
GPT-4V … performs substantially worse than the text-only version. These results reinforce the conclusion that a large gap in basic abstract reasoning still remains between humans and state-of-the-art AI systems.
This follows on from prior work by Moskvichev et al. that found that “GPT-4 had substantially worse performance than both humans and the first-place program in the Kaggle-ARC challenge on these tasks.” The bottom line is that there is still a big gap in reasoning at the LLM level.
LLMs fail to exhibit logical deduction, such as “If A is B, then B is the same as A”
This comes from the paper "The Reversal Curse: LLMs trained on 'A is B' fail to learn 'B is A'." It finds that models do not generalize a prevalent pattern in their training set: if "A is B" occurs, then "B is A" is more likely to occur.
Emergent behaviors in LLMs come from ‘quantized’ capabilities in LLMs
In “The Quantization Model of Neural Scaling” Eric J. Michaud and others propose a “Quantization Model of neural scaling laws” to explain power law scaling and emergent capabilities. “We derive this model from what we call the Quantization Hypothesis, where network knowledge and skills are “quantized” into discrete chunks (quanta). We show that when quanta are learned in order of decreasing use frequency, then a power law in use frequencies explains observed power law scaling of loss.”
Challenges With LLM Cognition
Why do LLMs fail to reason better? Why might scaling not be enough? A simplistic answer is that the loss function for ‘find-the-next-token’ is not aligned with long-term planning and complex multi-step reasoning. Eric J. Michaud on X explains the challenge with higher cognition for LLMs:
LLMs are (pre)trained to minimize next-token prediction loss across human text. Ignoring fine-tuning, RLHF, etc., what they learn will be determined both by (1) what is possible to learn and (2) what is optimal to learn under this objective.
… So why might current models have not learned the algorithms associated with higher cognition yet? It could be that simply (1) they are not possible for the models to learn. But one defense of scaling would be that (2) is the bottleneck: It could be possible that the (loss improvement) ÷ (network capacity) of learning the algorithms/circuits we associate with higher cognition is still worse than for learning many additional pieces of knowledge or statistical patterns in language.
As an LLM scales, it can move from learning basic patterns to more complex capabilities, but the "next token" objective itself may keep greater scale from turning into greater leaps of logic.
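To make the objective in question concrete, here is a minimal sketch (plain Python/NumPy, with made-up probabilities) of what next-token prediction loss is: the average negative log-probability the model assigns to each true next token.

```python
import numpy as np

# Minimal illustration of the next-token objective: cross-entropy is the
# average negative log-probability assigned to each actual next token.
# The probabilities below are made up purely for illustration.

def next_token_loss(probs_of_true_tokens: np.ndarray) -> float:
    """Cross-entropy given p(x_t | x_<t) for each true token in a sequence."""
    return float(-np.mean(np.log(probs_of_true_tokens)))

# The model is rewarded only for predicting each next token well on average;
# nothing in this objective explicitly rewards multi-step planning.
print(next_token_loss(np.array([0.9, 0.05, 0.6, 0.3])))  # ~1.20
```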
Fine-tuning with RLHF (Reinforcement Learning from Human Feedback) can align the LLM response at a higher level. Yet, as the studies on GPT-4 indicate, even this has limitations.
Overcoming the LLM Reasoning Challenge
There have been a number of approaches to getting LLMs to reason better. One simple approach, which has been the genesis of much more, is prompting the LLM to break down the problem: "Let's think step-by-step." Adding that prompt to GPT-4 queries on complex word problems leads to more accurate results.
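As a concrete illustration, here is a minimal sketch of that zero-shot "step-by-step" prompt. The `call_llm` function is a hypothetical placeholder for whatever chat-completion API you use; the appended trigger phrase is the only point being illustrated.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a chat-completion API call."""
    raise NotImplementedError

def solve_step_by_step(question: str) -> str:
    # Appending the trigger phrase elicits an explicit chain of reasoning
    # before the final answer, which tends to improve accuracy on word problems.
    prompt = (
        f"{question}\n\n"
        "Let's think step-by-step, and give the final answer on the last line."
    )
    return call_llm(prompt)

# Example usage (once call_llm is wired to a real model):
# print(solve_step_by_step("If a train travels 60 mph for 2.5 hours, how far does it go?"))
```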
A whole class of techniques started with Chain-of-Thought, which formalized this into an iterative approach. In May, our article "Tree-of-Thought and Building Reasoning AI" covered ways to improve LLM reasoning via "step-by-step" methods like chain-of-thought, including Reprompting and Tree-of-Thought. Iterative multi-step prompting of LLMs for reasoning now also includes Graph-of-Thought.
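The sketch below shows the general pattern behind these iterative approaches (not any specific paper's algorithm): propose several candidate next "thoughts", score them, and expand only the most promising ones. The helpers `propose_thoughts` and `score_thought`, and the `call_llm` stub, are illustrative assumptions.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a chat-completion API call."""
    raise NotImplementedError

def propose_thoughts(state: str, k: int = 3) -> list[str]:
    """Ask the model for k candidate next reasoning steps."""
    return [call_llm(f"Partial solution:\n{state}\nPropose next step #{i + 1}:")
            for i in range(k)]

def score_thought(state: str, thought: str) -> float:
    """Ask the model to rate a candidate step from 0 to 1."""
    reply = call_llm(f"Rate from 0 to 1 how promising this step is:\n{state}\n{thought}")
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0

def tree_of_thought_search(question: str, depth: int = 3, beam: int = 2) -> str:
    # Beam search over partial chains of thought: branch, score, prune, repeat.
    frontier = [question]
    for _ in range(depth):
        candidates = [(state + "\n" + thought, score_thought(state, thought))
                      for state in frontier
                      for thought in propose_thoughts(state)]
        candidates.sort(key=lambda c: c[1], reverse=True)
        frontier = [c[0] for c in candidates[:beam]]
    return frontier[0]  # best-scoring chain of reasoning found
```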
The Genesis of Q*
OpenAI Research had been looking at improving the reasoning of AI models in its "Math Gen" group. In May, this group published a blog post, "Improving mathematical reasoning with process supervision," sharing their work on improving complex multi-step reasoning, which they published in the paper "Let's Verify Step by Step."
They took the 'step-by-step' reasoning approach a bit further by putting a reward function on the chain-of-thought itself. To improve reasoning, they provide feedback to the LLM for each intermediate reasoning step it takes, through what they call 'process supervision,' and conclude:
… process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set.
This work builds a step-by-step reward function and, through it, guides the AI through the reasoning search space.
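A minimal sketch of the distinction, assuming a hypothetical per-step reward model (`step_reward_model`) and an answer checker: outcome supervision scores only the final answer, while process supervision scores every intermediate step, so a guided search can prune bad reasoning chains early.

```python
from typing import Callable, List

def outcome_reward(final_answer: str, gold_answer: str) -> float:
    # Outcome supervision: a single signal for the whole solution.
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def process_rewards(steps: List[str],
                    step_reward_model: Callable[[List[str]], float]) -> List[float]:
    # Process supervision: feedback on every intermediate reasoning step,
    # so credit (and blame) lands on the specific step that earns it.
    return [step_reward_model(steps[: i + 1]) for i in range(len(steps))]

def keep_chain(steps: List[str],
               step_reward_model: Callable[[List[str]], float],
               threshold: float = 0.5) -> bool:
    # A guided search can discard a partial solution as soon as any step
    # scores below the threshold, rather than waiting for the final answer.
    return all(r >= threshold for r in process_rewards(steps, step_reward_model))
```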
Sources say that after this work was done, OpenAI combined their "Code Gen" and "Math Gen" teams to focus on AI model reasoning. “Sutskever began working on Superalignment, but other researchers including Jakub Pachocki and Szymon Sidor used the advance to build Q*.”
The term Q* itself seems to be a combination of Q-learning and A* search, both critical methods in building AI systems that can play games.
Q-learning is a reinforcement learning algorithm for improving the policy decisions in a reinforcement learning framework. It is 'model-free' in that it is not given a model of the environment and its rewards as input; it learns them from experience, and it finds an optimal policy for navigating through the problem to maximize the final reward.
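For reference, here is a minimal tabular Q-learning sketch, assuming a gym-style environment with small discrete state and action spaces (and a simplified `env.step` that returns state, reward, done):

```python
import random
from collections import defaultdict

def q_learning(env, episodes: int = 1000, alpha: float = 0.1,
               gamma: float = 0.99, epsilon: float = 0.1):
    """Learn Q[(state, action)] values from experience, with no model of the environment."""
    Q = defaultdict(float)
    n_actions = env.action_space.n

    def greedy(state):
        return max(range(n_actions), key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
            action = random.randrange(n_actions) if random.random() < epsilon else greedy(state)
            next_state, reward, done = env.step(action)  # simplified step signature
            # Move Q(s, a) toward the observed reward plus the discounted best next value.
            target = reward + (0.0 if done else gamma * Q[(next_state, greedy(next_state))])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q  # acting greedily with respect to Q gives the learned policy
```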
A* search is a heuristic search algorithm for finding the best (shortest) path from a source to a goal. It has many applications in pathfinding and optimization.
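And a minimal A* sketch over an explicit graph, where `graph` maps each node to `(neighbor, edge_cost)` pairs and `heuristic` is an admissible estimate of the remaining cost to the goal:

```python
import heapq

def a_star(graph, start, goal, heuristic):
    """Return (path, cost) for the cheapest path from start to goal, or (None, inf)."""
    # Priority queue ordered by f = g (cost so far) + h (heuristic estimate to goal).
    open_heap = [(heuristic(start, goal), 0.0, start, [start])]
    best_g = {start: 0.0}
    while open_heap:
        f, g, node, path = heapq.heappop(open_heap)
        if node == goal:
            return path, g
        for neighbor, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                heapq.heappush(open_heap,
                               (new_g + heuristic(neighbor, goal),
                                new_g, neighbor, path + [neighbor]))
    return None, float("inf")
```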
These RL methods have been used in agentic AI models, such as DeepMind's MuZero ("Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model," 2020), which combined tree-based search with a learned model to achieve superhuman performance in Go, chess, and shogi and state-of-the-art results across 57 Atari games, all without being given the rules. Critically, MuZero is strong where LLMs are weak, specifically in the area of planning:
MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function.
The Q* algorithm that startled OpenAI is apparently rooted in the approaches above to improving AI model reasoning, starting with a step-by-step approach and a process-oriented reward function. The researchers reportedly found a breakthrough in designing this learning optimization that enabled significantly better results on mathematical reasoning.
Takeaways
So, what is Q*? We don’t know exactly, but here are some takeaways for what it likely means for LLM reasoning and the path to AGI:
LLMs by themselves aren’t good at reasoning. Research on LLM reasoning capabilities has shown surprising cases of both simple failures and interesting successes.
Current frontier AI models, trained on text data that would take a human 20,000 years to read, have already become super-human in some forms of knowledge, but are much weaker at reasoning about what they know.
There are many challenges to overcome to get LLMs to reason better.
The path to AGI is likely not just scale in LLMs, but breakthroughs in handling reasoning. Scale alone cannot make LLMs reason like humans.
Many approaches have been used to bootstrap to better reasoning. The most promising ones decompose complex problems into simpler tasks “step-by-step.”
Another approach to AI model problem-solving is to connect to tools that can solve problems: calculators, Python execution engines, formal reasoning engines. Gorilla, for example, connects LLMs with massive APIs (see the sketch after this list).
Reasoning and complex problem-solving are search-and-exploration problems, amenable to RL (reinforcement learning). Solving a math word problem and navigating a path for a robot both share an aspect of planning.
RL (reinforcement learning) has been the method of choice for reasoning in finite environments and problems, such as game-playing AI that can learn chess, Go, and other games. RL wrapped around deep learning is how these games were solved.
Q* is likely an RL-based method for developing a better reasoning policy. It could be similar to RLHF, as a fine-tuning method, or it could be an agent-like iterative loop that utilizes the LLM as a subroutine to develop a full solution to a complex reasoning problem.
Q* isn’t actually AGI. Nor is AGI imminent. What Q* is likely to be is another incremental step towards better AI reasoning - significant perhaps but partial.
LLMs of today are just the start. AI models of tomorrow will incorporate more complex components to solve higher-level problems.
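On the tool-use point above: a minimal sketch of routing arithmetic to a deterministic calculator instead of asking the model to compute it. The `call_llm` stub and the `CALC(...)` convention are illustrative assumptions, not any particular framework's API.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a chat-completion API call."""
    raise NotImplementedError

def safe_calc(expr: str) -> float:
    """Evaluate a pure arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def answer_with_tools(question: str) -> str:
    # Ask the model to either answer directly or request the calculator tool.
    reply = call_llm(f"{question}\nIf arithmetic is needed, reply exactly CALC(<expression>).")
    if reply.startswith("CALC(") and reply.endswith(")"):
        return str(safe_calc(reply[5:-1]))  # the tool, not the LLM, does the math
    return reply
```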
Postscript - the Bitter Lesson
In the discussions on X, a Richard Sutton essay called “The Bitter Lesson” has been shared. It’s a reminder that the ultimate AI solution will be one that combines scale with search and learning. Computing keeps scaling while human input is a bottleneck. Whether it is via AI-generated synthetic data, or via more automated feedback mechanisms, AI is building on itself.
“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. … One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.” - Rich Sutton
> What Q* is likely to be is another incremental step towards better AI reasoning - significant perhaps but partial.
It's probably wise to downplay this given the hype, but it's hard NOT to see Q* resulting in extremely rapid intelligence improvements in the short-term. You may have already seen Nathan Lambert's writeup at https://www.interconnects.ai/p/q-star , but Q* evokes to me AlphaGo's successful superhuman trajectory enabled by search and self-play with RL.
Excellent writeup!