Signs of a Scaling Slowdown
There was a recent report from The Information that OpenAI’s next big LLM, Orion, is not doing much better than GPT-4 in pre-training. These limited gains beyond GPT-4 are prompting OpenAI to bake reasoning and other tweaks in after the initial model training phase to gain performance. The money quote:
"Some OpenAI employees who tested Orion report it achieved GPT-4-level performance after completing only 20% of its training, but the quality increase was smaller than the leap from GPT-3 to GPT-4, suggesting that traditional scaling improvements may be slowing as high-quality data becomes limited.
There has been public pushback from OpenAI on this, saying the performance improvements are significant, and Noam Brown saying that he was selectively quoted. Indeed, Noam Brown leads the team at OpenAI that developed the o1 model, which uses test-time compute to advance AI model reasoning. His perspective is that we are scaling reasoning with test-time compute and there are more gains to come.
As Amir Efrati, editor at The Information, put it on X:
To put a finer point on it, the future seems to be LLMs combined with reasoning models that do better with more inference power. The sky isn’t falling.
From that perspective, then, this isn’t the end of scaling but a deviation in the path. The o1 model and inference-time compute are the path forward. Or is there more to this?
Reuters is reporting a variation of the same story, suggesting it goes beyond OpenAI:
Behind the scenes, researchers at major AI labs have been running into delays and disappointing outcomes in the race to release a large language model that outperforms OpenAI’s GPT-4 model, which is nearly two years old, according to three sources familiar with private matters.
Reuters mentions that the biggest ‘scale-is-all-you-need’ proponent in AI research, Ilya Sutskever, is looking beyond pre-training:
"Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, told Reuters recently that results from scaling up pre-training - the phase of training an AI model that uses a vast amount of unlabeled data to understand language patterns and structures - have plateaued.
Yam Peleg also writes on X that there are issues beyond OpenAI:
Heard a leak from one of the frontier labs (not OpenAI tbh), they reached an unexpected HUGE wall of diminishing returns trying to brute-force better results by training longer & using more and more data.
Bloomberg contributed their own version of the story along the same lines: OpenAI, Google, and Anthropic are ‘struggling to develop’ advanced AI as their latest AI models fail to meet expectations.
It is not surprising that this is a challenge beyond OpenAI, if it is impacting OpenAI. All the major AI labs are building on similar infrastructure, using similar all-the-web datasets, sharing similar algorithms and ideas, racing towards the same goal. If there is a wall or a steeper climb, they will all hit it eventually.
Explaining the Slowdown
Often what passes for thinking is constructing narratives around a set of facts. When faced with facts that don’t fit a previous narrative, we humans either ignore the facts, paper over the narrative, or adjust our mental maps to the new reality.
The narrative around the AI revolution has been that scaling drives AI progress: scaling the compute, data, and model parameters used in training has improved generative AI model performance across many orders of magnitude. As new data points arise that challenge the ‘scaling’ narrative, there are various responses to update it:
“We are hitting data limits.”
Bindu Reddy suggests the Orion AI model may have hit a wall due to limits on data, or rather the limits of the AI capabilities that can be derived from more data:
Even Open AI employees are saying their new Orion model isn’t a huge leap from o1 and isn’t good at coding. Here is the issue - there is only that much signal in data. This includes raw, synthetic, or manufactured data. There are only that many patterns in the Universe. After that we will hit a wall, and even OpenAI is being forced to acknowledge it.
AI that approximates the inputs will always plateau as it runs out of novel information.
“Inference-time reasoning scaling will replace LLM pre-training scaling.”
Chubby on X summarizes this perspective:
As I understand it, the article rather suggests that normal LLMs without reasoning are slowly reaching their limits. However, this limit does not apply to reasoning methods such as CoT, so we can continue to expect significant improvements such as those from o1 or 2025 from o2.
“What slowdown? We are still making progress!”
The other response has been denial, rebutting the ‘slowdown’ data points with the many signs of continued progress. Haider challenges the lack-of-progress narrative, pointing out the releases in just this past week:
» Desktop ChatGPT app for Windows; macOS app now works with other apps
» Gemini model Exp-1114 leads the chatbot arena.
» Anthropic adds Claude prompt improver.
» Qwen releases the flagship Qwen2.5-Coder-32B-Instruct.
» Nous introduces the Forge Reasoning API
» DeepMind open-sources Alphafold 3
Slow Progress Is Still Progress
Where do we stand on this? All the above factors are at play.
LLM performance scales roughly with the log of the data and compute inputs to pre-training: to improve an LLM linearly, you need to increase training effort exponentially. Even if the next generation of AI models improves less than expected, this is not hitting a wall. Rather, it’s climbing more slowly, as each level of improved performance gets harder to squeeze out than the previous one.
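To make that concrete, here is a minimal sketch in Python, assuming a Chinchilla-style power-law loss curve L(C) = a · C^(-b); the constants a, b and the starting compute budget are invented for illustration, not fitted to any real model. Under such a curve, each further fixed reduction in loss demands a rapidly growing multiple of compute.

```python
# Minimal sketch: why linear quality gains require exponential training effort,
# assuming an illustrative Chinchilla-style power-law loss curve L(C) = a * C**(-b).
# The constants a, b and the base compute budget are made up for illustration.

a, b = 10.0, 0.05             # hypothetical scale and exponent of the loss curve

def loss(compute):
    """Pre-training loss as a function of total training compute (FLOPs)."""
    return a * compute ** (-b)

def compute_for(target_loss):
    """Invert the power law: compute needed to reach a given loss."""
    return (a / target_loss) ** (1 / b)

base = 1e20                   # hypothetical starting compute budget
l0 = loss(base)               # equals 1.0 with the constants above
for step in (1, 2, 3):
    target = l0 - 0.1 * step  # ask for a *linear* improvement in loss
    print(f"cut loss by {0.1 * step:.1f}: needs ~{compute_for(target) / base:,.0f}x the compute")
```

With these made-up numbers, shaving 0.1, 0.2, and 0.3 off the loss needs roughly 8x, 87x, and 1,250x the original compute, which is the ‘harder to squeeze out each level’ dynamic in miniature.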
Increasing data quality and quantity is an ongoing challenge, since frontier AI models have already been trained on massive amounts of the available data, and data curation has already squeezed out most of the quality it can from existing sources. Each additional increment of data has less relative impact than the one before.
We’ve made great progress distilling larger AI models into smaller ones, which improves LLM efficiency but does not extend frontier AI model capabilities. This has compressed AI model capability metrics across model sizes and represents real progress, since the price-performance of AI models improves. However, it may look like stalling because top-line benchmarks aren’t changing as much.
The last few months have seen every leading AI model provider release its best model yet: OpenAI’s o1 model, Anthropic’s latest Claude 3.5 Sonnet, and Google’s newest Gemini 1.5 Pro. We shouldn’t be hasty to declare AI scaling over until we see a slowdown in releases. That slowdown hasn’t happened yet.
The most important recent release, OpenAI’s o1 model, introduced scaling test-time compute to improve reasoning performance; this alternate approach to scaling AI performance may become the key to AGI. Pre-training scaling doesn’t deliver the reasoning enhancements that iterative, feedback-based reasoning methods do, so the way forward may be to pivot from pure scaling to better approaches to reasoning.
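For intuition on what spending more compute at inference time can buy, here is a toy Python sketch of one simple strategy: sample several candidate answers and keep the most common one, in the style of self-consistency voting. This is not how o1 works internally (OpenAI has not published those details); the simulated model and its 40% per-sample accuracy are assumptions purely for illustration.

```python
# Toy illustration of test-time compute scaling via majority voting:
# more samples per question -> higher accuracy, at higher inference cost.
# `sample_answer` is a stand-in for a model call, NOT a real API.

import random
from collections import Counter

def sample_answer(correct="42", p_correct=0.4):
    """Hypothetical model call: right answer 40% of the time, else a distractor."""
    return correct if random.random() < p_correct else random.choice(["41", "43", "44"])

def majority_answer(n_samples):
    """Spend more inference compute: draw n samples, keep the most common answer."""
    votes = Counter(sample_answer() for _ in range(n_samples))
    return votes.most_common(1)[0][0]

random.seed(0)
for n in (1, 5, 25):
    trials = 1_000
    accuracy = sum(majority_answer(n) == "42" for _ in range(trials)) / trials
    print(f"{n:>2} samples per question -> ~{accuracy:.0%} correct")
```

Accuracy climbs as the sample count grows even though the underlying model never changes; that is the sense in which reasoning performance can scale with inference compute rather than with pre-training.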
At the same time, some recent papers are questioning AI models’ actual ability to reason. Part of this ‘pause’ speculation stems from concerns about how to get the next leap in AI reasoning improvements, given the questions around LLM reasoning.
Since the release of ChatGPT, we have become used to a rapid rate of AI model improvement. That’s what makes this an AI revolution. The AI revolution will follow a sigmoid pattern: rapid early progress that is exponential for a while and then gradually slows down.
AI progress will slow down eventually, but not soon. We are still in the early days of the AI revolution. Whether OpenAI’s next GPT is a big leap or a disappointment, AI models still have orders of magnitude (OOMs) of improvement to come.