How Efficient are LLMs?
If the AI revolution is like the industrial revolution in its impact on economy and society, we are in the ‘early steam engine’ days of AI Models.
The early industrial steam engines were enormously valuable and important in shifting work from man to machine, but in the early stages of the industrial revolution, they were far from energy efficient. Starting with inventor James Watt, who made steam engines four times more efficient than previous designs and kicked off the industrial revolution, the history of the steam engine is one of continued improvements in efficiency, power and performance that led to greater use of engines across all industries.
As with steam engines, AI is a technology that displaces human labor with mechanization; it is groundbreaking, powerful, and changes economy and society. Just as steam engine technology went through iterative refinement to expand its capabilities, today’s foundational AI models are undergoing their own ongoing process of refinement.
The difference is that whereas with steam engines the technology evolution stretched over many decades, the rapid iterative improvement in AI model efficiency and capability is happening at a vastly accelerated pace, as new papers, releases, and models arrive each week. Here are some recent examples:
The Orca 13B model built on the work of prior teacher-student LLMs that began with Alpaca just a few months earlier. By curating more explanatory GPT-4 interactions for fine-tuning the student model - an approach called Explanation Tuning - Orca achieved near-ChatGPT-level performance in a model less than a tenth the size, while also addressing limitations and gaps in the reasoning capabilities of prior ‘student’ LLMs.
MosaicML’s MPT-30B outperforms the 175B-parameter GPT-3, a more than five-fold increase in parameter efficiency over that landmark model.
The revelation that GPT-4 is a mixture-of-experts model suggests that the mixture-of-experts architecture is a path to more efficient large foundational AI models.
The rise of highly efficient fine-tuning. Techniques such as LoRA (low-rank adaptation) enable the creation of customized LLMs for specific purposes at extremely low cost; LoRA cuts the number of parameters to be fine-tuned by roughly 10,000 times versus full fine-tuning (a minimal sketch of the idea follows these examples).
Google’s PaLM2 was revealed to be a 340 billion parameter model trained on 3.6 trillion tokens, whereas the original PaLM model (released in 2022) was a 540 billion parameter model trained on 780 billion tokens. This follows the trend, proposed in the Chinchilla paper, of using fewer parameters and training on larger datasets to make more efficient use of training compute budgets.
In the paper “Textbooks are all you need,” the Phi-1 coding model showed that, with careful data curation, it could perform just as well on the HumanEval coding benchmark as models with 10 times the parameters and 100 times the training data.
The efficiency improvement of Phi-1 is stunning, and the lesson from all this progress is clear: There is a lot of room to improve LLM efficiency. AI models can be made far more capable with the same or even a smaller parameter count.
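To make the LoRA example above concrete, here is a minimal sketch of low-rank adaptation: the pretrained weight matrix is frozen and only a small pair of low-rank matrices is trained. This is an illustration of the idea, not the reference implementation from the LoRA paper or any particular library, and the layer size and rank are made-up values.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update: W x + (B A) x."""
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # The pretrained weight is frozen; only the small A and B matrices are trained.
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.T
        update = (x @ self.lora_A.T) @ self.lora_B.T
        return base + self.scaling * update

# Trainable-parameter comparison for a single 4096 x 4096 projection (sizes are illustrative).
layer = LoRALinear(4096, 4096, rank=8)
full = 4096 * 4096
lora = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"full fine-tune: {full:,} params, LoRA: {lora:,} params, {full / lora:.0f}x fewer")
```

Because the low-rank update can be merged back into the frozen weight after training, the adapted model adds no extra inference latency, which is part of why this style of fine-tuning is so cheap to build and deploy.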
Scaling versus Efficiency
Increasing AI Model capability can be accomplished in two ways:
Do more: Scale up the AI model inputs, adding more compute or more training data.
Be more efficient: Improve the efficiency of the training process or of the AI model itself by optimizing how data, parameters and compute are used, yielding a more capable model for the same inputs.
If increasing capability relies solely on scaling, we will face limits on data and the rising costs of adding more compute. On the other hand, the “do more with less” approach has a lot of room to run. As I said earlier in “GPT-4, Experts, Blenders and AI Platforms:”
Scaling is not all you need. We also need more innovation in AI model architectures and datasets to improve them and make them more efficient. Part of those necessary innovations are creative platform approaches to be more efficient in building and serving AI models.
Limits on scaling
What are the limits on the “do more” side? Scaling inputs yields a corresponding increase in outputs, so to get more capability we would need larger datasets and more compute.
Scaling compute: The term ‘compute’ refers to the GPUs, servers and cloud computing systems that are used to train AI models. These can be scaled if we have the hardware (GPU chips and computer systems) and the software and algorithms to parallelize across ever-growing systems.
Ultimately, scaling compute is a matter of cost, and companies like Anthropic have raised billions of dollars on the premise that those costs will necessarily scale. The market for AI is large enough, with potential economic value to humanity running into the trillions, to support such investment. However, as training costs ramp up for larger models, it may become cost-prohibitive for more than a few players to do the most advanced model training.
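As a rough illustration of how those training costs climb, here is a back-of-envelope estimate using the commonly cited approximation that training compute is about 6 x parameters x tokens in FLOPs. The throughput and price figures are assumptions for illustration, not vendor or lab numbers.

```python
# Back-of-envelope training cost using the common approximation: compute ~ 6 * N * D FLOPs,
# where N is parameter count and D is training tokens. Throughput and price are assumptions.

def training_cost_usd(params: float, tokens: float,
                      sustained_flops_per_gpu: float = 3e14,   # assumed ~300 TFLOP/s sustained per GPU
                      usd_per_gpu_hour: float = 2.0) -> float:  # assumed rental price per GPU-hour
    total_flops = 6 * params * tokens
    gpu_hours = total_flops / sustained_flops_per_gpu / 3600
    return gpu_hours * usd_per_gpu_hour

# A Chinchilla-scale run vs. a model 10x larger trained on 10x more data (100x the compute).
print(f"70B params, 1.4T tokens:  ~${training_cost_usd(70e9, 1.4e12):,.0f}")
print(f"700B params, 14T tokens: ~${training_cost_usd(700e9, 14e12):,.0f}")
```

Because compute grows with the product of parameters and tokens, each 10x scale-up in both dimensions multiplies the bill by roughly 100x, which is why only a few players can keep paying for the frontier.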
Scaling Data: We can scale the data used to train models if we can secure more data. Data is the 'new oil' for these informational technologies and is the ultimate limit on training better large models. As with oil, we might squeeze more value out of some sources, but it is ultimately a finite resource.
The largest models have used on the order of several trillion tokens or words. For example, PaLM used 780B tokens, the Chinchilla and LLaMA models 1.4T tokens, and PaLM2 3.6 trillion tokens. They got this far by scraping a significant portion of the public web. There is much more data out there in different forms and in records and data sources not on the public internet, another order of magnitude or more of data.
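To see why data becomes the binding constraint, a rough rule of thumb drawn from the Chinchilla result is that a compute-optimal model wants on the order of 20 training tokens per parameter. The arithmetic below is illustrative, not a claim about any specific model’s actual training recipe.

```python
# Rough Chinchilla-style rule of thumb: ~20 training tokens per parameter
# for compute-optimal training. Purely illustrative arithmetic.
TOKENS_PER_PARAM = 20

for params in (70e9, 340e9, 1e12):
    tokens = TOKENS_PER_PARAM * params
    print(f"{params / 1e9:>6.0f}B parameters -> ~{tokens / 1e12:.1f}T tokens")
```

At this ratio, a trillion-parameter compute-optimal model already wants around 20 trillion tokens, which starts to press against the supply of high-quality text discussed next.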
The ‘useful’ text data available for training may be in the range of 100 trillion tokens or words, well above our current training datasets, but there may be diminishing returns, where the additive value of the data is low. Beyond that, it may be that other modalities, such as videos, will become more important and relevant, in particular as the large foundational AI models go multi-modal.
Phi-1 taught us that we can make great models by restricting ourselves to high-quality data sources like textbooks, and the flip side is that models are more efficient if they avoid low-quality data. When it comes to data, quality may become more important than quantity.
Thus, it may be that we never run out of data, but that the sources of high-quality data are quickly utilized to their maximum extent, and we face diminishing returns on the massive low-quality data that remains to be exploited.
Efficiency is All You Need
If the only way to improve AI models is through scaling, then this is our future: Massive flagship AI models, trained on 50 trillion token datasets at a cost of several billion dollars each. Only a few corporations would be able to create such AGI-level models.
However, this is not our future. Large flagship AI models will get developed, but there will also be many smaller, efficient AI models for specific purposes and with specific advantages in the AI ecosystem. Many of these models will be open source.
It’s no accident that most of the real efficiency innovations I shared are in smaller models. These smaller models can be trained faster and more cheaply. The faster turnaround begets faster innovation loops, which is a huge advantage for research and development. This advantage is further compounded by open source AI research and open source models, where learning is quickly shared.
Doing more with less through efficiency innovations and improvements throughout the stack of AI models will continue:
More efficient training: Better hardware (H100s vs A100s) and better training algorithms that improve the optimization curves during training.
Higher-quality datasets for training: AI may be enlisted to curate datasets and improve them, and explanatory prompt-response pairs, as was done in Orca, can be used to improve reasoning skills.
More efficient inference through various techniques: inference-optimal models that use a smaller number of parameters; use of mixture-of-experts, which activates only a fraction of the model’s parameters per token (a minimal gating sketch follows this list); and pruning larger models down (such as with the Wanda technique) and quantizing them for smaller AI model size on edge devices.
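To show why mixture-of-experts helps at inference time, here is a minimal top-k gating sketch. It is a toy illustration of the general idea, not GPT-4’s routing or any production implementation; the expert count, sizes, and top-k value are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy mixture-of-experts layer: each token is routed to its top-k experts."""
    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network scores each expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        scores = self.router(x)                              # (tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # pick k experts per token
        weights = F.softmax(topk_scores, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

moe = TinyMoE()
tokens = torch.randn(16, 64)
print(moe(tokens).shape)  # torch.Size([16, 64])
per_expert = sum(p.numel() for p in moe.experts[0].parameters())
print(f"expert params total: {len(moe.experts) * per_expert:,}, active per token: {moe.k * per_expert:,}")
```

Total parameter count grows with the number of experts, but each token only pays the compute for k of them; that gap between total and active parameters is the efficiency argument for mixture-of-experts at scale.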
The innovations are happening in real-time and the result will be AI models that are surprisingly powerful while remaining relatively small.
Are there limits on AI model efficiency? There must be, but we haven’t hit them yet.
Just as Carnot developed the thermodynamic theories to explain the efficiency of engines and their theoretical limits, so too there are informational theories that could explain the efficiency of AI models and their theoretical limits. Such theories are a topic for another day.