This article provides an overview of the current state of LLMs. Because this is an ever-evolving field and landscape, this isn’t just a snapshot: I’ll also create an AI Models page to keep track of the latest developments in Foundational AI Models.
LLM Releases Keep Coming
Since the announcement of GPT-4 a mere 6 weeks ago, there has been a dizzying array of LLM announcements, some but not all of which we’ve reported on here.
MosaicML, a company that provides software and cloud infrastructure for training AI models, estimated in 2022 that it costs about $450K to train a model that reaches GPT-3 quality (assuming a 30 billion parameter model trained in a compute-optimal manner on 600B tokens of data). The great economic value in AI models encourages many entrants, so new LLMs continue to be released.
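As a rough sanity check on that figure, the standard rule of thumb for dense transformer training compute is about 6 × parameters × tokens FLOPs. The sketch below turns that into a dollar estimate; the GPU throughput, utilization, and hourly price are my assumptions, not MosaicML’s actual numbers.

```python
# Back-of-the-envelope training cost estimate (all constants are assumptions).
params = 30e9          # 30B-parameter model
tokens = 600e9         # 600B training tokens
train_flops = 6 * params * tokens   # ~6*N*D rule of thumb for dense transformers

peak_flops = 312e12        # assumed A100 BF16 peak FLOP/s
utilization = 0.45         # assumed real-world utilization
price_per_gpu_hour = 2.00  # assumed cloud price in USD

gpu_hours = train_flops / (peak_flops * utilization) / 3600
cost = gpu_hours * price_per_gpu_hour

print(f"{train_flops:.2e} FLOPs, ~{gpu_hours:,.0f} A100-hours, ~${cost:,.0f}")
# With these assumptions the result lands in the same ballpark as the ~$450K figure.
```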
Open Source Models
AI is having its “Linux Moment”, with a number of open-source efforts to create open AI models bearing fruit.
I’m very impressed by RedPajama, an open-source project with many collaborating groups, including ETH Zurich DS3Lab, Stanford CRFM, and the MILA Québec AI Institute. Their goal is to produce a reproducible, fully-open, leading language model, and they started down that path in mid-April by reproducing the LLaMA training dataset of over 1.2 trillion tokens.
Their next step is to train a base model on this data. Meta’s LLaMA paper gives them the roadmap to do that. They are also collecting open-source interaction data, to help with instruction fine-tuning.
The release of Alpaca, a small model based on LLaMA that Stanford fine-tuned on instruction-following examples, led to a number of similar models that used variations on fine-tuning to make capable models in the 6B - 20B parameter range, most of them open-sourced:
Vicuna: a chat assistant fine-tuned from LLaMA by LMSYS
Koala: a 13B parameter dialogue model for academic research by BAIR
FastChat-T5: a chat assistant fine-tuned from FLAN-T5 by LMSYS
OpenAssistant: an Open Assistant for everyone by LAION
ChatGLM: a 6B open bilingual dialogue language model by Tsinghua University
StableLM: Stability AI language models (work in progress, released alpha models)
Dolly: a 6B parameter instruction-tuned open LLM by Databricks
There have been yet further variations on these, including using GPT-4 outputs to fine-tune AI models. Some have been hampered by the non-commercial-use-only restrictions imposed on the base LLaMA models or other copyright limits, so the efforts that build pure open-source models from the ground up, like RedPajama, are likely to be the most fruitful in the long run.
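For a concrete sense of what this instruction fine-tuning looks like in practice, here is a minimal sketch that formats a training example with the Alpaca-style prompt template. The example record is made up, and the exact template wording varies slightly between projects.

```python
# Alpaca-style prompt formatting for instruction fine-tuning (illustrative only).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def format_example(record: dict) -> str:
    """Turn one {instruction, input, output} record into a single training string."""
    return ALPACA_TEMPLATE.format(**record)

example = {  # hypothetical record, in the style of the 52K Alpaca dataset
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Open-source projects released many LLaMA-derived chat models in 2023.",
    "output": "Many open chat models based on LLaMA appeared in 2023.",
}
print(format_example(example))
# A base model (e.g. LLaMA-7B) is then fine-tuned with the standard causal-LM loss
# on tens of thousands of such formatted examples.
```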
Running LLMs Locally
One motivation for open-source models is the desire to run AI models locally. The largest AI models, like ChatGPT, are too big to run locally and require multiple GPUs, but there is a path for smaller models that can run on personal consumer-grade GPUs or even CPUs.
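As a minimal sketch of what “running locally” can look like for a smaller open model, the snippet below uses the Hugging Face transformers pipeline. The checkpoint name and hardware assumptions are mine: a 3B - 7B model in fp16 still wants roughly 6 - 14 GB of memory.

```python
# Minimal local text generation with a small open model (assumes the checkpoint
# fits in your GPU/CPU memory; the model name is just one example of a ~3B model).
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="stabilityai/stablelm-base-alpha-3b",  # assumed example checkpoint
    torch_dtype=torch.float16,                   # halves memory vs. float32
    device_map="auto",                           # GPU if available, else CPU
)

out = generator("The main benefits of running an LLM locally are", max_new_tokens=64)
print(out[0]["generated_text"])
```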
One use-case for a locally-run LLM would be as an AI-enabled browser. Simon Willison showcases the example of Web LLM in “Web LLM runs the vicuna-7b Large Language Model entirely in your browser, and it’s very impressive.” He got Vicuna running in Chrome on his M2 MacBook Pro and was impressed:
It’s really, really good. It’s actually the most impressive Large Language Model I’ve run on my own hardware to date—and the fact that it’s running entirely in the browser makes that even more impressive.
Running LLMs in browsers is a big deal. Just as AI is disrupting search, so too will AI-in-the-browser fundamentally change the browser experience.
This blog post on MLC - machine learning compilation - points out that running LLMs locally on consumer devices provides these benefits: offline support and client-server hybrid use-cases; personalization; and specialization and application integration. If LLMs can be run locally, they’ll find many use cases. The challenge is squeezing large Foundational AI models down to fit on consumer devices.
The MLC group proposes MLC-LLM, using ML compilation in support of the following goal: to enable everyone to develop, optimize, and deploy AI models natively everywhere, including server environments and consumer devices. The MLC approach of model compilation is to “take our ML model as a program, and transform in multiple steps towards the desired form in the target platform.” This is a potentially efficient deployment pipeline for locally-run AI models.
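To make the “transform in multiple steps” idea concrete, here is a deliberately simplified, hypothetical sketch of such a compilation pipeline. The function names are placeholders, not the actual MLC-LLM API; the point is that each stage rewrites the model “program” into a form closer to what the target device can run.

```python
# Hypothetical staged model-compilation pipeline (placeholder names, not the
# real MLC-LLM API): each step rewrites the model toward the target platform.

def quantize(model_ir, bits):
    """Shrink weights (e.g. to 4-bit) so the model fits in device memory."""
    return {**model_ir, "weight_bits": bits}

def fuse_and_schedule(model_ir, target):
    """Fuse operators and pick kernels/schedules for the target backend."""
    return {**model_ir, "target": target, "fused": True}

def package(model_ir):
    """Emit a deployable artifact (e.g. WebGPU shaders plus weights for a browser)."""
    return f"artifact({model_ir['name']}, {model_ir['weight_bits']}-bit, {model_ir['target']})"

model_ir = {"name": "vicuna-7b", "weight_bits": 16}
model_ir = quantize(model_ir, bits=4)
model_ir = fuse_and_schedule(model_ir, target="webgpu")
print(package(model_ir))  # -> artifact(vicuna-7b, 4-bit, webgpu)
```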
Special-Purpose Models
We have seen a number of models announced recently for different languages, nations, and specific applications. We’ve mentioned the Chinese entrants Baidu’s Ernie Bot, Alibaba’s Tongyi Qianwen, and SenseTime’s SenseChat before. Additional entrants are:
IGEL is a 6B instruction-tuned German LLM proof-of-concept, based on BigScience BLOOM and localized for German use-cases.
Phoenix is an “open-source, multilingual, and democratized ChatGPT model”, based on BLOOMZ and fine-tuned on a multi-lingual set of English and Chinese instructions, that excels in various languages with limited resources.
John Snow Labs announced a healthcare-specific LLM called BioGPT-JSL, tuned specifically for medical use-cases such as generating clinical reports.
Diversification of Foundational AI models
There has been an explosion in both the number and type of AI models released in 2023. This Cambrian explosion of diverse AI models has been building since the development of the initial large language models, BERT and GPT-1, in 2018, and the release of and excitement around ChatGPT and GPT-4 this past year is accelerating it.
The paper “Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond” has a fine illustration of the ‘evolutionary tree’ of LLMs developed in this era, showing the type of each released model and the institution behind it.
What stands out is the remarkable evolution and scaling of LLMs in just a few short years, as well as the broad diversity of models, even within the restricted domain of LLMs.
In 2022, this TechCrunch article grouped the emerging models into large language models, fine-tuned language models, and edge language models. The diversity of models has only expanded, and I would group the different LLMs and related models this way:
Flagship Foundational AI Models - best-in-class big LLMs from Big Tech AI companies: OpenAI’s GPT-4, Anthropic’s Claude, Google’s PaLM.
Multi-modal Foundational AI models - extending beyond LLMs to image and embodied input and output: GATO, PaLM-E.
Open-source LLMs - BLOOM, BLOOMZ, Stability AI’s StableLM.
Instruction-tuned LLMs - fine-tuning smaller (6B - 20B) models on the outputs of larger models (such as ChatGPT or GPT-4): Alpaca, Vicuna, Koala, OpenAssistant, ChatGLM, FastChat-T5, Dolly.
Special-purpose vertical LLMs - for finance, medicine, etc.: BloombergGPT (finance), BioGPT-JSL (medical).
Code assistant AI Models: Microsoft’s GitHub Copilot (based on Codex), Replit’s code completion model.
National/other language LLMs - catering to specific countries and languages beyond US/English LLMs: IGEL (German), GigaGPT (Russian), and Chinese LLMs including Baidu’s Ernie Bot, Alibaba’s Tongyi Qianwen, and SenseTime’s SenseChat.
Architecture of the AI Ecosystem
From an architectural or use-case perspective, we can group Foundational AI models into three broad classes:
A top tier of “flagship” general-purpose Foundation AI models, of hundreds of billions to over a trillion parameters, that take tens of millions if not hundreds of millions of dollars to train. Since only a few well-funded Big Tech players can compete in this arena, there will be only a few of these very large and capable models. The AI model makers are in an ‘arms race’ to make the most capable AI model they can for broad use.
A second tier of highly-capable special-purpose LLMs that are still quite useful but not as powerful nor as expensive to create and run. In this category is a broad diversity of models from many organizations - including non-US companies, open source groups, academic institutions, and companies with niche applications. These groups both build models from the ground up (e.g. open-source efforts like BLOOM) and use instruction fine-tuning to create parameter-efficient models (Vicuna) or special-purpose models (BloombergGPT or BioGPT-JSL).
A third tier of Edge LLMs developed to run locally. These parameter-optimized models for local or edge use - Midsize Language Models or Edge LLMs - need to be in the 3B/7B/11B/20B range to fit on consumer hardware, and would be optimized for maximum capability at a minimum parameter/memory footprint. As mentioned above, a good use-case is an in-browser AI model. If an Edge LLM can cache or select other tools or models as needed, akin to HuggingGPT, this can be the basis of a locally-run, user-facing AI assistant. Right now, hobbyists, open source advocates, and AI enthusiasts are paving the way with various interesting solutions.
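To see why that 3B - 20B range matters, a rough memory estimate helps: weight memory is roughly parameter count times bytes per parameter, before counting activations and KV cache. The sketch below is just that arithmetic, under assumed precisions.

```python
# Rough weight-memory estimate for edge-sized models (ignores activations,
# KV cache, and runtime overhead, so real requirements are somewhat higher).
def weight_gb(params_billions: float, bits_per_param: float) -> float:
    return params_billions * 1e9 * (bits_per_param / 8) / 1e9

for size in (3, 7, 13, 20):  # billions of parameters
    fp16 = weight_gb(size, 16)
    int4 = weight_gb(size, 4)
    print(f"{size:>2}B params: ~{fp16:.0f} GB at fp16, ~{int4:.1f} GB at 4-bit")
# e.g. a 7B model is ~14 GB in fp16 but ~3.5 GB quantized to 4 bits,
# which is why quantization is central to running LLMs on consumer devices.
```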
Will we have one dominant AI model that has 90% market share in the same way that Google dominated search? This is still possible, but I doubt it. There isn’t a ‘network effect’ to AI models like with social media, so it’s likely there will be four or more top competitive players in the ‘big league’ flagship Foundation AI model space. It will be more like the car industry than the search engine space.
When it comes to AI models, you want the best model possible, but the best model for one purpose is not necessarily the best model for other use-cases. So it’s likely that we will see continued diversity of models filling out that second tier of specialty LLMs, as well as the third tier. We may see dozens of different options in those tiers.
There is an architectural choice in how foundational AI models are used and deployed in larger AI solutions:
One big model to do it all, or
A collective of many models working together
As I have explored in several prior articles, there are gaps in the capabilities of even the best foundational LLMs, including GPT-4. LLMs are strong in language-related tasks but weak in planning, certain kinds of math, factual knowledge, etc. Going multi-modal won’t fix most of those weaknesses.
It’s possible larger Foundational AI models could yield emergent improvements in planning capabilities, but the cost of a larger AI model versus other ways of importing planning capabilities suggests we neither need nor want to move towards one big monolithic AI model.
Consider the recent result where planning capabilities were enabled for ChatGPT via a plug-in, showing that planning was easier to accomplish in a special-purpose tool. HuggingGPT and ChatGPT plugins point clearly to the power of leveraging special-purpose tools versus a single prompt response from a single model. Even though LLMs generalize better as they grow, there are greater advantages in the iteration and collaboration of different specialized models, in terms of reliability, efficiency, and other factors.
The bottom line: a collective of many AI models of diverse types - flagship Foundational AI Models, special-purpose AI models, Edge LLMs - working together is the likely AI architecture that will be most powerful, reliable, and cost-effective.
A robust near-term AI architecture to take advantage of these various AI models in the ecosystem might look like this: an Edge LLM running locally, perhaps in-browser, acts as the ‘front end’ to the user and as the ‘controller’ (as in HuggingGPT); it is backed by a flagship Foundational AI model running in the cloud, acting as the main workhorse for difficult queries; these models are also connected to vector-database memory and to plug-ins for other AI models and tools, including special-purpose AI models, and are capable of iteration, reflection, and review.
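As an illustrative, entirely hypothetical sketch of that architecture, the Python below shows a local edge model acting as controller, escalating hard queries to a cloud flagship model and drawing on vector-database memory. All of the class and function names are placeholders, not any specific library’s API.

```python
# Hypothetical controller loop for the architecture described above.
# All interfaces (EdgeLLM, CloudLLM, VectorMemory) are placeholders.

class EdgeLLM:
    """Small local model: fast and private; handles easy queries and routing."""
    def classify(self, query: str) -> str:
        # A real controller would prompt the local model to route the query;
        # here a simple keyword check stands in for that decision.
        return "hard" if any(w in query.lower() for w in ("prove", "plan", "analyze")) else "simple"
    def answer(self, query: str, context: list[str]) -> str:
        return f"[edge model answer to: {query}]"

class CloudLLM:
    """Flagship model behind an API: slower and costlier, but far more capable."""
    def answer(self, query: str, context: list[str]) -> str:
        return f"[flagship model answer to: {query}]"

class VectorMemory:
    """Vector-database memory; retrieval is stubbed out for this sketch."""
    def retrieve(self, query: str, k: int = 3) -> list[str]:
        return []

def handle(query: str, edge: EdgeLLM, cloud: CloudLLM, memory: VectorMemory) -> str:
    context = memory.retrieve(query)
    route = edge.classify(query)
    draft = cloud.answer(query, context) if route == "hard" else edge.answer(query, context)
    # A fuller version would add a reflection/review pass here before returning.
    return draft

print(handle("Plan a three-step data migration", EdgeLLM(), CloudLLM(), VectorMemory()))
```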
Postscript
I didn’t go into actual benchmarks on these various models because this overview is already detailed enough, calibrating LLM performance from a user perspective is imprecise, and in most cases the data isn’t there. But this article is a helpful review: The Ultimate Battle of Language Models: Lit-LLaMA vs GPT3.5 vs Bloom vs ….