“AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) trained on broad data (generally using self-supervision at scale) that can be adapted to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character.” - Bommasani et al., 2021
What are Foundation Models?
A Foundation Model is a large deep learning AI model trained on a broad set of data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks. These models are large, broad, and general; they encompass Large Language Models (LLMs) but are not limited to text.
The term Foundation Model was coined by researchers at Stanford’s Institute for Human-Centered Artificial Intelligence (HAI)1, and to study these important models in depth, they formed the Center for Research on Foundation Models (CRFM) at Stanford.2
They coined this new term because AI researchers had created something not seen before when they massively scaled up language models into Large Language Models. They recognized that as these models took on other inputs and became multi-modal, meaning they could combine multiple types of inputs such as images and text, they would become the foundation on which a wide range of downstream AI applications is built.
They identified several key features of Foundation Models:
Scale - These AI models consist of hundreds of billions of parameters, are trained on datasets of trillions of words or data points, and require orders of magnitude more compute to train than anything that came before. They are the Brontosaurus of the AI jungle - massive.
Emergence - Scale is a pathway to new emergent capabilities: as a model gets larger, it exhibits capabilities that are not just quantitatively better but qualitatively new. With scale, transfer learning also improves, meaning that making a model good at one task makes it easier for it to learn other related tasks. These models are good zero-shot and few-shot learners, able to solve a novel task from just an instruction or a handful of examples in the prompt (see the sketch after this list).
Homogenization - Foundation Models will generalize and solve more problems as they scale. In the process, they become less like specialists and more like generalists - more like Swiss Army knives and less like scalpels. We can expect Foundation Models as a class to become more similar to one another, since they are trained in similar ways with similar goals (predict the next word or image element) on overlapping data (all the text and other data you can get your hands on).
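To make the zero-shot and few-shot idea concrete, here is a minimal sketch (not from the original sources) of a few-shot prompt. The task is "taught" entirely inside the prompt with a couple of worked examples, and the model is expected to continue the pattern for the new input; sending the prompt to a model is left abstract here.

```python
# A few-shot prompt: the model is shown a couple of worked examples
# and asked to complete the pattern for a new input. No fine-tuning,
# no gradient updates - the "learning" happens in the prompt itself.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: Positive

Review: "It stopped working after a week and support never replied."
Sentiment: Negative

Review: "Setup took five minutes and it just works."
Sentiment:"""

# A zero-shot version of the same task would drop the two examples
# and rely on the instruction alone.
print(few_shot_prompt)
```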
The Eras of AI - A Brief History
We can divide up the history of AI into several eras: Early years (1940s-1950s); Formal Methods (1960s-1980s); Machine Learning (1990s-2000s); Deep Learning (2010s); Foundation Models (2020s-).
Artificial Intelligence emerged as a field at the dawn of the computing age, in the 1940s and 1950s. Early computing pioneers pondered the question: if we could create machines to do computations, could we get them to actually think, and to think like humans?
The computing pioneer Alan Turing devised a thought experiment to define when a machine has achieved Artificial Intelligence: have a human converse through an interface with an AI; if the human is unable to tell the difference between the AI and another human, then the AI can be considered to have achieved human-level intelligence - it has passed the “Turing Test”.
Formal Methods: 1960s - 1980s
For many decades, the challenge in AI was thought to be “How do you get machines to think?” There was a great effort in the field to answer that question, aiming to get computers to go beyond mere calculation to thought with various formal types of logic-proving, methods of reasoning, and other approaches. Expert systems and rule-based systems were developed to encode human knowledge and automate expertise.
However, these methods never seemed to get us to real AI. We were stuck with humans training, coaxing, and coaching fragile systems. It turns out that asking how to get machines to think was the wrong question. The right question was “How do you get machines to learn?” And the answer? Learn from data.
Machine Learning: 1990s - 2000s
The term machine learning (ML) was coined in 1959 by Arthur Samuel, who described it as “the field of study that gives computers the ability to learn without being explicitly programmed.” Machine learning came into its own in the 1990s and beyond with advances in learning algorithms, such as logistic regression and Support Vector Machines (SVMs)3, that analyze data for classification and regression.
In the internet era, these methods took advantage of more powerful compute and more abundant data to train effectively on large datasets. SVMs in particular were used in the recommendation systems that Amazon, Netflix, and others built to suggest products to their customers.
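As an illustration of this kind of learning from data, here is a minimal sketch of training an SVM classifier, using scikit-learn (a library choice of ours, not something the article specifies):

```python
# Minimal SVM classification example with scikit-learn (illustrative only).
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A small, classic labeled dataset: iris flowers with 4 numeric features.
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a support vector classifier on the training data...
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)

# ...and measure how well it generalizes to unseen examples.
print("test accuracy:", clf.score(X_test, y_test))
```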
Deep Learning: 2010s
If you want to get computers to think like humans, why not figure out how humans think and emulate that? That was the idea behind perceptrons4, an abstraction of the neuron, and behind artificial neural networks. An artificial neural network is a mathematical abstraction of connected neurons in a brain: a connected set of nodes, each with a set of inputs and a single output that is a summation-like function of its weighted inputs.
The power of the neural network is that while each node is quite simple, the network as a whole, as it gets bigger, computes a highly general function over many variables. A sufficiently large neural network can approximate essentially any mathematical function of any number of input variables.
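To show what “a single output that is a summation-like function of weighted inputs” means in practice, here is a minimal sketch of one artificial neuron in NumPy (illustrative, not tied to any particular framework; the numbers are made up):

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: a weighted sum of inputs plus a bias,
    passed through a simple non-linear activation (here a sigmoid)."""
    z = np.dot(w, x) + b             # summation of weighted inputs
    return 1.0 / (1.0 + np.exp(-z))  # squash the result into (0, 1)

x = np.array([0.5, -1.2, 3.0])   # inputs from other nodes (or raw features)
w = np.array([0.8, 0.1, -0.4])   # one learned weight per input
b = 0.2                          # learned bias term

print(neuron(x, w, b))           # the node's single output
```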
The power and generality of the neural network is also its weakness: these networks were computationally expensive and difficult to train. For many decades, progress in this area was slow because neural networks were limited to tiny use cases. Training with gradient descent (backpropagation) addressed the training difficulties.
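Gradient descent in this setting means nudging the weights in whatever direction reduces the error on the training data. A tiny sketch (again NumPy, purely illustrative): fit a single linear neuron y ≈ w*x + b to noisy data by repeatedly stepping down the gradient of the squared error.

```python
import numpy as np

# Toy data generated from y = 2x + 1 with a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0          # start from an untrained neuron
lr = 0.1                 # learning rate: how big each step is

for step in range(500):
    y_hat = w * x + b                 # forward pass
    err = y_hat - y
    grad_w = 2.0 * np.mean(err * x)   # d(mean squared error)/dw
    grad_b = 2.0 * np.mean(err)       # d(mean squared error)/db
    w -= lr * grad_w                  # step against the gradient
    b -= lr * grad_b

print(w, b)  # should land near the true values 2.0 and 1.0
```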
As computing and data scaled, neural networks were applied to ever greater challenges and began to solve problems better than other approaches. As problems scaled, so did the neural networks used to solve them, creating larger and deeper networks. Machine learning became deep learning as it was observed that scaling things larger, including depth of neural network models, could solve harder problems.
The power of deep learning became abundantly clear with the 2012 breakthrough in image recognition made by AlexNet. Developed by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, AlexNet was a deep convolutional neural network (CNN), trained on GPUs, that drastically improved results on the ImageNet image recognition challenge. Deep learning had solved a seemingly intractable problem. The result demonstrated that deep learning was the best-in-class approach to image classification, and that GPUs could further scale up and speed up the training of models.
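For a sense of what AlexNet looks like today, here is a hedged sketch using the AlexNet architecture bundled with torchvision (our choice for illustration; the original 2012 implementation predates PyTorch). It builds the network and pushes one ImageNet-sized image tensor through it:

```python
import torch
from torchvision import models

# The AlexNet architecture as packaged in torchvision.
# weights=None builds the untrained architecture (no download);
# pass weights=models.AlexNet_Weights.DEFAULT for ImageNet-pretrained weights.
model = models.alexnet(weights=None)
model.eval()

# One fake RGB image at the 224x224 resolution AlexNet expects.
image = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    logits = model(image)

print(logits.shape)  # torch.Size([1, 1000]) - one score per ImageNet class
```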
This 2012 result kicked off a frenzy of further development in deep learning, applying it across a range of difficult problems. Deep learning is so powerful that, with enough scale and training data, it can solve the most difficult problems. The hardest nut to crack was understanding language.
Language Models Scale Up
There were many deep learning architectural innovations along the way to make Large Language Models possible, but in a single word, it’s scaling that got us to large, powerful models - scaling on data inputs, scaling compute used in training, and scaling the size of the models themselves.
The LLaMA paper from February 20235 recaps the progress made over the past seven years in scaling language models:
In the context of neural language models, Jozefowicz et al. (2016) obtained state-of-the-art results on the Billion Word benchmark by scaling LSTMs to 1 billion parameters. Later, scaling transformers lead to improvement on many NLP tasks. Notable models include BERT (Devlin et al., 2018), GPT-2 (Radford et al., 2019), MegatronLM (Shoeybi et al., 2019), and T5 (Raffel et al., 2020). A significant breakthrough was obtained with GPT-3 (Brown et al., 2020), a model with 175 billion parameters.
This lead to a series of Large Language Models, such as Jurassic-1 (Lieber et al., 2021), Megatron-Turing NLG (Smith et al., 2022), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022), PaLM (Chowdhery et al., 2022), OPT (Zhang et al., 2022), and GLM (Zeng et al., 2022).
Hestness et al. (2017) and Rosenfeld et al. (2019) studied the impact of scaling on the performance of deep learning models, showing the existence of power laws between the model and dataset sizes and the performance of the system. Kaplan et al. (2020) derived power laws specifically for transformer based language models, which were later refined by Hoffmann et al. (2022), by adapting the learning rate schedule when scaling datasets. Finally, Wei et al. (2022) studied the effect of scaling on the abilities of large language models.
We wrote about the scaling laws of AI here. As you scale Large Language Models, they improve at a predictable rate: loss falls as a power law in model size, dataset size, and the amount of compute used for training (Kaplan et al., 2020).6
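The shape of these scaling laws is simple enough to write down. As a sketch, here is the parameter-count law from Kaplan et al. (2020), using the approximate constants reported in that paper; treat the exact numbers as illustrative rather than something to engineer against.

```python
# Kaplan et al. (2020) parameter-count scaling law: L(N) ~ (N_c / N)^alpha_N.
# The constants below are the approximate values reported in that paper.
ALPHA_N = 0.076      # exponent of the power law
N_C = 8.8e13         # scale constant (non-embedding parameters)

def predicted_loss(n_params: float) -> float:
    """Predicted test loss (cross-entropy, nats/token) for a model with
    n_params non-embedding parameters, assuming data and compute are not
    the bottleneck."""
    return (N_C / n_params) ** ALPHA_N

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```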
If you were to sum up the path of AI through machine learning, deep learning, and Foundation Models in one word, it would be scaling. Scale up compute, data, and parameters, and you can learn anything.
Scaling Yields Emergent Capabilities
When does scaling up change the kind of things that can be done? When does a ‘change in scale’ become a ‘change in kind’? This is happening in AI right now: whole classes of capabilities that did not exist before are emerging with the latest generative AI models.
Where Foundation Models Go From Here
Foundation models are here to stay. They are the culmination of work in AI going back over 60 years. They have capabilities that simply cannot be created any other way, and as we scale further in data and compute, these Foundation Models will get better.
Foundation Models will go multi-modal. The first Foundation Models were LLMs with text-only input and output. However, large multi-modal models such as GATO and PaLM-E have since been released. PaLM-E (an embodied version of PaLM) showed that models built by combining text, images, and embodied data (robotic sensor data) as input could solve some of the most difficult and general tasks in AI and robotics.
LLMs gain superpowers from going multi-modal. GPT-4 accepts both text and image input, which enables the ‘napkin sketch to coded website’ capability that a text-only model like GPT-3 could not offer. There’s no turning back: best-in-class Foundation Models will be multi-modal models, because they will be able to do things no single-domain-input model could do.
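As a hedged sketch of what multi-modal input looks like in practice (the model name and image URL are placeholders, and the exact interface varies by provider), here is a text-plus-image request using the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One user message that mixes a text instruction with an image:
# the napkin-sketch-to-website idea from above.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any vision-capable GPT-4-class model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Turn this napkin sketch of a landing page into HTML and CSS."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/napkin-sketch.jpg"}},
        ],
    }],
)

print(response.choices[0].message.content)
```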
As stated in “Multimodal Language Models: The Future of Artificial Intelligence (AI)”:
Multimodal LLMs combine other data types, such as images, videos, audio, and other sensory inputs, along with the text. The integration of multimodality into LLMs addresses some of the limitations of current text-only models and opens up possibilities for new applications that were previously impossible.
AI’s next frontier is leveraging Foundation Models to build powerful AI solutions. This means building and investing in these Large Language Models to create new applications and to solve big problems.
https://fsi.stanford.edu/publication/opportunities-and-risks-foundation-models
https://crfm.stanford.edu/
Support Vector Machines were developed at Bell Laboratories by Vladimir Vapnik with colleagues (Boser et al., 1992, Guyon et al., 1993, Cortes and Vapnik, 1995).
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory (pp. 144-152). ACM.
The perceptron was invented in 1957 by Frank Rosenblatt, who was an American psychologist. It is a simple type of artificial neural network used for binary classification.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.