GPT-4 Details Revealed
GPT-4: 1.8T parameter mixture-of-experts model trained on 13T tokens and optimized for inference
Breaking news: In the past 24 hours, details were leaked about GPT-4. The information comes from an analysis by Dylan Patel posted here on SemiAnalysis, which put the juicy details behind a paywall. Yam Peleg shared those details on Twitter, but then took down his tweet thread “due to a copyright claim.”
However, his information is still available here, and we will summarize what we know about GPT-4 and what it means. Details are below, but the top-line summary is:
GPT-4 is a mixture-of-experts model, with 16 experts of 111B parameters each.
It took about 2 x 10^25 FLOPs to train, over 13 trillion tokens (counting repeated passes).
An estimated pre-training hardware cost of about $63 million, using 25,000 A100s for almost 100 days of training.
The training and architecture were designed to optimize for inference, and inference costs are about 3 times those of GPT-3 / DaVinci.
Since this is leaked information, it is neither official nor confirmed, but it both confirms and elaborates on prior leaks, and it gives us a helpful guide to what to expect from future large foundational AI models.
GPT-4 is based on Mixture of Experts
GPT-4 has a total of 1.8 trillion parameters across 120 layers, about 10 times the size of GPT-3. It uses a mixture-of-experts (MoE) architecture, with 16 experts of about 111 billion parameters each. This differs from the earlier leak of 8 experts of 222B parameters each, but it adds up to the same total of roughly 1.78 trillion parameters.
MoE Routing: Each forward pass routes to two experts, and there are 55B shared parameters for attention. Each forward pass (the generation of one token) therefore utilizes about 280B parameters and about 560 TFLOPs, whereas a dense 1.8 trillion parameter model would need about 3,700 TFLOPs per inference.
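To make the routing concrete, here is a minimal sketch of a top-2 mixture-of-experts layer of the kind described above. The layer sizes, module names, and structure are illustrative assumptions, not OpenAI’s actual implementation.

```python
# Minimal top-2 MoE routing sketch (illustrative; dimensions and structure
# are assumptions, not OpenAI's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=16):
        super().__init__()
        # One feed-forward "expert" per slot; only 2 of the 16 run per token.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)  # learned gating network

    def forward(self, x):                            # x: (tokens, d_model)
        gate_logits = self.router(x)                 # (tokens, n_experts)
        weights, idx = gate_logits.topk(2, dim=-1)   # top-2 experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# Each token touches only 2 of the 16 experts, so the parameters activated
# per forward pass stay far below the model's total parameter count.
tokens = torch.randn(8, 1024)
print(Top2MoELayer()(tokens).shape)                  # torch.Size([8, 1024])
```

The gain is exactly the one described above: compute per token scales with the two selected experts plus the shared attention parameters, not with the full 1.8T parameter count.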
Why did OpenAI choose to go with MoE but keep a relatively small number of experts? Some commentary on that:
Researchers have shown that using 64 to 128 experts achieves better loss than 16 experts, but that is purely research. There are multiple reasons to go with fewer experts. One reason OpenAI chose 16 experts is that a larger number of experts makes it harder to generalize across many tasks. More experts can also make convergence harder to achieve.
With such a large training run, OpenAI instead chose to be more conservative on the number of experts.
GPT-4 was trained on 13 trillion tokens
More precisely, the total training set was 13 trillion tokens. The 13T tokens are not unique tokens; repeated epochs are counted as additional tokens. They used 2 epochs for text-based data and 4 for code-based data, that is, text data was read twice and code data four times. This implies that about 5-6 trillion unique tokens were in the original dataset.
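A quick back-of-envelope check of that claim; the text/code split below is a hypothetical illustration, not a disclosed figure.

```python
# Back-of-envelope check: 2 epochs over text + 4 epochs over code = 13T total.
# The 5.3T/0.6T split is an assumed illustration, not a disclosed number.
unique_text_tokens = 5.3e12
unique_code_tokens = 0.6e12
total_trained = 2 * unique_text_tokens + 4 * unique_code_tokens
unique_total = unique_text_tokens + unique_code_tokens
print(f"{total_trained / 1e12:.1f}T tokens seen in training")  # 13.0T
print(f"{unique_total / 1e12:.1f}T unique tokens")             # 5.9T
```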
That 13T-token training set is far beyond what was used for the original GPT-3 (roughly 300B training tokens), and more than what Google’s PaLM2 used (3.6T tokens). CommonCrawl and RefinedWeb have on the order of 5T tokens, and there is enough content on Twitter, Reddit, and YouTube to add on top and reach that 5-6T unique token count.
One question is what higher-quality sources were used in the dataset. Ilya Sutskever has mentioned in interviews the importance of high-quality text input and said that there is enough data of that type out there, so they certainly did something along those lines. Peleg shared speculation on what GPT-4 added to the dataset mix:
Some speculations are: LibGen (4M+ books); Sci-Hub (80M+ papers); All of GitHub
My own opinion: The missing dataset is a custom dataset of college textbooks, collected by hand for as many courses as possible.
It’s possible it was a mix of some or all of the above.
GPT-4 Training Details
OpenAI’s pre-training for GPT-4 required about 2.15 x 10^25 FLOPs. That meant running on 25,000 A100s for 90 to 100 days, with a total pre-training hardware cost of about $63 million. The analysis also estimated the cost of the same pre-training compute on H100s at around $22 million.
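These figures hang together reasonably well. Here is a sketch of the arithmetic using the common ~6 FLOPs per active parameter per training token rule of thumb; the utilization and per-GPU-hour cost are assumptions chosen for illustration, not disclosed numbers.

```python
# Rough consistency check of the leaked training figures. The MFU and
# $/GPU-hour values below are assumptions, not disclosed numbers.
active_params = 280e9                 # parameters activated per token (MoE)
train_tokens = 13e12                  # total tokens seen in training
flops_total = 6 * active_params * train_tokens
print(f"training compute ~ {flops_total:.2e} FLOPs")        # ~2.2e25

a100_peak = 312e12                    # A100 BF16 dense peak, FLOP/s
mfu = 0.34                            # assumed model FLOPs utilization
n_gpus = 25_000
days = flops_total / (n_gpus * a100_peak * mfu) / 86_400
print(f"wall clock ~ {days:.0f} days on {n_gpus:,} A100s")  # ~95 days

cost = n_gpus * days * 24 * 1.1       # assumed ~$1.1 per A100-hour
print(f"hardware cost ~ ${cost / 1e6:.0f}M")                # ~$63M
```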
The analysis also mentioned low utilization of the GPUs “due to an absurd number of failures requiring checkpoints that needed to be restarted from.”
Other details on training:
They trained with an 8k token sequence length; the 32k context window was achieved through fine-tuning.
They used 8-way tensor parallelism and 15-way pipeline parallelism in pre-training (a sketch of this layout follows these details).
Significant fine-tuning and RLHF were done, with millions of rows of instruction fine-tuning data from ScaleAI and from internal sources.
Vision in multi-modal GPT-4: The vision part of the model is similar to Flamingo, a “separate vision encoder from the text encoder, with cross-attention.” This adds more parameters and was trained after the text-only pre-training on another roughly two trillion tokens. Data reportedly included various web text, videos with sampled frames, and web-page screenshots.
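As referenced above, here is a sketch of how the reported 8-way tensor by 15-way pipeline parallelism could tile the 120-layer model across the 25,000-GPU cluster; the even layer split and the data-parallel count are inferred for illustration, not disclosed.

```python
# How 8-way tensor x 15-way pipeline parallelism could tile the 120-layer
# model (even split and data-parallel count are inferred, not disclosed).
n_layers = 120
tensor_parallel = 8        # each layer's weight matrices sharded over 8 GPUs
pipeline_parallel = 15     # model split into 15 sequential stages

layers_per_stage = n_layers // pipeline_parallel         # 8 layers per stage
gpus_per_replica = tensor_parallel * pipeline_parallel   # 120 GPUs per model copy
data_parallel = 25_000 // gpus_per_replica               # ~208 replicas in the cluster
print(layers_per_stage, gpus_per_replica, data_parallel) # 8 120 208
```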
GPT-4 Inference Cost and Speed
Running inference on GPT-4 costs about 3 times as much as on the 175B-parameter GPT-3 / DaVinci. This is largely due to the larger clusters required for GPT-4 and the much lower utilization achieved. GPT-4 inference runs on clusters of 128 GPUs, using 8-way tensor parallelism and 16-way pipeline parallelism.
These large AI models face a challenge in inference cost and responsiveness. Generating each token requires loading all of the activated parameters from memory, and the generated token must then be fed back in for the next step. As the parameter count grows, bandwidth requirements soar, more systems are needed, and the communication overhead makes responses slower and more expensive.
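A rough per-token calculation shows why bandwidth dominates for the configuration described above; the FP16 weight size and per-GPU HBM bandwidth are assumptions, and real serving relies on batching and other tricks.

```python
# Why memory bandwidth bounds single-stream decode speed. FP16 weights and
# ~2 TB/s per-GPU HBM are assumptions; batching raises aggregate throughput.
active_params = 280e9                            # parameters read per generated token
bytes_per_token = active_params * 2              # FP16: ~560 GB of weights per token

hbm_bw_per_gpu = 2.0e12                          # ~2 TB/s per A100 (approx.)
aggregate_bw = hbm_bw_per_gpu * 128              # 128-GPU inference cluster

tokens_per_sec = aggregate_bw / bytes_per_token  # ceiling if weight reads dominate
print(f"~{tokens_per_sec:.0f} tokens/s ceiling for a single sequence")  # a few hundred
```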
Inference hits a wall. Dylan Patel puts that limit around 300B parameters:
Effectively there is an inference constraint around 300 billion feed-forward parameters for an 8-way tensor parallel H100 system today.
You can observe that smaller models often respond faster; for example, ChatGPT / GPT-3.5 is faster than GPT-4. Size slows the model down.
What’s the way out of the inference dilemma? Train the AI model to optimize for inference-compute, and use mixture-of-experts to create a sparse model. Dylan Patel makes a point we’ve made before:
The much more important issue with scaling AI, the real AI brick wall, is inference. The goal is to decouple training compute from inference compute. This is why it makes sense to train well past Chinchilla optimal for any model that will be deployed. This is why you do sparse model architecture; every parameter is not activated during inference.
Coincidentally, the sparse mixture-of-experts GPT-4, with two 111B parameter experts activated plus roughly 55B shared parameters, comes in just under that 300B cap. GPT-4 outputs at human reading speed, and it would be prohibitively expensive, if not impossible, to do the same thing with a large dense model.
The A100 inference cost is about $0.004 per 1k tokens, and it was noted that the H100 can cut inference costs in half.
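As a rough sanity check on that per-token figure, here is the serving throughput it would imply for a 128-GPU cluster, reusing the same assumed ~$1.1 per A100-hour rate as in the training sketch above; none of these serving numbers are disclosed.

```python
# Implied serving throughput behind ~$0.004 per 1k tokens on 128 A100s.
# The $/GPU-hour rate is the same assumption used in the training sketch.
n_gpus = 128
cluster_usd_per_hr = n_gpus * 1.1                       # ~$141/hour
usd_per_1k_tokens = 0.004

tokens_per_hour = cluster_usd_per_hr / usd_per_1k_tokens * 1_000
print(f"~{tokens_per_hour / 3600:,.0f} tokens/s across all batched requests")
# ~9,800 tokens/s, plausible only with heavy batching across many users
```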
The GPT-4 Achievement
Some bottom-line conclusions from these GPT-4 reveals:
Good engineering, not secret sauce: In a recent interview, Sam Altman said that GPT-4 didn’t represent any one breakthrough but rather a lot of engineering improvements. That sounded modest, but it seems accurate based on these revelations. OpenAI scaled up existing methods and made fairly conservative engineering decisions to create this model.
Secrets don’t last: When OpenAI declined to reveal the training dataset, the architecture, and other details of GPT-4, I was disappointed. It felt like the era of open source AI research was coming to a close. While not the same as real open-source sharing of AI research, these high-level technical details leaked about PaLM2 and GPT-4 provide guidance for others, including any GPT-4-level open source model efforts. I would urge technology leaders to be more open, so the facts don’t have to be leaked.
Training costs are (relatively) trivial: Given the billions and even trillions of dollars in economic value that AI may bring, a presumed $22 million or so in pre-training compute for a GPT-4-equivalent model on H100s is trivial. There are millions more in additional costs for data acquisition, cleaning, fine-tuning, RLHF, etc., but there are also many AI model training optimizations that will cut the cost further. For large tech companies, a $100 - $200 million project to build a flagship AI model is easy to justify.
GPT-4-level competitors
OpenAI’s GPT-4 was and is a marvel, and should be lauded as the first of its kind, but it’s only a start. Others will follow.
In other breaking news this week, Anthropic’s Claude 2 was just announced, and it performs at or above GPT-4 on some measures, with a larger context window. Google is working on Gemini, which aims to do better than GPT-4 and which some are calling next-level AI.
The engineering approach OpenAI used to build GPT-4 gives a roadmap for many organizations, including open source efforts. The pace of progress in these large AI models will accelerate, and other models with GPT-4-level capabilities, including open source versions, will join GPT-4 soon.