Earlier in October, “The State of AI 2023” report was released. Written by Nathan Benaich and colleagues at Air Street Capital, it’s an in-depth report (of 163 presentation slides) covering a lot of ground in the world of AI this year.
They broke out their findings into categories: research, industry, politics, safety, and predictions. The leading and most important section was the research section, which covered new AI models, training techniques, and related research developments.
We’ve mentioned many of these specific releases and results in prior articles and weeklies, but it’s worth summarizing the report’s findings as a way of reviewing the big picture of AI research and development progress this year.
LLM releases and features
The report mentioned the top-line AI model developments:
GPT-4 crushes every other LLM, and many humans. GPT-4 is multi-modal and the best model across the board. It was reportedly trained on 13 trillion tokens and is a Mixture of Experts model.
GPT-4 also validates the power of reinforcement learning from human feedback (RLHF). Other results on the limits of distilled LLMs show “the false promise of imitating proprietary LLMs, or how RLHF is still king.”
Even so, researchers are rushing to find scalable alternatives to RLHF: Anthropic proposed RL from AI feedback (RLAIF), and Google research showed that LLMs can self-improve.
The GPT-4 technical report, which withholds architectural details, puts the nail in the coffin of openly published SOTA LLM research, marking the closing off of formerly open corporate AI research. Llamas may yet reverse the trend, as Meta continues to release more open AI models.
Meta’s LLaMa set off a race of open(ish) competitive large language models. LLaMa-2, currently the most generally capable and publicly accessible LLM, is competitive with ChatGPT and was trained on 2 trillion tokens.
Context length is the new parameter count. Anthropic Claude’s 100k context window shows value, and techniques like ALiBi show promise in making LLMs better through longer context length.
Lost in the middle: relevant information in the middle of long contexts can get ‘lost’, so that even the best LLMs fail on some knowledge-retrieval tasks.
Increased context length and large datasets require architectural innovations: FlashAttention, FlashAttention-2; Speculative decoding during inference; SWARM Parallelism.
Challenges mount in evaluating state of the art models, as standard LLMs often struggle with robustness.
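Of the inference innovations above, speculative decoding is the easiest to sketch: a small, fast draft model proposes several tokens cheaply, and the large target model verifies them, keeping the longest prefix it agrees with, so each expensive round makes progress on multiple tokens. Below is a minimal greedy-decoding sketch in Python; the `target` and `draft` callables and the toy counting models are hypothetical stand-ins for real LLMs, and production implementations verify the whole draft in one batched forward pass and accept/reject sampled tokens probabilistically.

```python
def speculative_decode(target, draft, prompt, k, n_tokens):
    """Greedy speculative decoding sketch.

    target(seq) -> next token from the large, slow model.
    draft(seq)  -> next token from the small, fast model.
    The draft proposes k tokens; the target keeps the longest
    prefix it agrees with, then adds one token of its own, so
    every round makes at least one token of progress.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < n_tokens:
        # Draft model proposes k tokens autoregressively (cheap).
        ctx = list(seq)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # Target model verifies: accept the longest agreeing prefix.
        accepted = []
        for tok in proposal:
            if target(seq + accepted) == tok:
                accepted.append(tok)
            else:
                break
        # Target supplies one token itself (a fix or an extension).
        accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_tokens]


# Toy "models": the target counts up by one; the draft makes a
# mistake whenever the last token is 2.
def toy_target(seq):
    return (seq[-1] + 1) % 100

def toy_draft(seq):
    return 9 if seq[-1] == 2 else (seq[-1] + 1) % 100
```

When the draft and target agree, a round of one target call per accepted token collapses into far fewer expensive calls; when they disagree, the output is still exactly what greedy decoding of the target alone would produce.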
Comment: While RLHF is still king, it requires human effort; there is a strong desire for scalable alternatives, such as RLAIF. AI-driven policy design (like the recent Eureka) may be the next big thing. Regarding the limits of distilled models, they failed to mention the Orca paper, which found a way around these limits. While Llama 2 set the pace for open AI models, Mistral 7B set a new standard in capabilities for its size among open-source LLMs.
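For context on why RLAIF can slot in so directly for RLHF: both train a reward model on preference pairs, and the only difference is whether the preference label comes from a human annotator or an AI judge. The pairwise (Bradley–Terry) loss at the core can be sketched in a few lines; the function name here is our own illustration, not from the report.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).

    Drives the reward model to score the preferred response higher.
    The same loss applies whether the preference label came from a
    human rater (RLHF) or an AI judge (RLAIF)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

When the chosen response scores far above the rejected one, the loss approaches 0; at a tie it is log 2 ≈ 0.693, so the gradient pushes the scores apart.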
Data for AI Models
It’s unclear how long human-generated data can sustain AI scaling trends. One estimate is that high-quality text data will be exhausted by LLMs by 2026. Videos and data locked up in enterprises are likely up next.
One approach to break the data ceiling is AI-generated content. Some studies show benefits, while others show how models can break down from it; the effects of adding synthetic data remain unclear.
Models like the Phi-1 coding model show that small LLMs trained on high-quality datasets can be very high-quality AI models. Also, applying more compute helps: training for four epochs on the same data is about as good as training on four times the data.
Comment: Going forward, data not compute will become the constraint on scaling AI models. A data-optimal or data-constrained approach would be to run the same data through training as many times as it is effective, which a recent paper shows might be around four times (4 epochs).
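A back-of-the-envelope way to apply that rule of thumb is below. The four-epoch cap is our crude simplification of the smooth diminishing-returns curve reported in the data-constrained scaling work, not an exact formula from it.

```python
def effective_training_tokens(unique_tokens: int, epochs: int,
                              max_useful_epochs: int = 4) -> int:
    """Approximate value of repeated data for LLM training.

    Epochs beyond roughly four add little, so cap the effective
    token count there. A deliberate simplification of the
    data-constrained scaling results, for intuition only."""
    return unique_tokens * min(epochs, max_useful_epochs)
```

For example, a 1-trillion-token dataset trained for four epochs is roughly as valuable as 4 trillion unique tokens, while an eighth epoch over the same data adds almost nothing.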
AI Models for Coding
The leading LLM for coding is GPT-4, with Code Interpreter (now Advanced Data Analysis) “leaving users in awe”. Open alternatives like WizardLM’s WizardCoder-34B and Unnatural CodeLLaMa hold up against ChatGPT on coding benchmarks.
Multi-Modal AI Models for Robotics
They mentioned multiple pioneering multi-modal AI models for robotics:
RT-2: Vision-language models can be fine-tuned all the way to low-level policies showing impressive performance in manipulating objects.
PaLM-E is Google’s 562-billion parameter, general-purpose, embodied generalist model that extended the PaLM model for robotics. It is trained on vision, language and robot data.
LINGO-1 is Wayve’s vision-language-action model that provides driving commentary, such as information about driving behavior or the driving scene.
RoboCat (built on DeepMind’s Gato) is a foundation agent for robotic manipulation that can generalise to new tasks and new robots in zero-shot or few-shot (100-1000 examples).
Swift is an autonomous system that can race a quadrotor flying drone at the level of human world champions using only onboard sensors and computation. It won several drone-racing contests against three human champions, a first for a robot.
Image and Video Generation
Text-to-video generation race continues: VideoLDM is a latent diffusion model capable of high-resolution video generation (up to 1280 x 2048!). MAGVIT is a masked generative video transformer.
Instruction-based image editing assistants such as InstructPix2Pix were developed.
A new NeRF contender, 3D Gaussian splatting, shows impressive quality with real-time rendering.
NeRF-based generative models have improved in speed and quality (see HyperDiffusion, MobileNeRF, Neurolangelo and DynIBAR) and can model 3D geometry. They are promising for large scale creation of 3D assets. Instruct-Nerf2Nerf edits NeRF scenes based on instructions.
Meta’s “Segment Anything” model (SAM) achieves zero-shot capabilities via prompting on many image query tasks and outperforms existing SoTA on 70%+ of cases measured.
DINOv2 is a self-supervised Vision Transformer model from Meta, producing universal visual features that can be used across a variety of image level (e.g. classification) and pixel level (e.g. segmentation) tasks.
In Audio, new models from Google, Meta, and the open source community significantly advance the quality of controllable music generation.
Bio-Medical AI Models and Research
The report mentions bio-medical uses of LLMs and diffusion models that drive real-world breakthroughs in molecular biology, medicine and drug discovery:
Diffusion models design diverse functional proteins from simple molecular specifications. Inspired by their success in generative modelling of images and language, diffusion models are now applied to de novo protein engineering.
Atomic-level protein structure can now be predicted directly from amino acid sequences without relying on costly and slow multiple sequence alignment (MSA). Evolutionary Scale Modeling-2 (ESM-2) is used to characterize the structure of more than 617 million metagenomic proteins, offering significant speedups compared to AlphaFold-2 (AF2).
Graph-enhanced gene activation and repression simulator (GEARS) predicts the outcome of perturbing multiple genes without a cell-based experiment: it combines prior experimental knowledge with deep learning to predict the gene expression outcome given unperturbed gene expression and the applied perturbation.
Google’s Med-PaLM 2 language model performs at a medical-expert level, passing USMLE-style medical exams, and evaluation panels preferred its answers to those of real physicians.
Med-PaLM goes multimodal. The system exhibits novel emergent capabilities, such as generalization to novel medical concepts and tasks with image input. An alternative lighter-weight approach, ELIXR, grafts language-aligned vision encoders onto a fixed LLM to answer questions about medical images.
A SOTA pathology language-image pretrained model from medical Twitter: This work used text-image pairs on Twitter to build a dataset (OpenPath) and model (PLIP) for diagnostics from images.
Complementarity-Driven Deferral to Clinical Workflow (CoDoC) learns to decide whether to rely on a predictive AI model’s output or defer to a clinical workflow instead.
AI for science: medicine is growing fastest but mathematics captures the most attention.
Conclusion
What’s the high-level takeaway? GPT-4 is the best AI model released in 2023 (thus far), and progress in AI research and innovation is greater and faster than ever across all areas. Researchers are developing new and better LLMs and multi-modal AI models at a very fast pace, with rapid improvements in the entire AI model development pipeline. With AI growing exponentially in its capabilities and impact, 2023 is the Year of AI.
With so much going on in AI, any attempt at summarization, including this article itself, risks glossing over important details. For that reason, it is hard to critique a report that is already so lengthy for slighting some aspect of AI progress or missing some innovations this year. They did a good job drinking from the fire hose of AI progress.
While hitting the highlights, they under-reported progress in fine-tuning and distilling AI models and glossed over some work in AI image and video generation, so their coverage wasn’t complete. Yet for the most part, they covered the most important items and captured where AI progress and challenges stand.
Perhaps the most important AI models are those devoted to bio-medical applications, so it was appropriate for them to highlight progress there. AI will massively reshape and hopefully improve medical care.
Their final slide on research notes that over 70% of the most-cited AI papers in the last three years have authors from US-based institutions and organizations. US-based AI labs at Google, Meta, and Microsoft and leading universities lead the way as the engines of AI technology development. China is a distant second, the UK third, and every other nation far behind. Team USA will not stop.
We are in the early innings of the AI revolution, and the pace of AI progress will not slow down. If 2023 is the year of AI, then the 2020s are the decade of AI. The report also makes 12-month predictions, but we will deal with the future of AI another day; keeping up with the present of AI is enough for now.