Today, Stability AI released a new open-source language model, StableLM.
The Alpha version of the model is available in 3 billion and 7 billion parameters, with 15 billion to 65 billion parameter models to follow. Developers can freely inspect, use, and adapt the StableLM base models for commercial or research purposes, subject to the terms of the CC BY-SA-4.0 license.
The release of StableLM builds on Stability AI's experience open-sourcing earlier language models with EleutherAI, a nonprofit research hub. These language models include GPT-J, GPT-NeoX, and the Pythia suite, which were trained on The Pile open-source dataset. Many recent open-source language models continue to build on these efforts, including Cerebras-GPT and Dolly-2.
A few impressive and interesting things about this:
Trained on a new version of The Pile, with 1.5 trillion tokens of content, three times the data in the original Pile dataset. This is more data than was used to train GPT-3.
They have released alpha checkpoints of the 3B and 7B models (still training), with a roadmap for 15B, 30B, 65B, and 175B models. No release schedule has been indicated.
They are available to run on Hugging Face Spaces. Hugging Face has a deal with Amazon AWS, so these models (like many others) will be available to run in the cloud on AWS SageMaker.
Using the AWS/Hugging Face ecosystem, you will be able to take these open-source LLMs as base models and fine-tune them further for your own applications (see the loading sketch after this list).
They are releasing instruction fine-tuned models that use “a combination of five recent open-source datasets for conversational agents: Alpaca, GPT4All, Dolly, ShareGPT, and HH.” License limitations on those datasets make the tuned models available for research use only; hopefully, more permissively licensed data can remove that restriction.
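As a concrete illustration, here is a minimal sketch of loading one of the alpha checkpoints with the Hugging Face transformers library. The model ids are assumptions based on the repository names published on the Hub, and the chat-style role tokens shown for the tuned variant are an assumption about its prompt format; check the model cards before relying on either.

```python
# Minimal sketch: run a StableLM alpha checkpoint locally with transformers.
# Model ids and the tuned-variant prompt format are assumptions; verify them
# against the Hugging Face model cards.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-base-alpha-7b"    # assumed base checkpoint
# model_id = "stabilityai/stablelm-tuned-alpha-7b" # research-only tuned variant

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 7B model fits on one GPU
    device_map="auto",          # requires `accelerate`; places layers on available devices
)

prompt = "Open-source language models matter because"
# For the tuned variant, the request would instead be wrapped in its chat format, e.g.:
# prompt = "<|SYSTEM|>You are a helpful assistant.<|USER|>Explain CC BY-SA 4.0.<|ASSISTANT|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same checkpoints should be deployable through SageMaker endpoints via the Hugging Face/AWS integration, but the snippet above only shows the plain local path.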
Open Source LLMs
These LLMs are released under CC BY-SA-4.0, an open-source “Attribution-ShareAlike” license.
When OpenAI moved away from openness and transparency with their GPT-4 release, declining to disclose architectural details, training datasets, or even how many parameters were in the model, it seemed the era of open AI research might be over. That shift created tension in a field that had been, until recently, remarkably open.
Today’s release by Stability AI shows a commitment to open source that suggests open models will have at least a seat at the LLM table. They will play a role.
Can they compete with the best LLMs, though? The resources needed to build the largest models are massive, and with each model generation they get more daunting. It’s like the capital needed to build a state-of-the-art semiconductor fab, which these days can run close to $20 billion! Our back-of-the-envelope calculation for a prospective GPT-5-level AI puts the compute cost in the hundreds of millions of dollars.
It’s hard to see how open source efforts can compete with that scale. This goes back to the conundrum OpenAI faced as a non-profit research organization, and why they evolved into a quasi-for-profit AI company.
But there’s another way. Here’s what Stability AI CEO Emad Mostaque said on Twitter in late March, when he was challenged for signing the “Pause” letter even while Stability AI worked on building AI models:
We are not and do not want to train very large language models as our focus is on swarm not general intelligence - Emad Mostaque, StabilityAI CEO
Even though Stability AI has a commitment to open source, they are a commercial startup with over $100 million invested and a $1 billion valuation. They have the resources to build significant models in this space, and Emad Mostaque has the vision to be the supplier of open-source foundation models to the world. So I believe they have the capability to drive these models to fairly high quality.
We may see a future bifurcation between the largest commercial LLMs, like GPT-4 and Claude+, and a variety of smaller, more nimble, more specific LLMs. The latter will operate in a swarm of AI models that fill useful specific roles individually and provide broad and powerful AI collectively. More on how that will work in a follow-up.
Benchmarks And Fine-Tuning
Some feedback on Twitter suggests bad benchmark results:
“Is this why the new StableLM didn't post any benchmarks? I just ran one for fun and it got some pretty bad numbers compared to older models...”
Scores on MMLU were bad, at a level comparable to sub-1B-parameter models. Another user reported poor results:
“Okay after playing with stability's model myself, the 7B model, I think... Its bad news … It's been relatively incoherent. The text flows right, but the logic in the text just doesn't make sense a lot.”
Others point out this is an alpha checkpoint, and even though 800B tokens of training have been done, it’s hard to draw definitive conclusions just yet. We may need to let the model’s training complete before evaluating it further. It seems … not ready.
Twitter user Sir_deenicus notes:
“Llama is strong relative to what's openly available but is a GPT3 type, ie rel weak at reasoning/code. Models derived from codex approach (code-cushman, code-davinci, chatgpt, likely gpt4) have higher ‘fluid intelligence’/incontext learning vs every other model, PaLM included.”
“Somehow there's this misconception in wider LLM community that fine-tuning is what makes a model good. No, what that does is teach model to keep on task and not wander off randomly in the middle”
The point is that you need a certain level of quality in the model from token-by-token pre-training before you can fine-tune the LLM to be ‘instruction-following’. Pre-training gets you to a certain level; instruction-following fine-tuning is then more about teaching the model to keep to the right global context and stay on track. Fine-tuning cannot yield more emergent capabilities by itself.
The Alpaca result showed, however, that you can go quite far by fine-tuning smaller models on interaction data generated by higher-quality, larger models, as the sketch below illustrates.
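To make that concrete, here is a minimal sketch of the pattern using the Hugging Face Trainer: an assumed small StableLM base checkpoint fine-tuned on the publicly released Alpaca instruction data, which was itself generated with a larger OpenAI model. The model id, dataset choice, prompt template, and hyperparameters are illustrative assumptions, not the recipe any of the projects above actually used.

```python
# Minimal sketch of Alpaca-style supervised fine-tuning: train a small open
# base model on instruction/response pairs produced by a stronger model.
# Model id, dataset, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "stabilityai/stablelm-base-alpha-3b"           # assumed base checkpoint
data = load_dataset("tatsu-lab/alpaca", split="train")   # ~52k pairs generated by a larger model

tokenizer = AutoTokenizer.from_pretrained(base_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token            # needed for batch padding
model = AutoModelForCausalLM.from_pretrained(base_id)

def format_and_tokenize(example):
    # Flatten each record into one prompt/response training string.
    text = (f"Instruction: {example['instruction']}\n"
            f"Input: {example['input']}\n"
            f"Response: {example['output']}")
    return tokenizer(text, truncation=True, max_length=512)

tokenized = data.map(format_and_tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="stablelm-sft",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=2e-5,
        fp16=True,
    ),
    train_dataset=tokenized,
    # Causal-LM collator (mlm=False) derives labels from the input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice, parameter-efficient methods such as LoRA are often used instead of full fine-tuning so the job fits on a single GPU, but the structure of the recipe is the same.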
LLaMA’s 65B model showed close-to-GPT-4-level results. I am looking forward to comparable, fully open-source AI models of that calibre that can be freely used and customized further. It may be premature to presume much, given the low quality of the alpha release models. However, having high-quality open-source foundation AI models available in a broader, open AI ecosystem will change the world.
Postscript - The legal status of AI model input data
A Hugging Face community user reacting to the StableLM release made a plea: Clarity Needed for Commercial Licensing. Why are LLaMA and Alpaca restricted to “research only” use? Blame lawyers and the vagueness of copyright law; we are in uncharted waters as to whether training data is ‘fair use’ or not. As Yann LeCun put it:
"The world needs high performance open source LLM. The main obstacle today is the legal status of the training data." — Yann LeCun
Surely a recording artist’s voice likeness recreated by AI is clear mimicry and a copyright violation. You can train a diffusion model on a specific artist’s work to recreate their style; if that artist is living, they’d have a claim to make.
At the other end of the scale, the public-domain works of Shakespeare, Descartes, and Dickens are free and fair to use. However, 99% of the data in The Pile v2 is internet-era data likely under some form of copyright, with the majority of it being Reddit and other social media commentary, code, and blog posts made in recent years.
And what about the high-quality trove of every science journal article written in the last 40 years? The really high-quality data that could make really high-quality AI models is mostly not open. If anything stymies the advance of open-source AI models, this could be it.