AI Week In Review 24.07.20
GPT-4o mini, Mistral NeMo, Apple DCLM 7B, AuraFlow 0.1, Claude app for Android, Pinokio 2.0, Karpathy starts Eureka Labs, Fei-Fei Li starts World Labs, Meta Llama 3 405B coming
TL;DR: The AI models released this week - GPT-4o mini, Mistral NeMo, Apple’s DCLM - show that AI innovation isn’t happening only at the largest frontier models; it’s also about making smaller AI models better and more efficient. Mistral NeMo and DCLM are also open AI models, showing that open AI models are staying competitive.
AI Tech and Product Releases
OpenAI released GPT-4o mini, a cost-efficient multi-modal LLM that scores 82% on MMLU, near GPT-4o levels. It currently supports text and image input with text output, with video and audio input and output planned; it has a context window of 128K tokens and supports up to 16K output tokens per request.
GPT-4o mini sets a new standard on cost and performance: it beats Gemini 1.5 Flash and Claude 3 Haiku on benchmarks, yet an API call costs only $0.15 per million input tokens and $0.60 per million output tokens. This is an incredible 10-fold reduction from the price of GPT-3.5 in March 2023, and 99% cheaper than the original davinci API price point 2 years ago.
GPT-4o mini has notably good performance on HumanEval and MATH. With its large 128k context window, low cost, and upcoming full multi-modality, it’s a compelling option for long context use-cases such as large-document RAG, code assistance on large code bases, analyzing videos, etc.
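To put that pricing in perspective, here is a back-of-the-envelope cost estimate for a long-context request. This is a rough sketch: the per-million-token prices are those quoted above, while the token counts in the example are illustrative assumptions.

```python
# Hedged sketch: estimate per-request API cost for GPT-4o mini.
# Prices are USD per 1M tokens as quoted above; token counts are assumptions.
INPUT_PRICE_PER_M = 0.15   # $ per 1M input tokens
OUTPUT_PRICE_PER_M = 0.60  # $ per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return estimated USD cost for one request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# A request filling most of the 128K context window, returning 1K tokens:
cost = estimate_cost(120_000, 1_000)
print(f"${cost:.4f}")  # about $0.0186 per request
```

At under two cents for a request that nearly fills the context window, large-document RAG over many documents becomes economically plausible.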
Mistral and Nvidia announced Mistral NeMo, a 12B model built by Mistral in collaboration with Nvidia, with 128k context length. This AI model shows good performance, with an MMLU of 68%, and is an open AI model, released under the Apache 2.0 license.
One of the strengths of Mistral NeMo is its multilingual capability. This was helped in part by using the Tekken tokenizer, that was trained on more than 100 languages and makes it easier to handle many languages and text types.
Not to be left out, Apple shows off their open AI prowess by releasing a new open model set, called DCLM, with 7B and 1.5B models. DCLM 7B performance is competitive with prior models of similar size such as Llama 3 8B, with an MMLU score of 63.7%. Vaishaal Shankar, an ML engineer at Apple, says:
To our knowledge these are by far the best performing truly open-source models (open data, open weight models, open training code) … we also release our entire training set and pre-training recipe for the community to build on top of.
DCLM stands for DataComp for Language Models and comes from the paper “DataComp-LM: In search of the next generation of training sets for language models” which we covered in the AI Research roundup on June 21. This work studied methods to refine input data to make pretraining more efficient, then proved their results by efficiently training the DCLM LLMs.
So Apple’s DCLM breaks ground in establishing a more efficient model training approach, and the fully open-source AI model gives the data and techniques to others to replicate. The DCLM 7B model weights are on HuggingFace, and the data and code are on Github.
The folks at Fal are releasing AuraFlow v0.1, an Open Exploration of Large Rectified Flow Models, sharing “the largest yet completely open sourced flow-based generation model that is capable of text-to-image generation.” You can try it out on the Fal playground here.
Anthropic released a Claude App for Android, and it’s available for use by both free users and Pro users. You can share dialogs across web and phone, take advantage of its visual AI capabilities, and use it for real-time multilingual processing. On the other hand, it doesn’t yet have the voice input-output that the ChatGPT app has, which makes it easy to use.
Coming soon: Llama 3 405B release is near, as websites are getting prepped. The anticipated release date is next Tuesday.
Top Tools & Hacks
Pinokio 2.0 was released this week by its creator CocktailPeanut. Pinokio is an open-source local application that lets you install, run, and control local AI applications and automations. You can access AI agents like Devika, vision models like Florence2, and AI image gen models like flashdiffusion and ComfyUI via Pinokio.
Pinokio 2.0 offers “A complete rethinking of how offline web apps & AI apps should work.” One new feature is zero-click local app launching from any browser:
“In your favorite web browser, simply start typing the app's name. Your browser will autocomplete the localhost URL. Surf offline apps just like online websites.”
If you want to try out Pinokio 2, here are install instructions from Olivio Sarikas:
For Ollama users: “So now tool/function calling works in Ollama.” We haven’t tried it yet, but will give it a go.
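For the curious, this is roughly what tool calling looks like with the `ollama` Python client. This is a hedged sketch: the weather function, the model name, and the prompt are illustrative assumptions, and the chat call itself needs a running local Ollama server, so it is shown commented out.

```python
# Hedged sketch of Ollama tool/function calling (the actual chat call requires
# the `ollama` package and a running local Ollama server, so it is commented out).
def get_current_weather(city: str) -> str:
    """Illustrative local function the model may ask us to call."""
    return f"Sunny in {city}"

# Tool schema in the OpenAI-style format that Ollama accepts:
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# With a server running and a tool-capable model pulled, the call would look like:
# import ollama
# response = ollama.chat(
#     model="mistral-nemo",  # illustrative model name
#     messages=[{"role": "user", "content": "What's the weather in Paris?"}],
#     tools=tools,
# )
# for call in response["message"].get("tool_calls", []):
#     print(call["function"]["name"], call["function"]["arguments"])
```

The model returns structured `tool_calls` rather than executing anything itself; your code dispatches each call to the matching local function and feeds the result back.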
AI Research News
Our AI Research Roundup article for this week covered OpenAI’s Project Strawberry and the Qwen2 technical report, with a focus on scaling experts, data, and retrieval via these three papers:
Mixture of a Million Experts
NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
The New York Times reports that The Data That Powers A.I. Is Disappearing Fast. Ironically, this article refers to the loss of “Digital Commons” but is itself behind a paywall. It references a white paper from the Data Provenance Initiative called “Consent in Crisis: The Rapid Decline of the AI Data Commons.”
The paper notes that websites are putting up barriers to crawlers and data use, with robots.txt crawling restrictions and other limits rising since 2022. AI restrictions are driven primarily by news, forums, and social media websites. They know data is valuable in the AI era and are locking it down. This may dry up the public data needed to improve AI.
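The restrictions in question are the kind any standards-compliant crawler must honor. Here is a minimal sketch using Python's stdlib robots.txt parser; the GPTBot user agent and the rules shown are illustrative of the pattern the paper describes, where AI crawlers are blocked while ordinary crawlers keep access.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt of the kind the paper describes: an AI crawler
# is blocked entirely while other crawlers are still allowed.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))      # False
print(rp.can_fetch("NewsReader", "https://example.com/article"))  # True
```

robots.txt is advisory rather than enforceable, which is why the paper pairs it with terms-of-service changes as a second, legal layer of restriction.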
AI Business and Policy
We are Eureka Labs and we are building a new kind of school that is AI native. Andrej Karpathy
On X, Andrej Karpathy announced the launch of Eureka Labs, an AI-first education company. His vision for Eureka Labs is AI enabling scaling of the best teaching methods and curriculum, where a “teacher plus AI symbiosis could run an entire curriculum of courses on a common platform.”
Our first product will be the world's obviously best AI course, LLM101n. This is an undergraduate-level class that guides the student through training their own AI, very similar to a smaller version of the AI Teaching Assistant itself.
Exa, an AI research lab redesigning search for the AI age, announced $22M in funding, led by Lightspeed Venture Partners. This investment will accelerate Exa’s mission to build the search engine for AI.
“AI needs a search engine that’s powerful and precise enough to retrieve thousands of results with the best information. That’s where Exa comes in – the first search engine built for AI.” Exa CEO Will Bryk
AI Grant, an accelerator for seed-stage AI startups, has an August 9th deadline for Batch 4 of their grants. AI startup recipients get a $250k seed VC investment plus credits from cloud and AI providers.
AI pioneer Fei-Fei Li raises $100M for World Labs, a new spatial intelligence startup. According to the Financial Times, World Labs raised the capital over two funding rounds in the past 4 months, valuing the company at more than $1 billion. Investors include Andreessen Horowitz and Radical Ventures, a Toronto-based venture fund where Li is a scientific partner.
Former FDA chief Scott Gottlieb reveals ChatGPT scores 98% on medical exams, outperforming doctors. In the study, five top AI models (ChatGPT, Claude, Google Gemini, Grok and Llama) were tested on US Medical License Exam questions. All models passed, with ChatGPT scoring highest at 98% and Anthropic’s Claude second at 90%:
It [ChatGPT-4o] provided detailed medical analyses, employing language reminiscent of a medical professional. It not only delivered answers with extensive reasoning, but also contextualized its decision-making process, explaining why alternative answers were less suitable.
Meta suspends the use of generative AI tools in Brazil due to government objections over Meta’s new privacy policy regarding use of data for generative AI. Brazil’s National Data Protection Authority (ANPD) suspended Meta’s privacy policy earlier this month. In response, Meta said it has decided to suspend the tools while it is in talks with ANPD to address the authority's doubts over generative artificial intelligence.
In related reaction-to-regulation news, Meta won't offer future multimodal AI models in EU, due to “lack of clarity” from regulators:
State of play: "We will release a multimodal Llama model over the coming months, but not in the EU due to the unpredictable nature of the European regulatory environment," Meta said in a statement to Axios.
It’s not just Meta AI models. Andrew Curran on X notes that the EU won’t be getting Apple Intelligence either, and “the next iteration of other frontier models will not release with all their capabilities intact.”
AI Opinions and Articles
Logan Kilpatrick reminds us of the vastness of 2 million token context:
People still fail to realize how wild long context is. In practice, 2 million tokens looks like:
100,000 lines of code
All the text messages you have sent in the last 10 years
16 average length English novels
Transcripts of over 400 podcasts
AI applications haven’t fully caught up to these long-context capabilities in most use cases. On the other hand, long context has limitations. As this week’s AI research highlight NeedleBench showed, AI models don’t fully comprehend multiple information sources in a long context well enough to do complex reasoning over them.