AI Week In Review 24.08.17

xAI Grok-2 & Grok-2 mini, Gemini Live, Pixel 9, GPT-4o update, Claude API context caching, Imagen 3, Runway Gen 3 Turbo, Flux LoRA, Hermes 3, Sarvam 2B, Arcee Swarm, FalconMamba 7B, Minitron 4B.

Aug 17, 2024

Figure 1. Grok-2 includes Flux.1 and makes for photo-realistic images that are unreal (literally).

TL;DR – Another big week with many new AI releases. Read on …

AI Tech and Product Releases

Grok-2 and Grok-2 Mini: xAI has thrown their hat into the AI ring with not one, but two Groks! Imagine Grok-2 as the wise old sage, and Grok-2 Mini as its younger, slightly less wise but still incredibly helpful sibling. They're out there, probably arguing about the best way to make a Pan-Galactic Gargle Blaster. - Grok-2 reporting the news

Elon Musk’s xAI launched Grok-2, a frontier-level model, and Grok-2 mini in beta. Grok-2 is available as a chatbot for Premium and Premium+ users on the X platform (try it here), and the interface has been combined with FLUX.1 for (relatively) uncensored image generation. With access to X posts and current data, Grok-2 can report on current events, including news about itself (see quote above).

Figure 2. ELO ratings on Grok-2 and other AI models.

Grok-2 was pre-released on LMSYS as the “sus-column-r” model, where it obtained a stellar Elo rating of 1280. Grok-2 and Grok-2 Mini benchmark scores show impressive performance: Grok-2 achieved 92.1% on MMLU, 90.5% on HumanEval, and 82.4% on GSM8K, on par with SOTA AI models Claude 3.5 Sonnet and GPT-4o. Unlike some prior Grok releases, Grok-2 is not an open AI model.
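To put a rating like 1280 in context, the standard Elo formula turns a rating gap into an expected head-to-head win rate. The snippet below is a minimal sketch that assumes the classic 400-point logistic scale; the opponent rating of 1250 is a made-up number for illustration, not a leaderboard value.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of A over B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1280 is the rating quoted above; 1250 is a hypothetical opponent rating.
grok2_rating = 1280
opponent_rating = 1250

win_probability = elo_expected_score(grok2_rating, opponent_rating)
print(f"Expected win rate: {win_probability:.1%}")
```

Under these assumptions, a 30-point lead works out to winning roughly 54% of pairwise votes, so small gaps near the top of a leaderboard mean the models are close.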

Figure 3. Grok-2 benchmark results vs GPT-4o, Claude 3.5 Sonnet, Gemini Pro 1.5 and Llama 3 405B.

X user Lukas Ersil says, “Not only does it give accurate answers to clearly defined questions, but it also creates incredible images thanks to #Flux.” Many have shared incredible photo-realistic FLUX.1 photos, including parody photos of Elon as Spiderman, Elon with Trump on a unicorn, and various political figures. The super-realistic fakes are setting off ‘mis-information’ alarm bells.

At the Made by Google event, Google launched Gemini Live, Pixel 9, Pixel Buds Pro 2 with Gemini, and Google Assistant upgrades, infusing Gemini AI in their products:

Available to Advanced subscribers, Gemini Live makes your mobile device a voice-enabled AI assistant. It’s Google’s answer to ChatGPT Voice Mode, with 10 voices in English on Android now; it will expand to iOS and other languages soon.
Gemini is getting embedded in more Google apps: “We’re launching new extensions in the coming weeks, including Keep, Tasks, Utilities and expanded features on YouTube Music.”
Google’s Pixel Buds Pro 2 has a custom Tensor A1 chip to power Gemini functionality, so you can have the Gemini Live voice assistant directly in your ear for a "hands-free, eyes-free virtual AI assistant."
Google’s Pixel 9 smartphone comes loaded with many AI features, with AI photo enhancement, Call Notes summaries of phone calls, a built-in Tensor G4 for running AI on the phone, and more.

OpenAI released a ChatGPT GPT-4o update: It was announced on X by OpenAI. This new GPT-4o is tuned for ChatGPT and is now topping the LMSYS leaderboard. However, it is “not a new frontier-class model,” so not the long-awaited Strawberry release.

Anthropic added Prompt Caching to Claude API: Anthropic rolled out the ability for users to cache frequently used prompts and context in its API, cutting input costs by up to 90% and reducing latency by up to 85% for long prompts. This is beneficial for coding assistants, large document processing, and agentic tool use. It is similar to features in Gemini and DeepSeek AI models.

OpenRouter also will integrate prompt caching into its API, improving performance and cost efficiency for repetitive tasks.
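A back-of-the-envelope calculation shows where savings of that size can come from. This is a rough sketch, not a billing tool: it assumes cache writes cost 1.25x the base input price and cache reads cost 0.1x, that the cache stays warm across all calls, and it uses a hypothetical base price of $3 per million input tokens. Check current pricing before relying on any of these numbers.

```python
def input_cost_usd(prefix_tokens: int, fresh_tokens: int, calls: int,
                   usd_per_mtok: float, use_cache: bool,
                   write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Estimate input-token cost for `calls` requests sharing one long prefix.

    write_mult and read_mult are assumed cache pricing multipliers.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    per_token = usd_per_mtok / 1_000_000
    fresh = calls * fresh_tokens * per_token  # new tokens are never cached
    if not use_cache:
        return calls * prefix_tokens * per_token + fresh
    first_call = prefix_tokens * per_token * write_mult            # cache write
    later_calls = (calls - 1) * prefix_tokens * per_token * read_mult  # cache reads
    return first_call + later_calls + fresh


# Hypothetical workload: a 100k-token document queried 50 times.
uncached = input_cost_usd(100_000, 200, 50, 3.0, use_cache=False)
cached = input_cost_usd(100_000, 200, 50, 3.0, use_cache=True)
savings = 1 - cached / uncached
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, savings {savings:.0%}")
```

The savings approach 90% only when the shared prefix dominates the prompt and is reused many times; a single call with caching enabled actually costs more because of the write premium.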

Google released Imagen 3 for public use: Touted at their May I/O event, Google’s Imagen 3 AI text-to-image generator has now been opened to US users. It looks quite good, improving on prior versions. However, it’s not strong on text rendering and also has guardrails against images of public figures and copyrighted characters. You can try it here.

Runway Gen-3 Turbo, a 7x speedier version of Gen-3 video generation, is now available. The speedier Gen 3 Turbo is getting rave reviews on X: “Now, it only takes a couple seconds to generate a video. This is what we've been waiting for.”

Nous Research released Hermes 3: Hermes 3 is a suite of open Llama 3.1 fine-tunes, in 405B, 70B, and 8B sizes. Hermes 3 is designed to align with users, with “less censorship and more steerability.” They shared technical details in the Hermes 3 Technical Report.

India-based Sarvam AI released Sarvam-2B, a 2B parameter open weights LLM focused on 10 Indic languages, including Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

Arcee has launched Arcee Swarm, a groundbreaking mixture of agents (MoA) architecture inspired by cooperative intelligence in nature:

“Rather than relying on one LLM to handle all tasks, Arcee Swarm routes your query to a collection of smaller expert models.”
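The routing idea can be illustrated with a toy example. This is not Arcee's implementation: the expert names, keyword lists, and keyword-overlap scoring are all hypothetical stand-ins, and a real system would likely use a learned router and actual model calls.

```python
from typing import Dict, Set

# Hypothetical keyword profiles for small expert models.
EXPERT_KEYWORDS: Dict[str, Set[str]] = {
    "code": {"python", "bug", "function", "compile"},
    "math": {"solve", "integral", "equation", "proof"},
}
FALLBACK_EXPERT = "general"


def route(query: str) -> str:
    """Pick the expert whose keywords overlap most with the query."""
    words = set(query.lower().split())
    scores = {name: len(words & kws) for name, kws in EXPERT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # With no keyword match at all, fall back to a general-purpose model.
    return best if scores[best] > 0 else FALLBACK_EXPERT


for q in ["fix this python bug", "solve the equation", "hello there"]:
    print(f"{q!r} -> {route(q)}")
```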

Abu Dhabi’s TII released FalconMamba 7B, a model based on the Mamba state-space model (SSM) architecture. Trained using 5.5 trillion tokens of data from RefinedWeb, it beats leading 8B models like Llama 3.1 8B on selected benchmarks, such as GSM8K. FalconMamba 7B is available on Hugging Face.

AnswerAI introduced colbert-small-v1, a smaller, faster, better embedding model. With just 33 million parameters, it can search through thousands of documents in milliseconds, making it a good fit for document retrieval in latency-sensitive applications.
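ColBERT-family models score documents with "late interaction": each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The sketch below shows that MaxSim scoring on toy 2-dimensional vectors; the vectors and document names are invented for illustration and are not outputs of the actual model.

```python
from typing import Dict, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy per-token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
docs: Dict[str, List[Vector]] = {
    "doc_a": [[0.9, 0.1], [0.1, 0.9]],  # covers both query tokens
    "doc_b": [[0.9, 0.1], [0.8, 0.2]],  # covers only the first query token
}

ranked = sorted(docs, key=lambda name: maxsim_score(query, docs[name]), reverse=True)
print(ranked)
```

Because document token embeddings can be computed ahead of time, only the cheap max-and-sum step runs at query time, which is one reason this style of retrieval can be fast.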

Top Tools & Hacks

Figure 4. FLUX.1 can be super-photo-realistic. Image generated by FLUX.1 and shared on the StableDiffusion subreddit.

FLUX.1 is getting a lot of well-deserved hype for generating amazing, super-photorealistic images and for being less censored than other image generators. A lot of the hype is thanks to Grok-2's access to Flux; to use Grok-2 you need to pay $8/mo for X Premium. Freepik now offers access to Flux as well.

Flux now has support for LoRAs, available on Fal.ai. This X thread by Dr Cintas shows how you can train a LoRA for FLUX with your own likeness, so you can use photos of yourself to train a Flux LoRA that puts you into a Spiderman suit or some other imaginary scene (see this example on Replicate).

AI Research News

This week’s AI Research roundup covered reasoning, RAG, and AI agents:

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
OpenResearcher: Unleashing AI for Accelerated Scientific Research
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation
Improving Retrieval Augmented Language Model with Self-Reasoning

In other AI research news, NVIDIA released Minitron 4B on Hugging Face and explained How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model:

NVIDIA researchers showed that structured weight pruning combined with knowledge distillation forms an effective and efficient strategy for obtaining progressively smaller language models from an initial larger sibling.
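The two ingredients can be sketched in a few lines. This is a simplified illustration, not NVIDIA's recipe: the row-norm importance criterion is a stand-in for whatever importance estimate the actual method uses, and the weights and logits are toy values.

```python
import math
from typing import List


def prune_rows(weight: List[List[float]], keep: int) -> List[List[float]]:
    """Structured pruning: drop whole rows (neurons), keeping the `keep`
    rows with the largest L2 norm, in their original order."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight]
    top = sorted(range(len(weight)), key=lambda i: norms[i], reverse=True)[:keep]
    return [weight[i] for i in sorted(top)]


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float], student_logits: List[float],
                      temperature: float = 1.0) -> float:
    """KL divergence from the teacher's to the student's output distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy layer: the weakest neuron (first row) is removed.
pruned = prune_rows([[0.1, 0.0], [3.0, 4.0], [1.0, 1.0]], keep=2)
print(pruned)

# The smaller student is then trained to drive this loss toward zero.
print(distillation_loss([2.0, 1.0, 0.1], [1.5, 1.2, 0.3]))
```

Pruning shrinks the model by removing whole structures, and distillation then recovers accuracy by training the smaller model to match the larger model's output distribution.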

Cosine’s just-released Genie claims to be the best SWE agent, with a score of 30% on SWE-Bench.

AI researchers at Salesforce presented an even better SWE agent in the paper “Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents.” They showed that combining multiple SWE agents creates a SOTA SWE agent system. Their DEI combined agent scores 55% on SWE-Bench Lite.

Figure 5. The DEI Agent system: (a) Different SWE agents (Aider, Moatless, Agentless, OpenDevin) resolve different issues. (c) Combining and selecting best candidate solutions will (b) improve the resolve rate significantly.
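Why does combining agents help? A toy example makes the intuition concrete. Everything below is hypothetical: the issues, the agent names, the reviewer scores, and the pass/fail outcomes are invented, and the selection rule (pick the highest-scored candidate per issue) is a simple stand-in rather than the paper's method.

```python
from typing import Dict, List, Tuple

# Per issue: (agent, reviewer_score, patch_actually_passes). All values invented.
Candidate = Tuple[str, float, bool]
candidates: Dict[str, List[Candidate]] = {
    "issue-1": [("agent_a", 0.9, True), ("agent_b", 0.4, False)],
    "issue-2": [("agent_a", 0.3, False), ("agent_b", 0.8, True)],
    "issue-3": [("agent_a", 0.5, True), ("agent_b", 0.6, False)],  # reviewer errs
    "issue-4": [("agent_a", 0.2, False), ("agent_b", 0.7, True)],
    "issue-5": [("agent_a", 0.1, False), ("agent_b", 0.2, False)],
}


def single_agent_rate(agent: str) -> float:
    """Resolve rate if only one agent's patches are used."""
    hits = sum(ok for cands in candidates.values()
               for name, _, ok in cands if name == agent)
    return hits / len(candidates)


def selected_rate() -> float:
    """Resolve rate when the highest-scored candidate is picked per issue."""
    hits = sum(max(cands, key=lambda c: c[1])[2] for cands in candidates.values())
    return hits / len(candidates)


def oracle_rate() -> float:
    """Upper bound: an issue counts if any agent's patch passes."""
    hits = sum(any(ok for _, _, ok in cands) for cands in candidates.values())
    return hits / len(candidates)


print(single_agent_rate("agent_a"), single_agent_rate("agent_b"))
print(selected_rate(), oracle_rate())
```

In this toy setup each agent alone resolves 40% of issues, the selector reaches 60%, and a perfect selector would reach 80%. The gap between the last two numbers is what a better candidate-ranking step can still win back.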

AI Business and Policy

Apple Pushes Ahead With Tabletop Robot in Search of New Revenue. The robot combines an iPad with a robotic limb. Bloomberg reports “Executive who oversaw car project is leading the effort.”

OpenAI says it disrupted Iranian influence operation using ChatGPT. OpenAI took down accounts of an Iranian group for using its ChatGPT chatbot to generate content meant for influencing the U.S. presidential election and other issues.

Goodfire raises $7M for its AI observability platform. The mission of the AI startup Goodfire is to “advance humanity's understanding of AI by examining the inner workings of advanced AI models (AI Interpretability).”

Boston-based startup Consensus raised $11 million to build an AI-powered search engine for academic topics.

ElevenLabs opens European HQ in London as they push to expand globally.

EliseAI raised $75 million in a Series D for AI that helps property managers interact with renters. The AI chatbots help respond to calls from renters about things such as apartment tours, maintenance requests, lease renewals and delinquencies.

AI-powered ‘undressing’ websites are getting sued. The San Francisco City Attorney’s office is suing 16 AI-powered “undressing” websites, to take down these websites commonly used to create nude deepfakes of women and girls.

The CA bill to regulate AI has been modified as it moves forward. Opposition has grown to the bill, as California Lawmakers Face Backlash Over Doomsday-Driven AI Bill.

AI Opinions and Articles

Francois Chollet has proposed a definition of intelligence: “Intelligence is the efficiency of operationalizing past information to deal with the future, expressible as a conversion ratio using algorithmic information theory.” He shared an equation to define it.

This set off some debate on views of intelligence and whether he is considering creativity. His perspective reminds us that “AI training” is an optimization process; it can only get to zero error. If he’s right, then AI is somewhat bounded.

A consequence of intelligence being a conversion ratio is that it is bounded. You cannot do better than optimality -- perfect conversion of the information you have available into the ability to deal with future situations. You often hear people talk about how future AI will be omnipotent since it will have "an IQ of 10,000" or something like that -- or about how machine intelligence could increase exponentially. This makes no sense. If you are very intelligent, then your bottleneck quickly becomes the speed at which you can collect new information, rather than your intelligence. In fact, most scientific fields today are not bounded by the limits of human intelligence but by experimentation. - Francois Chollet

AI Changes Everything
