AI Week In Review 24.08.31
Qwen2-VL, Gemini Flash 8B, Google Gems, Claude Artifacts for all, CogVideoX-5B, Magic's 100M context, Cerebras inference tops speed test, Hermes Function Calling V1, GameNGen plays DOOM.
AI Tech and Product Releases
Alibaba Released Qwen2-VL. Alibaba Cloud's Qwen team has released Qwen2-VL, a new multimodal LLM series, available in 72B, 7B, and 2B parameter versions, that excels in image and video understanding, supports multiple languages, and can be deployed on edge devices such as mobile phones.
They released Qwen2-VL-2B and Qwen2-VL-7B as open models under the Apache 2.0 license, and released Qwen2-VL-72B via an API.
Qwen2-VL beats SOTA models GPT-4o and Claude 3.5 Sonnet on image-based math problem-solving (MATH-Vision, MathVista benchmarks), document QA and diagram understanding, general visual QA (MMT-Bench), and other visual understanding tasks.
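For developers who want to try the open checkpoints, here is a minimal sketch of image Q&A with Qwen2-VL-2B-Instruct via Hugging Face transformers. The checkpoint ID and chat-message format are assumptions based on the release; the model card has the canonical recipe (including the qwen_vl_utils helper).

```python
# Hedged sketch: image Q&A with the open Qwen2-VL-2B-Instruct weights via transformers.
# The checkpoint ID and chat-message format are assumptions; check the model card.
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # assumed HuggingFace checkpoint name
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What trend does this chart show?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]  # strip the echoed prompt
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```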
Google rolled out three new experimental Gemini models: a smaller Gemini 1.5 Flash 8B, a stronger Gemini 1.5 Pro, and an improved Gemini 1.5 Flash. These experimental releases are available to try out in Google AI Studio and the Vertex AI API. Google’s Logan Kilpatrick says:
“we are releasing experimental models to gather feedback and get our latest updates into the hands of developers. What we learn from experimental launches informs how we release models more widely”
Google is rolling out Gems, its customized AI chatbots for Gemini. Gemini subscribers can now create custom AI chatbots called "Gems" and personalize them by giving them distinct personalities and specialized knowledge.
Like OpenAI’s custom GPTs, Google’s Gems can serve as personalized AI chatbots for specific roles and tasks, such as financial or career advisor, cooking partner, or specialized AI tutor. Google is launching Gems to users across over 150 countries and in most languages, available on both mobile and desktop.
In the same update, Google announced that its latest AI image generation model, Imagen 3, is coming to the Gemini apps. They also mentioned support for generating images of people, albeit with limits to avoid abuse:
“we’ve made significant progress in providing a better user experience when generating images of people. We don’t support the generation of photorealistic, identifiable individuals, depictions of minors or excessively gory, violent or sexual scenes.”
Anthropic has made Artifacts available to all users. Anthropic announced the general availability of its Artifacts feature across its Free, Pro, and Team tiers, as well as on the official Claude iOS and Android mobile apps. Artifacts is a dedicated window in the Claude chatbot interface for displaying code, running programs, and rendering charts or graphics. Users can now get Claude with Artifacts to write and run programs from their mobile phones.
The Zhipu AI developers behind CogVideo have released the CogVideoX-5B video generation model, their biggest and highest-quality open-source text-to-video model. CogVideoX-5B can run on a low-cost 12GB GPU and can be tried on the CogVideoX-5B HuggingFace Space. Their recent paper on CogVideoX goes into more detail. This is great news for making text-to-video more accessible.
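For those who want to run it locally rather than on the Space, here is a hedged sketch using Hugging Face diffusers; the pipeline class, checkpoint ID, and memory-saving calls are assumptions drawn from the diffusers CogVideoX integration, so verify against the model card.

```python
# Hedged sketch: text-to-video with CogVideoX-5B via diffusers (assumed checkpoint ID).
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # offload idle submodules so a ~12GB GPU can cope
pipe.vae.enable_tiling()         # tile VAE decoding to cut peak memory further

video_frames = pipe(
    prompt="A golden retriever surfing a small wave at sunset, cinematic lighting",
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(video_frames, "surfing_dog.mp4", fps=8)
```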
Magic Dev announces LTM-2, their 100 million token context window model: Magic Dev announced LTM-2 (short for Long-Term Memory), officially confirming their rumored 100M token context model, though the company remains in stealth. They also announced a partnership to build two new supercomputers on Google Cloud.
Cerebras has announced the world’s fastest AI inference. AI chip-maker Cerebras is offering an inference service that runs Llama 3.1 70B at a record speed of 450 tokens per second.
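Cerebras' hosted inference is reportedly reachable through an OpenAI-compatible chat API, so checking the speed claim from Python is straightforward. A minimal sketch, assuming the endpoint URL and model identifier below (consult Cerebras' documentation for the exact values):

```python
# Hedged sketch: calling Cerebras inference through the OpenAI Python client.
# Base URL and model name are assumptions; consult Cerebras' documentation.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed OpenAI-compatible endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],
)
response = client.chat.completions.create(
    model="llama3.1-70b",                     # assumed model identifier
    messages=[{"role": "user", "content": "Give me a one-sentence summary of this week in AI."}],
)
print(response.choices[0].message.content)
```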
Nous Research released Hermes Function Calling V1, a new open dataset for training models on tool use, function calling, and structured output; it is the dataset that gave Hermes 2 Pro its capabilities. The dataset is available on HuggingFace.
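To poke at the data yourself, here is a minimal sketch with the HuggingFace datasets library; the dataset ID and config name are assumptions, so check the dataset card for the exact spelling and available subsets.

```python
# Hedged sketch: loading the Hermes Function Calling V1 dataset (assumed ID and config).
from datasets import load_dataset

ds = load_dataset("NousResearch/hermes-function-calling-v1", "func_calling", split="train")
print(ds)     # dataset summary: number of rows and column names
print(ds[0])  # one example: a conversation annotated with tool/function-call turns
```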
Top Tools & Hacks
There’s been a lot of buzz lately over the AI-based code editor Cursor, thanks to a recent Andrej Karpathy shoutout and a recent funding round. The magic of Cursor comes from stacking AI code completion and generation on a fork of VS Code to build a powerful AI-first programming environment. It has powerful auto-completes, AI chat assistance, and AI generation of code snippets, functions, or whole programs. Cursor is worth trying, but after the free trial it costs $20/month.
There’s also an open-source alternative, an AI-powered code editor called Zed AI, powered by Anthropic's Claude 3.5 Sonnet. You can also combine Continue and Ollama to get an AI coding co-pilot in VS Code.
AI Research News
This week’s AI research roundup covered these recent papers:
Controllable Text Generation for Large Language Models: A Survey
Foundation Models for Music: A Survey
Writing in the Margins: Better Inference Pattern for Long Context Retrieval
Language Model Can Listen While Speaking
To Code, or Not To Code? Exploring Impact of Code in Pre-training
There’s been excitement over GameNGen, an AI model from Google Research that generates DOOM gameplay in real time. As shared on the GameNGen project page and in their paper “Diffusion Models Are Real-Time Game Engines,” GameNGen simulates the classic game DOOM at over 20 frames per second on a single TPU.
This breakthrough portends a future where video games might be creations of AI simulation engines that “enable real-time interaction with a complex environment over long trajectories at high quality.” It could be revolutionary.
LinkedIn AI researchers announced Liger Kernel, which delivers huge efficiency gains for training LLMs:
In the same spirit as Flash-Attn, but for layers like RMSNorm, RoPE, SwiGLU, and CrossEntropy! Increases multi-GPU training throughput by 20% and reduces memory usage by 60% with kernel fusion, in-place replacement, and chunking techniques.
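In practice, Liger is applied by monkey-patching the matching Hugging Face model classes before training. A minimal sketch, assuming the Llama patching helper described in the project's README; verify the exact helper names against the liger-kernel docs for your transformers version.

```python
# Hedged sketch: swapping in Liger's fused Triton kernels for a Llama-style model.
# The patching helper name is an assumption taken from the project README.
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Replace RMSNorm, RoPE, SwiGLU, and cross-entropy in HF's Llama modules with fused kernels.
apply_liger_kernel_to_llama()

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
)
# ...then train as usual with Trainer / FSDP / DeepSpeed; the patch is transparent.
```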
Nvidia released NVEagle, an impressive vision-capable multimodal language model that comes in 7B and 13B parameter sizes, including a 13B variant fine-tuned for chat. They also published an associated paper, “EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders.” Eagle explores using a mixture of vision encoders and fusion strategies for visual perception. Their findings led to the Eagle family of MLLMs, which surpasses other open-source models on MLLM benchmarks.
AI Business and Policy
“We're actually seeing the momentum of generative AI accelerating.” - Jensen Huang, Nvidia CEO
Nvidia’s earnings are moving the whole stock market. Nvidia reported strong Q2 results, with a record $30 billion in revenue, up 122% from last year, and earnings and forward guidance above prior estimates. Even so, the stock declined after the report.
The Nvidia earnings call indicated that the H200 platform ramped sales this quarter, that China data center revenue is growing and significant, and that “sovereign AI revenue will reach low double-digit billions this year.” Rumored Blackwell delays are real, but are due to a mask spin to improve yields; Blackwell will ship in volume by year-end. The race to build out AI infrastructure continues, and Nvidia remains the dominant AI hardware supplier.
The Information reported that OpenAI demonstrated its secretive AI model, "Strawberry," to U.S. National Security officials. This model enhances AI reasoning and is being used to help train OpenAI's next-generation AI, code-named “Orion.”
In related news, OpenAI and Anthropic signed agreements with the U.S. government to test their AI models. The deals are with the US AI Safety Institute, giving it access to major new models prior to release, and aim to ensure safe and ethical AI use amid growing regulatory scrutiny.
OpenAI is in talks to raise billions in a new funding round that could value the company at over $100 billion. Led by Thrive Capital with a $1 billion commitment, and with Apple and Nvidia among potential investors, the round would mark the largest investment in OpenAI since Microsoft’s $10 billion infusion in January 2023.
Generative AI coding startup Magic has raised $320 million in funding from Eric Schmidt, Atlassian, and others, on top of their $100 million February raise. Magic, as noted above in our weekly, also announced their 100M token context model (LTM-2) and their partnership with Google Cloud.
Amazon will use Anthropic's Claude for its Alexa revamp due for release in October. Amazon's in-house AI was not up to snuff, so the company turned to Claude, which performed better.
AI-powered coding assistant developer Codeium has raised $150 million in funding at a $1.25B valuation. Codeium competes with Cursor, GitHub Copilot, and other AI coding agents. Why the rich valuation? TechCrunch notes: “Polaris Research projects that the AI code tools market will be worth $27.17 billion by 2032.”
California Legislature Passes SB 1047: California's controversial AI regulation bill, SB 1047, has passed the State Assembly and is now one step away from Governor Gavin Newsom's desk. Most tech companies have balked at the bill, arguing it will dampen AI innovation, but Elon Musk expressed support, while Anthropic shifted to cautious support for SB 1047 after heavy-handed regulatory language in the bill was watered down.
AI Opinions and Articles
TechCrunch airs the music industry’s angst about AI in “Grammy CEO says music industry also has AI concerns.” There is a mix of positive and negative sentiment around the use of AI in music, but one (anonymous) musician who works in Big Tech expressed a desire for a positive mindset:
“I think a lot of musicians, particularly the ones who haven’t ‘made it,’ are taking a glass-half-empty perspective on AI. Just as the industrial revolution did not lead to widespread unemployment and in fact quite the opposite, more creatives, especially musicians, should flip their mindset and lean in.”