AI Week In Review 24.11.30

Runway Frames, Qwen QwQ-32B-Preview, Model Context Protocol (MCP), Olmo 2, ElevenLabs GenFM, Claude custom styles, Nvidia Fugatto, DeepSeek Janus, GenChess, Hymba-1.5B, Sora gets leaked.

Dec 01, 2024

Figure 1. Runway Frames AI image generations in Mise-en-scène style mode.

AI Tech and Product Releases

Alibaba's Qwen team released QwQ-32B-Preview, an open-source reasoning model with 32.5 billion parameters that competes with OpenAI's o1 series on complex reasoning tasks. We discussed QwQ and other o1-like reasoning models in our latest AI research roundup. Early benchmarks and evaluations show QwQ-32B-Preview to be a strong model, though not without weaknesses, as this reviewer's comment illustrates:

It was able to solve a couple of hard math problems so it looks very promising for maths. It didn’t do so well on my coding task (generating bash script). By the results reported on the LiveCodeBench it has room for improvement. One thing that’s become very clear to me is that the reasoning capabilities of these LLMs are significantly closing the gap between the open and closed-sourced models. … After experimenting with this model, I realized that the reasoning paths are not fully optimized and there is a lot more optimization that needs to happen before these models are used in production settings.

Qwen QwQ can be downloaded under a permissive Apache 2.0 license from HuggingFace or tried out on HuggingFace Spaces.

Anthropic introduced the Model Context Protocol (MCP), a new open-source standard for connecting AI models to data sources such as business tools, databases, and software. MCP uses a flexible, extensible client-server architecture that lets LLM applications communicate with integrations through a common interface.

The protocol aims to simplify and standardize integration between data sources and AI applications, removing the need for a custom implementation for each new data source. Anthropic reports early adopters already connecting their systems to Claude through MCP, and it is providing ready-made servers to get developers started:

To help developers start exploring, we’re sharing pre-built MCP servers for popular enterprise systems like Google Drive, Slack, GitHub, Git, Postgres, and Puppeteer.

Figure 2. An example of MCP using a SQLite database integration.
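MCP messages are JSON-RPC 2.0 requests and responses. To make the client-server idea concrete, here is a toy, stdlib-only Python sketch of the server side: it answers the protocol's tools/list and tools/call requests against an in-memory SQLite table, echoing the SQLite example in Figure 2. This is not the official MCP SDK. The handle function, the read_query tool and its schema, and the sample data are illustrative assumptions, and the transport (for example stdio) and the initialize handshake are left out.

```python
import json
import sqlite3

# Hypothetical sample data standing in for a real data source.
DB = sqlite3.connect(":memory:")
DB.execute("CREATE TABLE products (name TEXT, price REAL)")
DB.executemany("INSERT INTO products VALUES (?, ?)", [("widget", 9.99), ("gadget", 24.5)])

# Tool catalogue returned by "tools/list"; name, description, and schema are illustrative.
TOOLS = [{
    "name": "read_query",
    "description": "Run a read-only SELECT query against the products table.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]


def text_result(text: str, is_error: bool = False) -> dict:
    """Wrap text in a tool-result shape: a list of content items plus an error flag."""
    return {"content": [{"type": "text", "text": text}], "isError": is_error}


def call_tool(name: str, arguments: dict) -> dict:
    """Execute a named tool; only the hypothetical read_query tool exists here."""
    if name != "read_query":
        return text_result(f"Unknown tool: {name}", is_error=True)
    query = arguments.get("query", "")
    # Naive read-only guard for the demo; a real server would need stronger checks.
    if not query.lstrip().upper().startswith("SELECT"):
        return text_result("Only SELECT queries are allowed", is_error=True)
    rows = DB.execute(query).fetchall()
    return text_result(json.dumps(rows))


def handle(raw: str) -> str:
    """Decode one JSON-RPC 2.0 request, dispatch on its method, and encode the response."""
    req = json.loads(raw)
    method, req_id = req.get("method"), req.get("id")
    if method == "tools/list":
        result = {"tools": TOOLS}
    elif method == "tools/call":
        params = req.get("params", {})
        result = call_tool(params.get("name", ""), params.get("arguments", {}))
    else:
        # -32601 is JSON-RPC 2.0's standard "Method not found" error code.
        error = {"code": -32601, "message": f"Method not found: {method}"}
        return json.dumps({"jsonrpc": "2.0", "id": req_id, "error": error})
    return json.dumps({"jsonrpc": "2.0", "id": req_id, "result": result})


if __name__ == "__main__":
    request = {
        "jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {"name": "read_query",
                   "arguments": {"query": "SELECT name FROM products WHERE price < 10"}},
    }
    print(handle(json.dumps(request)))
```

Running the file sends one tools/call request, and the response's result.content carries the matching rows as JSON text. In a real deployment, an MCP client such as Claude Desktop would exchange these messages with the server over a transport like stdio.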

Allen AI released Olmo 2, a new family of 7B and 13B models trained on up to 5T tokens and developed entirely with open-source tools and data. These fully open models are competitive on LLM benchmarks with open-weight models such as Llama 3.1. They are available on HuggingFace and can be tried out in the Ai2 Playground.

Figure 3. Olmo 2 outperforms earlier Llama 2 and Llama 3.1 models despite training on fewer tokens and with fewer FLOPs, a testament to improved training data quality and algorithms.

ElevenLabs introduced GenFM, a feature in its ElevenLabs Reader iOS app that creates multi-speaker podcasts from uploaded PDFs, articles, and eBooks. GenFM automatically selects two voices for each podcast and adds human touches like "ums" and pauses to mimic natural conversation. It supports 32 languages, and ElevenLabs plans future updates with more customization and the ability to combine multiple sources.

Anthropic introduced a new custom styles feature for its Claude AI assistant, letting users tailor Claude's responses to specific communication needs or personal styles. The feature includes three presets (Formal, Concise, and Explanatory) plus the option to create custom styles by uploading user-provided samples.

Google has integrated Spotify into Gemini Extensions for enhanced music search and playback. Users on compatible Android devices can now search for and play music on Spotify using natural language requests through Gemini.

Nvidia introduced Fugatto, an audio generation model that creates and transforms sounds and music from text and audio inputs, including sounds it was never trained on. Fugatto's capabilities include voice transformation, sound effect creation, and musical edits such as isolating vocals or changing instruments. Nvidia says its new AI music editor can create “sounds never heard before,” like a trumpet that meows. Because of the potential for misuse, the company is still weighing how to release the model responsibly.

Runway launched Frames, an image-generation model that offers unprecedented stylistic control through a range of style options. According to Runway, “The model excels at maintaining stylistic consistency while allowing for broad creative exploration.” The company has faced copyright concerns regarding its training data sources.

Figure 4. Some of the styles available in RunwayML’s Frames image generation: Digital picture profile, dynamic landscapes, still life, Japanese anime.

Cursor and JetBrains IDEs can now integrate with the ChatGPT macOS app, letting ChatGPT see code in these apps to give better answers. OpenAI has been steadily opening up ChatGPT on macOS to work with other apps, and the list of supported apps keeps growing:

Got you covered:
+ VS Code forks: Code Insiders, VSCodium, Cursor, Windsurf
+ JetBrains: Android Studio, IntelliJ, PyCharm, WebStorm, PHPStorm, CLion, Rider, RubyMine, AppCode, GoLand, DataGrip
+ Nova & Prompt by Panic
+ BBEdit

OpenAI’s Sora API was “leaked” on HuggingFace, and some users tried it out. According to reports and posts from various users, this version of Sora can generate high-quality videos at up to 1080p resolution and up to 10 seconds long.

Artists who were beta testing Sora, OpenAI’s text-to-video model, reportedly leaked access to it to protest what they called “unpaid R&D and PR.” OpenAI disputes these claims, stating that participation is voluntary and that it is working to balance creativity with robust safety measures before broader use. Translation: no public release yet.

Top Tools & Hacks

DeepSeek’s Janus, a multimodal understanding and generation model, can now run entirely in-browser using WebGPU and Transformers.js. The model processes both text and visual inputs locally, with no server dependencies. If you want to try a fully local, in-browser multimodal model, this is worth a look.

Google Labs released GenChess, an online generative chess game that uses Imagen 3 to customize chess pieces from text prompts. Players can create personalized sets themed around various concepts and choose from different difficulty levels to play against an AI opponent. You can try GenChess here.

Figure 5. GenChess makes an origami chess set.

AI Research News

Nvidia released Hymba-1.5B, a small language model (SLM) with a hybrid architecture that runs Mamba (state-space model) heads and transformer attention heads in parallel. Hymba-1.5B outperforms comparably sized SLMs such as Qwen and SmolLM2 while using 6-12x less training. Technical details are in the paper Hymba: A Hybrid-head Architecture for Small Language Models.
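The paper's hybrid-head idea is that attention heads and SSM heads read the same input in parallel within a layer, and their normalized outputs are combined. The toy, pure-Python sketch below illustrates only that fusion step, under heavy simplifications that are our assumptions rather than Hymba's actual implementation: identity query/key/value projections, a fixed-decay linear recurrence standing in for Mamba's input-dependent selective scan, RMS normalization, and hypothetical equal fusion weights beta_attn and beta_ssm. It omits the rest of Hymba's design, such as its learnable meta tokens.

```python
import math
from typing import List

Vec = List[float]


def causal_attention(xs: List[Vec]) -> List[Vec]:
    """Single attention head with identity Q/K/V projections (a simplification)."""
    d = len(xs[0])
    out = []
    for t, q in enumerate(xs):
        # Scaled dot-product scores against positions 0..t only (causal mask).
        scores = [sum(qi * ki for qi, ki in zip(q, xs[s])) / math.sqrt(d) for s in range(t + 1)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(sc - m) for sc in scores]
        z = sum(weights)
        out.append([sum(w * xs[s][i] for s, w in enumerate(weights)) / z for i in range(d)])
    return out


def ssm_head(xs: List[Vec], a: float = 0.9, b: float = 0.1) -> List[Vec]:
    """Toy stand-in for a Mamba head: per-channel recurrence h_t = a*h_{t-1} + b*x_t."""
    h = [0.0] * len(xs[0])
    out = []
    for x in xs:
        h = [a * hi + b * xi for hi, xi in zip(h, x)]
        out.append(list(h))
    return out


def rms_norm(v: Vec, eps: float = 1e-6) -> Vec:
    """Scale a vector to unit root-mean-square so both heads contribute comparably."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]


def hybrid_layer(xs: List[Vec], beta_attn: float = 0.5, beta_ssm: float = 0.5) -> List[Vec]:
    """Run both heads on the same input in parallel, then fuse their normalized outputs."""
    attn, ssm = causal_attention(xs), ssm_head(xs)
    return [[beta_attn * p + beta_ssm * q for p, q in zip(rms_norm(u), rms_norm(v))]
            for u, v in zip(attn, ssm)]


if __name__ == "__main__":
    print(hybrid_layer([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

The intuition for running both head types side by side: the attention head can look back at any earlier token precisely, while the recurrent head carries a cheap, fixed-size summary of the whole history.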

This week’s Research Roundup covered developments in GUI Agents and o1-like reasoning models:

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
The Dawn of GUI Agent: A Case Study with Claude 3.5 Computer Use
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
Qwen QwQ-32B-Preview
The O1 Replication Journey - distilling reasoning


AI Business and Policy

AWS is hosting its re:Invent conference next week, and Amazon may unveil its flagship Olympus LLM at the event. The model can reportedly analyze images and videos to find specific scenes based on text prompts, and is aimed at reducing Amazon's reliance on Anthropic’s Claude.

Uber is using gig workers to expand into the AI labeling business. The company’s new “Scaled Solutions” division connects businesses with independent data operators for tasks like data labeling and testing, marking an extension of Uber's existing gig-based workforce model into the growing machine learning sector.

Microsoft clarified it does not use customer data from Microsoft 365 apps to train its AI models. This statement addresses recent reports and confusion arising from privacy settings in Microsoft Office that toggle “optional connected experiences,” which do not include AI training despite concerns raised by users.

Elon Musk has filed for an injunction to halt OpenAI’s transition to a for-profit company. Musk's attorneys claim that without an injunction, the defendants will engage in anticompetitive behavior, harming competition and depriving xAI of capital through unethical investor practices. The suit seeks to preserve OpenAI’s nonprofit character and prevent further self-dealing transactions by its leaders.

Elon Musk’s xAI is preparing to release a stand-alone consumer Grok chatbot app that could compete directly with OpenAI’s ChatGPT. The launch will likely follow xAI securing a $5 billion funding round.

Perplexity is considering entering hardware with a simple voice-to-voice question-answering device priced under $50. The history of AI hardware devices shows how challenging it is to succeed in this space.

A startup called /dev/agents raised $56 million to develop an operating system for AI agents. Led by former Google Android leaders, the company aims to simplify building autonomous digital assistants with a unified platform that lets developers create and integrate AI agents that handle multi-step tasks across devices, which could unlock the full potential of AI agents.

Pathway raised $10 million to develop live AI systems that can think and learn in real time. The company aims to let developers feed live data into AI systems at prompt time, improving decision-making with up-to-date knowledge.

Cradle, a biotech company leveraging AI for protein design, has raised $73 million to accelerate the development of biomolecules for various applications in medicine and industry. The funding will help scale up their software services and wet labs to support more scientists in optimizing proteins efficiently.

President-elect Donald Trump is considering naming an “AI czar” in the White House to help coordinate federal regulation and the government's use of AI.

AI Opinions and Articles

A Hugging Face employee published a dataset of 1 million Bluesky posts, then removed it after a backlash. Bluesky has an open API, and public posts can legally be scraped for AI training, but the episode raises questions about user consent for publicly available data. Bluesky is exploring ways to let users communicate their consent preferences to outside parties, though it cannot enforce them beyond its own systems.

The incident triggered online hate-venting against HuggingFace, and AI generally, from “mobs” of Bluesky users. As Alex Volkov notes on X, all that hate is misdirected:

They legit don't understand what a decentralized social media platform with a public unmonitored firehose feature means, nor do they realize the Streisand effect apparently.

A Look Back

We are hitting an important milestone. ChatGPT is celebrating its second birthday.

At the two-year mark of ChatGPT, what’s next? The AI industry is still growing, with many competitors in the fray. AI technologies are still improving rapidly but, in some areas, are starting to mature. Business models are becoming clearer. And we are no longer satisfied with mere novelty; we need utility in our AI applications.

The AI revolution has changed how we work daily, but it’s still early innings. AI continues to Change Everything.

AI Changes Everything