AI Week In Review 24.10.26

Genmo Mochi-1, Runway Act-One, Rhymes Allegro, Stable Diffusion 3.5, Ideogram Canvas, Midjourney image editor, Claude 3.5 Sonnet, Claude 3.5 Haiku & Computer Use, Eleven Labs' Voice Design, Cohere Embed 3 & more.

Oct 26, 2024

Figure 1. Still from Genmo Mochi-1, a newly released SOTA open-source AI video generation model.

TL;DR: This was an extremely busy week for AI releases, especially in video and image generation. So many new features and AI models to try out!

AI Tech and Product Releases

Genmo has unveiled a preview of Mochi-1, a diffusion transformer AI video generation model focused on generating realistic, complex motion. Genmo claims Mochi-1 is SOTA, with higher Elo scores than competing models:

“Mochi 1 has superior motion quality, prompt adherence and exceptional rendering of humans that begins to cross the uncanny valley.”

Mochi-1 is open source (Apache 2.0 license) but requires multiple GPUs to run. Genmo released a 480p model, with an HD version on the horizon.

Runway has introduced Act-One, a groundbreaking tool that generates expressive character performances from a driving video and a character image. It eliminates the need for motion capture or rigging, potentially revolutionizing animation and filmmaking. Hollywood is getting disrupted, quickly. The demo video is amazing.

Figure 2. Runway Act-One takes a real person’s facial expressions and dialog and transfers them to an animated character.

Rhymes has released Allegro, an AI video generation model that can generate 6-second videos at 15 frames per second and 720p resolution. Allegro is open source (Apache 2.0), small, and efficient, needing only 9.3 GB of GPU memory to run in BF16. This makes it a great option for users who want to run AI video generation locally.

Figure 3. Rhymes Allegro renders the iconic astronaut-riding-a-horse video.

As we shared earlier this week, Anthropic announced an upgraded Claude 3.5 Sonnet, a new Claude 3.5 Haiku, and Computer Use. The upgraded Claude 3.5 Sonnet got even better at coding and reasoning, surpassing even OpenAI's o1 models on some benchmarks, and supports up to 8K output tokens within its 200K-token context window.

The new Claude 3.5 Haiku will be released later this month; benchmarks show it to be competitive with GPT-4o mini.

The new Computer Use feature lets Claude operate a computer much as a person would: viewing the screen, browsing the web, and taking actions like clicking buttons, so it can act as an AI agent. While impressive (e.g., it ordered some DoorDash), the feature is still in beta, and Anthropic's reference implementation runs in Docker. Here’s how to set up Computer Use.
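To make "acting as an AI agent" concrete, here is a minimal, hypothetical sketch of the observe, decide, act loop that such agents run. It does not use Anthropic's API or reference implementation: `FakeScreen`, `scripted_policy`, and the UI element names are invented for illustration, and the "screenshot" is plain text. In the real feature, Claude plays the policy's role and sees actual screenshots.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    kind: str                     # "click", "type", or "done"
    target: Optional[str] = None  # hypothetical UI element to click
    text: Optional[str] = None    # text to type


class FakeScreen:
    """Hypothetical stand-in for a real desktop: records what the agent did."""

    def __init__(self) -> None:
        self.state: List[str] = []

    def screenshot(self) -> str:
        # A real agent receives an image; this toy "screenshot" is text.
        return ", ".join(self.state) if self.state else "home page"

    def execute(self, action: Action) -> None:
        if action.kind == "click":
            self.state.append(f"clicked {action.target}")
        elif action.kind == "type":
            self.state.append(f"typed {action.text}")


def scripted_policy(screenshot: str) -> Action:
    """Hypothetical policy standing in for the model: chooses the next
    action based only on what is currently on screen."""
    if "clicked order button" in screenshot:
        return Action("done")
    if "typed pizza" in screenshot:
        return Action("click", target="order button")
    if "clicked search box" in screenshot:
        return Action("type", text="pizza")
    return Action("click", target="search box")


def run_agent(policy: Callable[[str], Action], screen: FakeScreen,
              max_steps: int = 10) -> List[Action]:
    """Observe, decide, act; capped at max_steps so the agent cannot loop forever."""
    actions: List[Action] = []
    for _ in range(max_steps):
        action = policy(screen.screenshot())
        actions.append(action)
        if action.kind == "done":
            break
        screen.execute(action)
    return actions


screen = FakeScreen()
steps = run_agent(scripted_policy, screen)
print([a.kind for a in steps])  # ['click', 'type', 'click', 'done']
```

The step cap is a deliberate design choice: an agent acting on a real machine needs a hard stop in case the policy never decides it is done.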

Anthropic’s Claude chatbot can also now write and run JavaScript code. This lets Claude perform precise calculations, analyze data from files like spreadsheets and PDFs, and render the results as interactive visualizations and dashboards.

Eleven Labs released Voice Design, which allows users to create custom voices using plain text prompts. This feature simplifies bespoke voice creation, making it easier to generate unique and expressive voices for various applications.

Stability AI has released Stable Diffusion 3.5, improving image quality and prompt accuracy in its AI image generation models. Stable Diffusion 3.5 comes in three versions: the 8B-parameter Large, which is SOTA in prompt adherence; Large Turbo, which quickly generates high-quality images with exceptional prompt adherence; and Medium, for efficient, high-quality performance.

Figure 4. Stable Diffusion 3.5 offers great prompt adherence.

Ideogram launched Canvas, a new interface that lets users mix, match, and fine-tune AI-generated artwork. Features include Magic Fill and Extend, text generation, image combining, and text-style matching. Ideogram says Canvas provides unprecedented control and creative expression:

Ideogram Canvas is a newly designed interface, built from the ground up, to enhance your iterative creative process with the magic of AI. Ideogram Canvas will empower you to bring your creative vision to life with remarkable speed and precision.

Figure 5. Matching a text style and generating text in another image.

Midjourney has announced a new web-based image editor and image re-texturing. The update extends Midjourney, already a leading AI art generation tool, from generating images to editing them.

Figure 6. Midjourney image editing feature allows for directed AI-generated edits of images.

Cohere has made Embed 3 multimodal: it can now create embeddings for images, such as graphs and designs, as well as for text, in the same vector space. This enables multimodal search and knowledge retrieval across both kinds of content.
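To illustrate what multimodal search over embeddings looks like, here is a minimal sketch of retrieval by cosine similarity, assuming every image and text document has already been converted to a vector in one shared space. The vectors and file names are hypothetical hand-made toys, not real Embed 3 output, and the Cohere API is not called.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy embeddings, NOT real Embed 3 output. Real embeddings have
# hundreds or thousands of dimensions; these 3-D vectors are invented.
corpus: Dict[str, List[float]] = {
    "quarterly_revenue_chart.png": [0.9, 0.1, 0.2],  # image: a bar chart
    "logo_design_draft.png": [0.1, 0.8, 0.3],        # image: a design mockup
    "meeting_notes.txt": [0.4, 0.3, 0.8],            # plain text document
}


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec: List[float], corpus: Dict[str, List[float]],
           top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank every item (text or image) by similarity to the query vector."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embedding of the text query "revenue growth graph".
query = [0.85, 0.15, 0.25]
for name, score in search(query, corpus):
    print(f"{name}: {score:.3f}")
```

With these toy vectors, the chart image ranks first for a text query, which is the point of putting text and images in one space: a single similarity function serves both.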

Cohere For AI has released Aya Expanse, a family of multilingual language models (8B and 32B parameters) supporting 23 languages. The models draw on research in data arbitrage, multilingual preference training, and safety tuning to achieve high performance across languages.

Apple is rolling out its Apple Intelligence suite of AI features, releasing iOS 18.2 as a developer beta this week. Basic AI features like writing tools and image cleanup are available on devices running iOS 18.1. More advanced AI features in iOS 18.2 include Genmoji emoji generation, Image Playground, Visual Intelligence, Image Wand, and ChatGPT integration. With ChatGPT integration, Siri can route queries it can't handle to ChatGPT, effectively acknowledging Siri's limitations while leveraging the more capable GPT-4o.

Notion launched an email client, Notion Mail, at its first user conference, rethinking email for a more integrated workflow built on AI-driven organization and automation. Notion Mail is in preview and integrates with Notion Calendar.

Perplexity launched its native Mac app. The app is free but offers a Pro subscription for more in-depth searches and features like voice mode and file uploads.

IBM has released Granite 3.0, a family of LLMs for enterprise use that IBM calls state-of-the-art. The Granite 3.0 models are open source and available on Hugging Face. Granite 3.0 8B Instruct was trained on 12 trillion tokens across 12 natural languages and is intended for sophisticated workflows and tool-based use cases. Granite 3.0 also includes a 2B LLM, 1B and 3B mixture-of-experts (MoE) models for minimal latency, LLM-based input-output guardrail models, and a speculative decoder for faster inference. IBM also released a technical report on Granite 3.0.
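Speculative decoding may be the least familiar item in that list, so here is a minimal sketch of the general idea in its simple greedy form. This is not IBM's implementation: the "models" are hypothetical toy functions that return a single next token, and real systems verify all drafted tokens in one batched forward pass of the large model, which is where the speedup comes from.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: maps a token sequence to its
# greedy (most likely) next token. Real LLMs output probability distributions.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], num_tokens: int) -> List[int]:
    """Baseline: one call to the model per generated token."""
    seq = list(prompt)
    for _ in range(num_tokens):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       num_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns the generated sequence and the
    number of target verification passes. The output is identical to
    greedy_decode(target, ...)."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < num_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target model checks every proposed position. In a real system
        #    these checks happen in ONE batched forward pass.
        passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(seq + proposal[:i])
            accepted.append(expected)  # equals proposal[i], or corrects it
            if proposal[i] != expected:
                break  # discard the rest of the draft after a mismatch
        else:
            # All k drafts accepted: the same pass also yields one bonus token.
            accepted.append(target(seq + proposal))
        seq.extend(accepted)
    return seq[: len(prompt) + num_tokens], passes


def target_model(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10  # toy rule: count upward modulo 10


def draft_model(seq: List[int]) -> int:
    # Agrees with the target except right after token 5, where it guesses 0.
    return 0 if seq[-1] == 5 else (seq[-1] + 1) % 10


result, passes = speculative_decode(target_model, draft_model, prompt=[0], num_tokens=10)
print(result, passes)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0] 3
```

Here 10 tokens need only 3 verification passes of the large model instead of 10, and because every kept token is the target's own greedy choice, the output is unchanged.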

Transformers.js v3 was released with WebGPU support, offering up to 100 times faster performance than WASM for browser-based AI applications. WebGPU is a new web standard for accelerated graphics and compute. This open-source update makes it simpler to run AI models locally and privately in the browser, making them more accessible to developers and users.

Starting next week, the Google Photos app will add a disclosure when a photo has been edited with one of its AI features. The disclosure will appear in the “Details” section, noting that the photo was "Edited with Google AI." It aims to improve transparency, although no visible watermark is added within the image frame.

AI Research News

OpenAI shared research on continuous-time consistency models for image generation in the paper “Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models”. Their approach, called sCM, simplifies the formulation of continuous-time consistency models, allowing them to scale training and achieve sample quality comparable to leading diffusion models using only two sampling steps, a fraction of the inference time. Their paper is on arXiv.
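For readers new to consistency models, the core idea behind "two sampling steps" fits in a few lines. The sketch below follows the original discrete-time consistency-model formulation (Song et al., 2023) with its noise schedule running from a small time ε up to T; it is background, not sCM's own continuous-time parameterization or training objective.

```latex
% Self-consistency: every point on one probability-flow ODE trajectory
% maps to the same clean output, with the identity at the smallest time.
f_\theta(\mathbf{x}_t, t) = f_\theta(\mathbf{x}_{t'}, t') \quad \forall\, t, t' \in [\epsilon, T],
\qquad f_\theta(\mathbf{x}_\epsilon, \epsilon) = \mathbf{x}_\epsilon

% Step 1: start from pure noise and apply the model once.
\hat{\mathbf{x}} = f_\theta(\mathbf{x}_T, T), \qquad \mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, T^2 \mathbf{I})

% Step 2 (optional): re-noise to an intermediate time \tau, then map back.
\mathbf{x}_\tau = \hat{\mathbf{x}} + \sqrt{\tau^2 - \epsilon^2}\,\mathbf{z}, \quad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
\qquad \hat{\mathbf{x}}' = f_\theta(\mathbf{x}_\tau, \tau)
```

A diffusion model typically integrates the ODE over dozens of steps; a consistency model is trained to jump straight to the trajectory's endpoint, which is why one or two network evaluations can suffice.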

Meta shared new research, models, and datasets from Meta FAIR, including Segment Anything 2.1 (SAM 2.1), an update to its popular Segment Anything Model 2 for images and videos.

This week, our Research Roundup highlighted Claude’s Computer Use, assessing LLM planning, cross-capabilities in LLMs, and improved RAG using inference scaling:

Claude’s Computer Use
On The Planning Abilities of OpenAI's o1 Models
Revealing the Barriers of Language Agents in Planning
CrossEval: Law of the Weakest Link and Cross Capabilities of LLMs
Inference Scaling for Long-Context Retrieval Augmented Generation

AI Business and Policy

AI search engine startup Perplexity is now handling approximately 400 million queries monthly, up from 250 million in July. The company is exploring new features like one-click purchases for subscribers and is in talks with major brands for sponsored queries.

Universal announced a Spanish version of Brenda Lee’s “Rockin’ Around the Christmas Tree” that uses SoundLabs AI’s MicDrop technology to reconstruct her vocals. Lee, now 79, approved the AI-generated Spanish rendition and expressed her joy at introducing the song to fans in a fresh way.

Alimetry’s wearable device for diagnosing functional gastric disorders is making significant strides with AI, helping gastroenterologists more easily identify and treat various gastric disorders. The company recently announced a $35 million funding round to support its next phase of commercialization.

AI startup Pharos secured $5 million to automate hospitals’ quality reporting to external clinical registries, using AI to streamline this administrative burden. Pharos is from Y Combinator’s summer 2024 cohort.

Waymo has closed a $5.6 billion Series C funding round led by Alphabet. The funds will be used to expand into new cities and enhance its autonomous vehicle capabilities for business applications. Waymo now operates commercial robotaxi services in multiple markets, providing more than 100,000 paid rides each week.

President Biden is expected to sign an AI memorandum for national security agencies. The White House memo emphasizes human oversight in the use of AI tools, especially in critical decision-making areas like asylum granting and weapons targeting. It also directs intelligence agencies to protect AI technology from foreign espionage and mandates pre-release inspection by the AI Safety Institute to prevent misuse by terrorists or adversarial nations.

The U.K.’s Competition and Markets Authority (CMA) is launching a formal probe into Alphabet’s investment in AI rival Anthropic. This investigation is part of efforts to address quasi-mergers where Big Tech firms gain control over startups through strategic investments, with the CMA set to decide by December 19, 2024, whether to refer the case for a more detailed phase 2 investigation.

AI Opinions and Articles

fake news out of control – Sam Altman on X

The AI model release rumor mill is in overdrive. OpenAI says it doesn’t plan to release its Orion AI model this year, despite a report from The Verge that Orion (the next GPT) was set to launch by December. The Verge also reports that Gemini 2.0 is coming soon; Google has neither confirmed nor denied it.

Meanwhile Anthropic’s much-anticipated Claude 3.5 Opus has seemingly vanished from their website. Simon Willison noticed that Anthropic used to list Claude 3.5 Opus as "coming soon" but no longer mentions it at all. He muses:

... I wonder if Opus 3.5 got to the testing phase and was deemed to be "unsafe" to release?

It could be that it’s too unsafe, or that Anthropic doesn’t want to get too far ahead of the other AI labs, for safety or other reasons. Or it may be that huge AI models are too big and slow in an era when a distilled model can deliver 98% of the performance at half the cost, so Opus would just disappoint. As Levan Kvirkvelia says:

I don't understand why anyone would release Opus. just distillate it to sonnet and update sonnet. The era of public giant models has to end.

What do you think? Will super-massive LLMs go the way of the dinosaur? Is small beautiful? Or will we see 100-trillion-parameter AI models in our future?

AI Changes Everything
