AI Week In Review 25.01.18

New LLMs: MiniMax-01, InternLM3, Codestral 25.01, Helium-1, ReaderLM v2. New TTS models: T2A-01-HD, Kokoro and OuteTTS 0.3. Video AI models: Luma Labs' Ray 2, Flux adds finetuning. Transformer-squared.

Jan 18, 2025

Figure 1. Still from Luma Labs Ray 2 video generation, playing violin in the rain.

AI Tech and Product Releases

Chinese AI firm Hailuo AI (MiniMax) has released and open-sourced MiniMax-01, a pair of highly competitive AI models: an LLM and a vision language model. The text-only LLM, MiniMax-Text-01, uses a Mixture of Experts (MoE) architecture with 32 experts and 456B total parameters, of which 45B are active at a time, and boasts a massive 4 million token context. Core benchmark results (see Figure 2) show that MiniMax-Text-01 is competitive with leading AI models such as GPT-4o and Claude-3.5-Sonnet.
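To make the "456B parameters, 45B active" figure concrete, here is a minimal sketch of how a Mixture of Experts layer activates only a few experts per token. It is a toy illustration, not MiniMax's implementation: the top-2 routing (`TOP_K`), the router scores, and the helper names `softmax` and `route` are assumptions made for this example.

```python
import math

TOTAL_PARAMS_B = 456   # total parameters (billions), from the text
ACTIVE_PARAMS_B = 45   # parameters active per token (billions), from the text
NUM_EXPERTS = 32       # number of experts, from the text
TOP_K = 2              # assumption for illustration; not stated in the text


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, k=TOP_K):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


# Share of the model's parameters that does work for any single token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Hypothetical router scores for one token: experts 5 and 17 score highest.
scores = [0.0] * NUM_EXPERTS
scores[5] = 2.0
scores[17] = 1.0
gates = route(scores)

print(f"active fraction: {active_fraction:.1%}")
print(f"selected experts: {sorted(gates)}")
```

Only the selected experts run for that token, which is why the per-token compute tracks the roughly 10% active share rather than the full parameter count.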

Their vision language model, MiniMax-VL-01, builds on the LLM by adding a lightweight Vision Transformer module and training on an additional 512 billion vision-language tokens.

MiniMax reports that MiniMax-VL-01 reaches SOTA results on vision tasks. They have shared technical details on both models in the MiniMax-01 Technical Report.

Figure 2. MiniMax model benchmark performance is competitive with Qwen, GPT-4o, and Claude 3.5-Sonnet.

The MiniMax models boast competitive performance metrics and large context windows but come with restrictive licensing terms that belie their claim to being open source. The company also faces controversy and lawsuits in China for training its AI models on copyrighted materials.

Hailuo AI (MiniMax) has also released a text-to-audio AI model called T2A-01-HD, which performs text-to-speech (TTS) synthesis with “unmatched versatility, emotional depth, and multilingual authenticity.” It can clone a voice from only 10 seconds of audio, offers over 300 pre-built voices, and speaks more than 17 languages, including English with varying accents. It’s available for developers via API or to try here.

The recently released Kokoro TTS model is an 82 million parameter text-to-speech model that has generated a lot of excitement as a state-of-the-art open-source TTS model that you can run locally. This small model is efficient enough to generate 2 minutes of speech in 4 seconds on a T4 GPU, and its quality matches TTS models many times its size.
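The throughput claim above works out as a simple ratio. The short sketch below computes it; the function name `realtime_speedup` is a hypothetical helper, and the one-hour extrapolation assumes generation time scales linearly with audio length, which the source does not confirm.

```python
def realtime_speedup(audio_seconds, compute_seconds):
    """Seconds of audio produced per second of wall-clock generation time."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


# Figures quoted in the text: 2 minutes of speech generated in 4 seconds on a T4.
kokoro_speedup = realtime_speedup(audio_seconds=2 * 60, compute_seconds=4)

# Naive extrapolation (assumes linear scaling): time to generate one hour of audio.
one_hour_compute_seconds = 3600 / kokoro_speedup

print(f"{kokoro_speedup:.0f}x faster than real time")
print(f"about {one_hour_compute_seconds:.0f} seconds for one hour of speech")
```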

Kokoro can be run locally in your browser using Kokoro.js, a JavaScript library for Kokoro inference. Sam Witteveen has a breakdown on using Kokoro TTS locally and creating custom voices. You can also try it here.

OuteAI has released OuteTTS 0.3, a TTS model built on LLMs that offers zero-shot voice cloning. OuteTTS 500M is a 500 million parameter TTS model based on Qwen2.5-0.5B; it is open weight for non-commercial use and available via HuggingFace.

Shanghai AI Lab has released InternLM3, the latest version in their InternLM LLM series. InternLM3-8B-Instruct is an open-source (Apache 2.0 License) AI model with performance comparable to Qwen2.5-7B and GPT-4o-mini. The project page is on GitHub and the model is available on HuggingFace.

OpenAI is officially rolling out a new user interface for ChatGPT’s custom instructions, giving users more control over how ChatGPT responds and behaves. The feature will initially be available on ChatGPT.com and the Windows desktop app.

OpenAI is testing a new feature that lets ChatGPT users in the U.S. and India sign up with just a phone number, though restrictions apply, such as the inability to upgrade to paid plans without email verification.

Mistral has released Codestral 25.01. This updated version of their original Codestral features a more efficient architecture that can generate and complete code 2 times faster while achieving SOTA results on coding benchmarks. The model is available for use in Continue.Dev and other AI coding assistant environments.

Luma Labs has released their Ray2 video model:

Ray2 is a large-scale video generative model capable of creating realistic visuals with natural, coherent motion. It has a strong understanding of text instructions and can take image and video as input.

Ray2 features advanced capabilities - producing fast coherent motion, ultra-realistic details, and logical event sequences - as a result of being trained on Luma’s new multi-modal architecture scaled to 10x compute of Ray1. Text-to-video generation is available in Ray2 now, with image-to-video, video-to-video and editing capabilities coming soon.

Black Forest Labs has announced finetuning for FLUX Pro and Ultra via API. This enhancement allows creators to use their own images and concepts to customize FLUX.1, giving them more control over the final results and bringing new customization capabilities to the FLUX Pro models.

Figure 3. BFL’s finetuning for FLUX.1 can leverage FLUX’s other customization features (depth, inpainting, style modes) for controllable, customized, finetuned image generation.

Kyutai Labs has released Helium-1 2B, a small language model (SLM) designed for use on edge devices.

Jina AI has released ReaderLM v2, a specialized 1.5B small language model for high-quality HTML-to-Markdown conversion and HTML-to-JSON extraction. It can be used to prepare webpages for LLM processing, which is extremely useful in AI agent flows. It’s available on HuggingFace.

Apple’s not-quite Intelligence generates fake news! Apple is pulling its AI notification summary feature for news and entertainment apps due to hallucinations, after users and media organizations complained that the summaries contained misleading content. Apple is also updating all notification summaries to increase transparency, such as showing them in italics and adding a beta warning.

Microsoft introduced a Pay-As-You-Go plan for Copilot with new Copilot Chat. The plan, powered by GPT-4o, offers features like asking business-related questions and generating images on a flexible pricing model. It aims to win over users with AI-powered productivity features in a more accessible form than the full Microsoft 365 Copilot suite.

Google announced that AI features will be included for Workspace customers at no extra add-on cost, alongside a $2 monthly price increase per user. The update includes tools like email summaries and automated notetaking previously available only via the $20 Gemini for Workspace add-on, now bundled into the standard $14 plan.


AI Research News

Sakana AI has presented a novel self-adapting extension of transformers in the research paper “Transformer²: Self-adaptive LLMs.” The self-adaptation framework adapts LLM weights during inference for new tasks. To adjust weights on-the-fly, they use task-specific "expert" vectors that are dynamically mixed to obtain targeted behavior for an incoming prompt. The authors claim:

Transformer² represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems.

This method may be viewed as a creative combination of test-time compute and fine-tuning; the authors report that it outperforms existing fine-tuning approaches such as LoRA.
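The dynamic mixing of expert vectors described above can be illustrated with a toy sketch. It assumes each expert vector holds scaling factors for a layer's singular values and that the mixing weights come from some analysis of the incoming prompt; the function names, vector sizes, and all numbers here are hypothetical, and this is not Sakana AI's implementation.

```python
def mix_expert_vectors(expert_vectors, mixing_weights):
    """Weighted sum of expert vectors: z = sum_k alpha_k * z_k."""
    names = list(expert_vectors)
    dim = len(expert_vectors[names[0]])
    mixed = [0.0] * dim
    for name in names:
        alpha = mixing_weights.get(name, 0.0)
        for i, value in enumerate(expert_vectors[name]):
            mixed[i] += alpha * value
    return mixed


def adapt_singular_values(singular_values, z):
    """Scale each singular value by the matching entry of the mixed vector."""
    return [s * zi for s, zi in zip(singular_values, z)]


# Hypothetical task-specific expert vectors (assumed learned ahead of time).
experts = {
    "math": [1.5, 1.0, 0.5],
    "code": [0.5, 1.0, 1.5],
}
# Hypothetical mixing weights inferred from the incoming prompt.
alphas = {"math": 0.75, "code": 0.25}

z = mix_expert_vectors(experts, alphas)
adapted = adapt_singular_values([4.0, 2.0, 1.0], z)

print(z)        # mixed expert vector for this prompt
print(adapted)  # singular values after prompt-specific scaling
```

The base weights stay frozen in this picture; only the small mixed vector changes from prompt to prompt, which is what keeps the adaptation cheap at inference time.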

AI Business and Policy

Perplexity has acquired Read.cv, a professional social media platform competing with LinkedIn, marking Perplexity’s continued investment in corporate-focused functionalities. As part of the deal, Read.cv will wind down operations.

OpenAI says it trained a new AI model called GPT-4b micro with Retro Biosciences. The model aims to help extend human lifespan by re-engineering proteins that can transform adult cells into young stem cells. Retro is backed by Sam Altman, and this collaboration, spanning about a year, represents OpenAI’s first custom-built biological research model. Both parties plan to release research on the model's outcomes and applications in building human organs and cell replacements.

Nvidia released three new NIM Microservices for enhanced AI control and safety. The services include content safety measures, topic-focused conversation control, and protection against jailbreak attempts, all part of Nvidia’s NeMo Guardrails to help enterprises secure their AI applications better.

Meta CEO Mark Zuckerberg defended the use of pirated e-books in AI training in a deposition in Kadrey v. Meta Platforms, a copyright lawsuit against Meta. Zuckerberg compared Meta’s use of the LibGen dataset to YouTube's approach toward copyrighted content, arguing for a nuanced stance on "fair use."

Court releases of Meta documents also reveal that Meta’s AI leaders were intensely focused on surpassing OpenAI's GPT-4 while developing Llama 3. Executives discussed aggressively obtaining training data, including copyrighted materials, highlighting the competitive pressure within the company.

Could Chinese AI models in U.S. products unwittingly spread Chinese propaganda? Boox, an e-reader using ByteDance’s Doubao AI, was found generating responses that mirrored Chinese government propaganda when questioned about sensitive topics like Tiananmen Square and North Korea. These risks from Chinese generative AI may impact Western AI applications, given the rise of Chinese AI models and the CCP-promoting “RedBook” TikTok replacement social media app.

Younger Gen Zers are embracing OpenAI’s ChatGPT for schoolwork, according to a new survey by the Pew Research Center. About 26% of U.S.-based teens aged 13 to 17 have used ChatGPT for homework, double the share from two years ago. One concern: young users' awareness of ChatGPT’s limitations remains low.

Several Big Tech alliances with news organizations were announced recently:

Google is integrating a real-time information feed from The Associated Press into its Gemini chatbot app to enhance user experience with up-to-date content. The rollout timing and regional availability remain undisclosed.
Axios announced a partnership with OpenAI to expand its local newsletters into four new cities: Pittsburgh, Kansas City, Boulder, and Huntsville.
Mistral has announced a content deal with Agence France-Presse (AFP) to enhance the accuracy of its chatbot Le Chat. This marks Mistral’s first such agreement, enabling Le Chat to access AFP’s extensive archive and daily news output in six languages.

AI startup funding news:

London-based AI startup Synthesia has raised $180 million, valuing the company at $2.1 billion. Synthesia’s AI avatar technology is popular worldwide, with 60,000 businesses and over a million users adopting it. The company plans to focus on building technology in-house alongside leveraging APIs for specialized functions.
AI recruitment platform Maki secured $28.6 million in Series A funding. Maki claims its conversational AI agent technology automates up to 80% of recruiting tasks, streamlining the hiring process and providing personalized feedback to candidates.
AI orchestration startup Nexos.ai emerged from stealth with $8 million in funding aimed at easing enterprise adoption of LLMs. Their platform promises to enhance visibility, security, and cost management for LLMs through an API that supports over 200 AI models.
Rockfish raised $4 million to develop synthetic data solutions that help enterprises manage their data silos using generative AI.

OpenAI appointed an executive at investment firm BlackRock to its board of directors. Just-appointed Adebayo “Bayo” Ogunlesi is CEO of Global Infrastructure Partners and a senior managing director at BlackRock. He brings extensive experience in infrastructure investing and commercial leadership to OpenAI’s board.

AI Opinions and Articles

Matthew Berman on YouTube says test-time scaling will be BIGGER than anyone realizes, declaring:

Test-time compute is the most important breakthrough since transformers changed the world in 2017.

I agree. His review of this topic is worth watching. This confirms our own perspective on the importance of test-time compute, as we wrote up in “The Fourth Turning.”

AI Changes Everything