AI Week In Review 24.09.07
Yi-Coder 9B, Reflection 70B, DeepSeek V2.5, Replit Agent for coding, Claude for Enterprise, Cohere updates Command-R and Command-R+, Groq Cloud goes Multi-modal, vLLM 0.6, OLMoE-1B-7B, AlphaProteo.
Top Tools & Hacks
This week, Chinese AI firm 01.AI released Yi-Coder 1.5B and 9B, small but mighty LLMs for code. The open weights Yi-Coder 1.5B and 9B come in base and chat versions with a 128K context window. Yi-Coder 9B delivers state-of-the-art coding performance, scoring 85.4% on HumanEval Python and 73.8% on MBPP. It achieved this by building on Yi-9B with continued pretraining using 2.4 trillion additional tokens of code, comprising over 52 major programming languages.
Yi-Coder 9B's SOTA coding performance makes it a great local AI model to try out for AI coding tasks. It is available on HuggingFace and in Ollama, and I will be adding Yi-Coder to my Continue AI code assistant setup in VSCode.
AI Tech and Product Releases
Matt Shumer and Glaive AI announced Reflection 70B, an open-weights fine-tune of Llama 3.1-70B with stellar reported performance. It scores an incredible 89.9% on MMLU, 91% on HumanEval, and 55% on GPQA, outperforming Llama 3.1 405B and GPT-4o and even beating Claude 3.5 Sonnet on most benchmarks. Reflection 70B is available on HuggingFace.
Reflection 70B is so good because they fine-tuned the AI model to do chain-of-thought and reflection during inference. This technique trains the AI model to plan, think, reflect, then reply during inference to craft a better answer. AI models like Orca have used similar fine-tuning techniques, so we can expect this approach to improve AI models generally. From Matt Shumer:
405B is coming next week, and we expect it to outperform Sonnet and GPT-4o by a wide margin.
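The plan, think, reflect, then reply pattern can be approximated at the prompt level. The sketch below is only an illustration of that pattern: Reflection 70B is described as learning this behavior through fine-tuning, whereas this code chains separate model calls, which is a different technique. The names `reflective_answer`, `STAGES`, and `stub_model` are hypothetical, and `stub_model` stands in for whatever real LLM call you would use.

```python
from typing import Callable, Dict

# Hypothetical stage names, taken from the plan/think/reflect/reply description.
STAGES = ("plan", "think", "reflect")


def reflective_answer(question: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Run plan -> think -> reflect -> reply as separate model calls."""
    notes: Dict[str, str] = {}
    context = f"Question: {question}\n"
    for stage in STAGES:
        # Each stage sees the question plus the output of all earlier stages.
        notes[stage] = generate(f"{context}\nNow {stage} before answering.")
        context += f"[{stage}] {notes[stage]}\n"
    # The final reply is conditioned on the accumulated reasoning.
    notes["reply"] = generate(f"{context}\nGive the final answer.")
    return notes


def stub_model(prompt: str) -> str:
    # Assumption: placeholder for a real LLM call; it only reports prompt size.
    return f"(model output for a prompt of {len(prompt)} characters)"


result = reflective_answer("What is 17 * 6?", stub_model)
for stage, text in result.items():
    print(f"{stage}: {text}")
```

Swapping `stub_model` for a function that calls a local model would let you compare answers with and without the intermediate stages.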
DeepSeek released DeepSeek-V2.5, a 128K context MoE model with 21B active parameters and 236B total parameters. This new model combines the general and coding abilities of the two previous DeepSeek MoE versions, DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct, making it very capable in both writing and coding. The open weights DeepSeek V2.5 is available on HuggingFace.
Replit launched Replit Agent, an AI coding agent that codes and deploys full apps from prompts.
Replit has had AI assistance in their integrated software edit-and-deploy environment, but Replit Agent takes it to a new level. Replit Agent is available for beta access to subscribers. Martin Bowling's review says “In 2:43 the new Replit code agent built me a working wordle clone. … The barrier to build is now 0.”
Anthropic introduced Claude for Enterprise, for business users collaborating on AI with internal information. To support these use cases, it comes with an expanded 500K context window, native GitHub integration, enterprise-grade security features (like SSO and role-based access), and a guarantee that there's no training of any kind on chats or files.
In not-so-friendly Anthropic news, Claude sometimes invisibly adds instructions and additional context into prompts, in both their UI and API interfaces. If you mention a copyrighted character or work, or ask for verbatim recitation of anything, it triggers prompt re-writing.
Cohere released updates to Command R and Command R+, offering improvements in efficiency, cost, speed and performance on math and reasoning. These models are particularly suited for retrieval-augmented generation (RAG) and tool usage. They note:
“Command R, our fastest and most efficient model, has demonstrated material gains across the board and is now on par with the prior version of the much larger Command R+.”
GroqCloud is expanding support to multi-modal AI models, covering image, audio & text, starting with support for Llava V1.5 7B.
MMMU Pro, a more robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark, has been released. The MMMU authors published the benchmark dataset on HuggingFace along with a paper on it. MMMU and MMMU Pro are evaluation benchmarks for multi-modal LLMs.
LangChain released LangGraph.js v0.2, a framework for building dependable agents in Javascript.
The high-performing LLM inferencing library vLLM has a new release: vLLM v0.6 improves throughput by 2.7 times and reduces latency by 5 times.
AI Research News
Our AI Research Roundup for this week covered:
Sapiens: Foundation for Human Vision Models
Law of Vision Representation in MLLMs
Simulated communities of AI agents with Project Sid.
Automated Design of Agentic Systems
The "Estimate, Extrapolate, and Situate" Self-Training Algorithm for Robots
Flexible and Effective Mixing of LLMs into a Mixture of Domain Experts
OLMoE (Open Mixture-of-Experts Language Models). The Allen Institute for AI released OLMoE-1B-7B, a Mixture-of-Experts (MoE) LLM with 1B active parameters and 7B total parameters. It is 100% open-source, with open weights, code, logs, and Technical Report.
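To make "active" versus "total" parameters concrete, here is a toy sketch of top-k expert routing, the general mechanism that lets an MoE model run only some of its experts per token. It is not OLMoE's actual router: the logits, the choice of `k=2`, and the function names are illustrative assumptions. Only the 1B and 7B figures come from the text above.

```python
import math
from typing import List, Tuple


def top_k_route(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over only the chosen logits, so the k weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_params / total_params


# Made-up router scores for four hypothetical experts.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print("selected experts:", [index for index, _ in routes])

# Figures from the text: 1B active parameters out of 7B total.
olmoe_fraction = active_fraction(1e9, 7e9)
print(f"active share: {olmoe_fraction:.1%}")
```

Because only the selected experts run, per-token compute tracks the active parameter count, while memory use still tracks the total.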
In other notable research news, Google DeepMind introduced AlphaProteo, which generates novel proteins for biology and health research. They shared their research in the paper “De novo design of high-affinity protein binders with AlphaProteo.” AlphaProteo helps generate promising protein binders that link to other proteins, helping to advance drug design and biological understanding.
ByteDance just shared Loopy, an avatar animation project, which turns an image and an audio track into a talking video. Technically, Loopy is an end-to-end audio-only conditioned video diffusion model; stylistically, Loopy is highly realistic, generating video of people talking to the audio track, with realistic facial expressions, head movements, even vocal cord movements. The paper “Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency” describes the technical details.
AI Business and Policy
Safe Superintelligence (SSI), the AI company founded just 3 months ago by Ilya Sutskever, raised $1 billion at a $5 billion valuation. The funds will be used to acquire computing power. The eye-popping valuation is a testament to the tech community’s respect for Ilya and his 10-person team, and the faith that they’ll deliver something incredible.
The head of OpenAI Japan indicated they will release "GPT Next" this year and it will be significantly more powerful than GPT-4, with improved multi-step reasoning. Then OpenAI walked back the story, saying timelines aren’t firm and “100x GPT-4” is illustrative.
There are reports that OpenAI will launch their reasoning engine Strawberry within ChatGPT this fall. Other leaks state that the next big model is called Orion, and Project Strawberry is being used to train it by generating synthetic data. Another report says OpenAI is considering charging a $2,000-a-month subscription for their next big AI model. The rumors and leaks are getting exhausting. OpenAI, give us a clear roadmap and just ship it.
Elon Musk’s xAI brought the massive AI training cluster “Colossus” online. Colossus is currently the world's most powerful AI supercomputer, equipped with 100,000 Nvidia H100 GPUs. With plans to double its capacity soon, Colossus is expected to significantly accelerate the development of xAI’s Grok AI models, potentially surpassing other AI models by the end of the year. Elon gloats that:
This is a significant advantage in training the world’s most powerful AI by every metric by December this year.
On the horizon are even more ambitious AI mega data centers. The Information reports two AI developers are planning $125 billion supercomputers. North Dakota is in talks with two unnamed tech giants to build AI data centers with 500 to 1,000 megawatts of capacity, potentially scaling up to 5 to 10 gigawatts.
Canva is introducing new AI features but is raising subscription prices significantly, potentially impacting its user base and market positioning. “Some Canva Teams subscribers aren't happy about the eye-watering increase.”
MidJourney says it is venturing into hardware.
Preventative healthcare startup Neko Health is expanding to London. The Swedish company offers full-body scans and AI-powered insights to detect potential chronic disease conditions. It has garnered a waiting list of 22,000 people seeking its services.
You.com announced it has raised $50 million in a Series B funding round. Seeking a profitable niche, the AI search engine company is refocusing on answering complex questions that require research and analysis, rather than simple factual queries.
A startup focused on open-source AI agents for developers called All Hands AI has raised $5 million in seed funding. TechCrunch reports on the surge in investment for AI coding tools, highlighting the potential for AI to boost developer productivity, but also making for a crowded marketplace.
The UK's Competition and Markets Authority (CMA) has cleared Microsoft's acquisition of the Inflection AI team, stating that the deal doesn't raise competition concerns. However, the CMA considers the deal a "relevant merger situation," so regulators will review similar ‘acquihire’ deals in the future.
Coinbase demonstrated the first AI-to-AI crypto transaction at Coinbase Dev conference. AI agents can now use crypto wallets to transact with each other and humans. What could possibly go wrong?
Fortune 500 companies grow concerned about AI Regulations and AI risks: A report from Arize indicates a 500% increase since 2022 in the number of Fortune 500 companies citing AI regulations as a business risk, highlighting concerns about compliance and potential hindrances to AI development.
AI Opinions and Articles
What is the real impact of AI on our work? New research finds the real impact of AI on programmer productivity is a 26% increase in completed tasks.
The study evaluated the impact of generative AI on software developer productivity by analyzing data from three randomized controlled trials at major tech companies. The trials covered 4,867 coders in Fortune 100 firms using the now-dated GPT-3.5-powered GitHub Copilot. If even that could deliver a 26% boost, imagine using the latest AI coding agents that can take you from zero to basic web app in 10 minutes.
AI helps make professional workers more productive. I will let Ethan Mollick have the last word on it.
We now have randomized controlled trials showing large performance gains in real companies for coding, management, entrepreneurship, and writing using AI. - Ethan Mollick