AI Week In Review 24.03.02
Alibaba's EMO lip-syncing, Mistral Large, Ideogram 1.0, StarCoder2, Genie, Cosmopedia, STORM, Chunk Llama, 1.5 bit LLMs, Evo Genomics model, Elon sues OpenAI.
AI Tech and Product Releases
Alibaba this week shared EMO, a way to produce highly accurate lip-syncing videos from an image. EMO, which stands for Emote Portrait Alive, has amazing coherence and realism in lip-syncing, as the developers showed in some impressive demos. The paper “EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions” explains technical details, including being a diffusion model that was trained on 250 hours of video.
EMO lip-syncing is so convincing, it must be seen to be believed (or disbelieved).
Mistral released Mistral Large. Its quality approaches GPT-4; it is multilingual (strong on several European languages), strong at function-calling, and available on Azure as well as Mistral’s new chat interface, Le Chat. More details in our article “LLM update Q1 24: Mistral Large Enters Le Chat.”
Nvidia, Hugging Face, and ServiceNow have released StarCoder2, a suite of code generation models sized at 3B, 7B, and 15B parameters. The StarCoder2 models deliver “new standards for performance, transparency, and cost-effectiveness,” beating prior similarly sized coding models. This was achieved by training on over 600 programming languages using a new 900B-token code dataset called Stack v2.
Hugging Face released the Cosmopedia dataset v0.1 and related code. Cosmopedia is a synthetic dataset of 25 billion tokens across 30 million files, including textbooks, blog posts, stories, and WikiHow articles, making it the largest open synthetic dataset thus far. The content was generated by Mixtral 8x7B, mapping knowledge from web datasets such as RefinedWeb and RedPajama. This is useful because curated synthetic datasets can be a path to more efficient AI models, as the Phi models showed.
Shishir Patil, UC Berkeley PhD student and developer of the function-calling LLM Gorilla, announced the release of the live Berkeley Function-Calling Leaderboard. “Also debuting openfunctions-v2, the latest open-source SoTA function-calling model, on par with GPT-4. Native support for Javascript, Java, REST!”
Argilla released OpenHermesPreferences, an open synthetic dataset of ~1 million AI preferences that can be used for aligning and fine-tuning LLMs. It is a great example of open-source collaboration: it is derived from teknium/OpenHermes-2.5, then combined with responses from Mixtral-8x7B-Instruct-v0.1 and Nous-Hermes-2-Yi-34B, using PairRM to score and rank generated responses.
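As a minimal sketch of this kind of pipeline (with hypothetical response strings and stand-in scores in place of PairRM’s actual output), turning ranked candidate responses into chosen/rejected preference pairs might look like:

```python
# Hypothetical sketch: convert per-prompt candidate responses plus
# ranker scores (stand-ins for PairRM output) into chosen/rejected
# preference pairs suitable for DPO-style alignment fine-tuning.

def build_preference_pair(prompt, candidates, scores):
    """Pair the highest-scored response (chosen) with the lowest (rejected)."""
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    chosen, _ = ranked[0]
    rejected, _ = ranked[-1]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

example = build_preference_pair(
    "Explain overfitting in one sentence.",
    ["resp_from_openhermes", "resp_from_mixtral", "resp_from_yi34b"],
    [0.71, 0.93, 0.42],  # stand-in ranker scores, one per candidate
)
# example["chosen"] is the top-ranked response, example["rejected"] the lowest.
```

The real dataset keeps richer metadata per record, but the chosen/rejected structure is the core of what preference-tuning methods consume.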
In not-yet-released news, The Information reported that Meta, spooked by the backlash to Gemini, wants Llama3 to handle contentious issues better; as a consequence, the Llama3 release has been pushed back and is now said to be in July while they deal with alignment issues.
Top Tools & Hacks
Ideogram has announced their 1.0 release, and it’s a stunningly good AI image generation tool. It easily wins our “top tool” for this week.
Some of its great features:
Excellent prompt adherence and hyper-realism: “Ideogram 1.0 generates sharp, detailed images while understanding long, complex prompts.”
Incredible details and faithful rendering of AI-generated text.
A feature called Magic Prompt to help with creating prompts.
MattVidPro is “really really impressed” in his review of it, asking “Did this just win the crown?” It seems to beat DALL-E 3 and Midjourney v6 in overall user preferences, and its prompt adherence, image quality, and especially text rendering are excellent. It’s my new go-to for image generation.
AI Research News
Researchers from Google DeepMind present “Genie: Generative Interactive Environments,” a generative AI that creates virtual playable worlds from image or text prompts. They describe the 11B-parameter latent action model as “a foundation world model,” as it learns a latent action interface fully unsupervised from Internet videos.
Some ‘wow’ factors here: custom video games generated by AI on the fly are now possible; just as AI generates images and video, it can now generate fully playable worlds. Also, they say:
… the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.
Translation: Being able to generate virtual worlds like this facilitates training of AI agents and robotics systems.
STORM: Can we teach LLMs to write long articles from scratch, grounded in trustworthy sources? Yes, say the authors of “Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models.” They created and released STORM, a system that writes Wikipedia-like articles based on Internet search.
Chunk Llama, an extended-context Llama, comes from dual-chunk attention (DCA), a new way to scale LLMs to context lengths 8 times their original pre-training length, without training. As described in “Training-Free Long-Context Scaling of Large Language Models,” DCA decomposes the attention computation for long sequences into chunk-based modules to capture long context attentions.
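As a toy illustration of the chunking idea (this is a simplified sketch, not the paper’s exact formulation), one can remap query-key relative positions so that no distance fed to attention ever exceeds the pretrained window:

```python
# Toy sketch of chunk-based relative-position remapping, loosely
# inspired by dual-chunk attention (not the paper's exact scheme).
# Distances within a chunk stay as-is; distances across chunks are
# capped at the largest position seen during pre-training, so the
# model never has to extrapolate beyond its trained context.

def chunked_relative_position(q_idx, k_idx, chunk_size, max_pos):
    """Return a remapped relative distance for a (query, key) pair."""
    if q_idx // chunk_size == k_idx // chunk_size:
        # Intra-chunk: ordinary relative distance, always < chunk_size.
        return q_idx - k_idx
    # Inter-chunk: cap the distance at the pretrained maximum.
    return min(q_idx - k_idx, max_pos)

# With chunk_size=4 and max_pos=7: a raw distance of 10 (tokens 11 and 1)
# is capped at 7, while near neighbors keep their true distances.
```

The point of the decomposition is that every position index the model actually sees lies inside the range it was pre-trained on, which is why no further training is needed.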
A new paper announced “The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits.” They introduce BitNet b1.58, an LLM in which every parameter is ternary {-1, 0, 1}. Amazingly, they claim:
BitNet b1.58 matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance.
Achieving the same performance with far fewer bits of precision makes it significantly more efficient in latency, memory, and throughput. This could make LLMs radically more efficient and cost-effective, including radically simplifying LLM inference.
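The paper describes an absmean quantization function for producing the ternary weights; a minimal sketch of that idea (scale by the mean absolute value, then round and clip to {-1, 0, 1}) looks like:

```python
import numpy as np

# Sketch of absmean ternary quantization as described in the BitNet
# b1.58 paper: scale the weight matrix by its mean absolute value,
# then round and clip every entry to the ternary set {-1, 0, 1}.

def absmean_ternary(w, eps=1e-8):
    """Quantize a weight matrix to {-1, 0, 1} via absmean scaling."""
    gamma = np.mean(np.abs(w)) + eps          # per-tensor scale factor
    return np.clip(np.round(w / gamma), -1, 1)

w = np.array([[0.9, -0.05, -1.3],
              [0.2,  1.1,  -0.6]])
wq = absmean_ternary(w)
# Every entry of wq is now in {-1.0, 0.0, 1.0}; with ternary weights,
# matrix multiplication reduces to additions and subtractions.
```

Eliminating multiplications from the matmul is exactly where the latency and energy savings come from.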
From ArcInstitute, Evo: DNA foundation modeling from molecular to genome scale:
Evo is a long-context biological foundation model that generalizes across the fundamental languages of biology: DNA, RNA, and proteins. It is capable of both prediction tasks and generative design, from molecular to whole genome scale (over 650k tokens in length).
As an AI model, Evo is based on StripedHyena, an architecture that combines state-space models and Transformers, which enables it to handle long sequences; it has a context length of 131k tokens. Evo is an open model, available in a GitHub repository and on the Together playground. They are also open-sourcing their 300B-token training dataset, OpenGenome, “consisting of 2.7M publicly available genomes from prokaryotes.”
AI Business and Policy
“OpenAI, Inc. has been transformed into a closed-source de facto subsidiary of the largest technology company in the world: Microsoft. Under its new board, it is not just developing but is actually refining an AGI to maximize profits for Microsoft, rather than for the benefit of humanity.” — Elon Musk
Elon Musk dropped a bombshell lawsuit against OpenAI. Andrew Curran on X has details:
"Elon Musk has filed a lawsuit against OpenAI for breach of contract, breach of fiduciary duty and unfair business practices, and is asking for OpenAI to revert back to open source, and to share all its research for the benefit of humanity."
Elon Musk was the main benefactor of the original non-profit OpenAI. His suit argues that OpenAI’s latest works (Q* and more powerful LLMs) constitute AGI, and that OpenAI therefore must live up to the commitments around AGI in its founding agreement.
Does it have legs? Experts weigh in. If nothing else, this lawsuit exposes what I’m calling OpaqueAI - OpenAI’s lack of transparency on what they are really doing. It also interestingly may lead to a court defining what AGI means.
Klarna announced that their new AI Assistant chatbot is doing the work of 700 customer service agents, and doing it better. The chatbot handled 2.3M conversations in the last month, average time to resolution dropped from 11 minutes to 2 minutes, and the AI handles many additional languages. Customer ratings showed the AI was as good as human customer service reps.
Overall I'm optimistic about the implications of AI. But when we realized that the AI assistant is doing the equivalent work of 700 agents, we wanted to be honest about that and bring it to the attention of the wider world because this is just the beginning. I think it's better that society has this conversation as early as possible and already starts thinking about how to support people who are impacted in the short-term, even if the long-term impact will most likely be positive for society as a whole. - Klarna CEO on X
Tumblr and WordPress to Sell Users’ Data to Train AI Tools to OpenAI and Midjourney. This follows the recent news of Reddit selling data to Google. But what about personal user data? “Internal documents obtained by 404 Media show that Tumblr staff compiled users' data as part of a deal with Midjourney and OpenAI.” Oops.
Figure raised $675M at $2.6B Valuation and partnered with OpenAI to develop next generation AI models for robots.
The Ensuring Likeness Voice and Image Security (ELVIS) Act moved forward in Tennessee legislature after compelling personal testimony from prominent artist-songwriters.
“Every day, there are new stories about deepfakes and AI-cloned voices and images that manipulate someone’s likeness without their consent. This is not just a problem that affects celebrities, this is a human problem that affects us all. As a mother of three daughters, I am terrified by how this technology has been used to exploit teenagers.” — Natalie Grant
AI Opinions and Articles
Facebook Drops Anti-Scraping Lawsuit Against Bright Data (Guest Blog Post) explains why Facebook stopped fighting scrapers:
… more important to Meta than keeping others off their public data is having access to everyone else’s public data. Meta is concerned that their perceived hypocrisy on these issues might just work against them. Just last month, Meta had its success in prior scraping cases thrown back in their face in a trespass to chattels case. Perhaps they were worried here that success on appeal might do them more harm than good.
A Look Back …
These user-generated content sites once defined the human internet: Wikipedia, Reddit, StackOverflow, WordPress, Tumblr, YouTube. As AI takes over as an intermediary for information, questions arise: Are they becoming obsolete? Where will human user-generated content on the Internet be in 10 years?
Wikipedia, the free-content online encyclopedia, began with its first edit on January 15, 2001. The concept of an online encyclopedia was proposed even earlier, with the earliest known proposal made by Rick Gates in 1993. The idea specifically included that no central organization should control editing.
If AI depends on this prior user-generated content to create models, what happens when AI displaces it? Will the fall in usage of, say, StackOverflow, very helpful for AI coding models, doom future AI progress? One opinion:
I don't think StackOverflow will outright disappear. It will have to downsize by a large factor (80%?), but the need to get human-written answers to novel questions will always be there. You can't automate it away with current technology.
In my opinion, crowd-sourcing and user-generated content will still be created, because there is a human desire for self-expression. Whether it is on TikTok, Reddit, X, or StackOverflow, or, as this author does, on Substack, the human will always be in the creative loop.