AI Week In Review 24.11.02

Claude on desktop, ChatGPT Search, GitHub Copilot adds Claude 3.5 Sonnet, GitHub Spark, NotebookLlama, Grok adds vision, Recraft V3, ElevenLabs X to Voice, Suno Personas, Oasis video game, BlendBox.

Nov 02, 2024

Figure 1. Recraft V3 AI image generation is very photo-realistic, but can you spot the artifacts?

Top Tools & Hacks

Claude and ChatGPT now both offer dedicated desktop applications for Windows and Mac, while Perplexity offers a native macOS app.

Anthropic released Claude desktop apps for Mac and Windows. These beta-release apps bring Claude’s capabilities directly to the desktop but do not include the recently announced Computer Use feature. They are available for both free and premium Anthropic users.

Anthropic also added a dictation tool for their Claude mobile app. Dictation isn’t real-time conversation or full voice-mode but more like sending a voice message to the AI.

OpenAI announced that ChatGPT Advanced Voice is now available in the macOS and Windows desktop apps, making it a very capable Jarvis-like AI companion.

AI Tech and Product Releases

OpenAI released ChatGPT Search, which lets users access real-time information via web search directly in ChatGPT. This new search capability is an evolution of their prior SearchGPT beta. While ChatGPT Search is a simple, clean and useful addition, it still hallucinates. Comparing it with other AI-plus-search options, most reviewers find that Perplexity AI shows more accurate and comprehensive results as of now. ChatGPT Search is available on the ChatGPT website and in the desktop and mobile apps.

Speaking of search, Google announced that AI Overviews in Search is expanding to more than 100 countries. Google also added “Grounding with Google Search” to the Gemini API and Google AI Studio, so developers can ground their AI systems' responses in search results. Benefits include reduced hallucinations and more up-to-date information.

The world of AI Search may get further shaken up: Meta is reportedly building its own AI search engine to reduce its reliance on Google and Microsoft. Via The Information:

“Meta hopes to lower its reliance on Google Search and Microsoft’s Bing, which currently provide information about news, sports and stocks to people using Meta AI, according to a person who has spoken with the search engine team.”

At its Universe conference, GitHub announced a new AI-powered dev tool, Spark, and updates to GitHub Copilot. Copilot now gives developers access to Google's Gemini 1.5 Pro and Anthropic's Claude 3.5 Sonnet models, in addition to OpenAI models, for code generation. Copilot also now supports multi-file edits in VS Code, similar to Cursor, and faster code reviews with Workspace. GitHub Copilot extensions are planned for release in 2025.

GitHub Spark is a new AI tool for building AI applications that allows users to create micro-apps using natural language. GitHub bills it as the “AI Tool for the Next Billion Devs.” Put this together with the GitHub Copilot improvements, and GitHub is working hard to stay on top in the competitive AI coding tools space.

Meta has released NotebookLlama, an open-source AI podcasting tool similar to Google's NotebookLM, capable of converting text files into podcast-style summaries using Meta's Llama models. While a great offering (since it’s free and open source), TechCrunch notes:

The results don’t sound nearly as good as NotebookLM. In the NotebookLlama samples I’ve listened to, the voices have a very obvious robotic quality to them and tend to talk over each other at odd points.

NotebookLlama is free to download and use locally, in conjunction with Llama 3.1 LLMs and Suno's Bark audio generation model.

Figure 2. How NotebookLlama turns a document into a podcast, in 4 steps.

The Grok AI model now has image understanding. Catching up to other LLMs, Grok 2 can now understand and explain images, including memes and jokes. It’s an early version, so more improvements will be coming.

Recraft released Recraft V3, an AI image generation model. Recraft V3, code-named Red Panda, offers state-of-the-art photorealism and text adherence. It is truly SOTA, with a 72% win rate and the highest Elo score on the image generation leaderboard, beating out competitors including Black Forest Labs' Flux. One notable feature is that it creates SVG vector files that are editable with non-AI image editing tools.
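To put a win rate and an Elo score on the same footing, here is a minimal sketch of the standard Elo expected-score formula and its inverse. It assumes the usual 400-point logistic scale and treats the 72% figure as if it were a head-to-head result against a single typical opponent; the leaderboard's actual rating method is not described here, so the function names and the resulting number are illustrative only.

```python
import math


def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model (draws ignored)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_gap_for_win_rate(win_rate: float) -> float:
    """Rating gap at which `win_rate` equals the expected score (inverse of the formula above)."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


# Assumption: a 72% win rate against one typical opponent on a 400-point scale.
gap = elo_gap_for_win_rate(0.72)
print(f"Implied rating gap: {gap:.0f} Elo points")
```

Under those assumptions, a 72% win rate corresponds to a lead of roughly 164 rating points over the typical opponent.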

Figure 3. The Recraft V3 (Red Panda) AI model excels at generating text for images, making it perfect for applications like movie posters and similar visual content.

ElevenLabs now offers “X to Voice”, which generates a unique voice from your X/Twitter profile.

Suno introduced Personas: “Personas let you save the essence of a song - vocals, style, vibe - and reimagine it across your creations.” This is like having a style guide for a song. You bring the vocals or style from one song into a new song by importing the Persona.

DecartAI released Oasis, the first playable AI-generated video game, developed in collaboration with Etched. This groundbreaking game utilizes AI to create immersive and dynamic real-time environments, which look a lot like Minecraft. The difference is Oasis is generating the world on-the-fly with AI:

Oasis marks our initial foray into more complex interactive worlds and offers a glimpse of what we call and hope to coin, a “Generative Interactive Experience.”

Figure 4. Oasis is a Minecraft-like video game generated on the fly by AI.

The Oasis real-time simulated game environment has low resolution and memory issues, so it has far to go to be useful and ‘playable’. This release coincides with news that Decart, an Israeli AI company, emerged with $21 million in funding.

Open-source text-to-speech model MaskGCT has been released. MaskGCT offers zero-shot voice cloning, emotional text-to-speech, long-form synthesis, variable speed synthesis, and bilingual (Chinese & English) capabilities. It’s available on Hugging Face, either as a downloadable model or as a demo to try. A paper on MaskGCT was published on arXiv.

ZhipuAI launched GLM-4-Voice, an end-to-end Chinese/English speech AI model that provides direct speech understanding and generation, useful for real-time conversations. The GLM-4-Voice-9B model is open-sourced and readily available on Hugging Face for download.

Blockade Labs introduced BlendBox Alpha, a novel generative AI tool with an innovative UI. BlendBox allows users to create cut-out images, which they call ‘stickers,’ and seamlessly assemble and blend them into different backgrounds. BlendBox is a promising tool for image editing, manipulation, and generation.

Figure 5. BlendBox lets users assemble and blend image elements to make an image.

Apple confirmed general availability of iOS 18.1, iPadOS 18.1, and macOS Sequoia 15.1, featuring the first batch of Apple Intelligence tools such as writing aids and image cleanup. However, only select devices with advanced chips can access these AI features, and more advanced features like Genmoji and ChatGPT integration await the next release, iOS 18.2. Some users have been underwhelmed, given that more advanced Apple Intelligence features were touted at the last WWDC.

Google Maps is integrating Google’s AI model Gemini to offer users personalized recommendations for places to visit and improved navigation features. The updates include clearer lane guidance and the ability to ask AI-powered questions about locations, rolling out in the U.S. this week.

AI Research News

Meta released LongVU, an open-source vision-language model (VLM) designed for enhanced long video understanding. The research was shared in the paper “LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding.” LongVU compresses video information and fuses features using DINOv2 and SigLIP, then passes selected tokens to Qwen2 or Llama-3.2-3B for understanding. This handles long videos with impressive performance. The LongVU demo is available on Hugging Face.

OpenAI released a new factuality benchmark called SimpleQA. The goal of SimpleQA is to benchmark correctness on fact-seeking queries across a wide range of topics and be challenging for frontier models.
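As a rough illustration of how a fact-seeking benchmark like this can be scored, the sketch below tallies per-question grades into summary rates. The three grade labels, the `summarize` helper, and the sample grades are assumptions made up for illustration; this is not OpenAI's grader or data.

```python
from collections import Counter


def summarize(grades: list[str]) -> dict[str, float]:
    """Turn per-question grades into summary rates.

    Assumed (hypothetical) labels: "correct", "incorrect", "not_attempted".
    """
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # Share of all questions answered correctly.
        "overall_correct": counts["correct"] / total if total else 0.0,
        # Accuracy restricted to questions the model chose to answer.
        "correct_given_attempted": counts["correct"] / attempted if attempted else 0.0,
        # Share of questions the model declined to answer.
        "not_attempted_rate": counts["not_attempted"] / total if total else 0.0,
    }


# Made-up grades for five questions.
grades = ["correct", "incorrect", "not_attempted", "correct", "incorrect"]
summary = summarize(grades)
print(summary)
```

Separating accuracy on attempted questions from overall accuracy makes it visible when a model avoids wrong answers by declining to answer rather than by knowing more.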

MIT this week showcased a new model for training robots using a large-scale data approach akin to that used in training LLMs. The new architecture called Heterogeneous Pretrained Transformers (HPT) integrates information from various sensors and environments, aiming to create a universal robot brain requiring no additional training.

Our AI research highlights article for this week covered long context in LLMs and related topics:

LOGO: Long context alignment via efficient preference optimization
LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
HalluEditBench: Can Knowledge Editing Really Correct Hallucinations?
Contextual Document Embeddings

AI Business and Policy

Amazon CEO Andy Jassy hinted at an improved, “agentic” version of Alexa during the Q3 2024 earnings call. The new assistant will act on users' behalf and is being rebuilt with a new set of foundation models. However, the launch of improved Alexa has been delayed until 2025.

In an AMA, OpenAI CEO Sam Altman admitted compute capacity limits product releases. Struggles with compute infrastructure have delayed features like ChatGPT’s Advanced Voice Mode and DALL-E’s next release. Altman also notes that the company is working on an AI chip with Broadcom, expected by 2026.

Google CEO Sundar Pichai announced that more than 25% of Google's new code is now generated by AI, with human engineers overseeing the process.

Meta is partnering with GelSight and Wonik Robotics to commercialize advanced tactile sensors for AI robotics. The Digit 360 sensor, designed for scientists, uses an on-device AI chip to digitize touch signals with human-level sensing capabilities. Meta also collaborates on the Allegro Hand by Wonik, integrating these tactile sensors onto a robotic hand platform available starting next year.

AI startup Noma Security secured $25 million in Series A funding to develop tools to identify and mitigate cybersecurity vulnerabilities in AI applications.

Chinese researchers affiliated with the PLA used Meta’s Llama 2 AI model to develop a military-focused chatbot named ChatBIT for intelligence gathering and operational decision-making. This is not a very advanced use of the AI model, but it shows China’s military leveraging open AI models for defense, which could fuel debates on the risks and benefits of open-source AI.

The European Union has initiated a review of Nvidia’s $700 million acquisition of AI startup Run:ai following Italy's referral request under the EU Merger Regulation. This move could delay or complicate the transaction as the Commission investigates potential competitive issues within the EU.

Perplexity launched an Election Hub with data from AP and Democracy Works, providing live updates on races at state and national levels for the upcoming election.

AI Opinions and Articles

Larry Ellison is passionate about AI for healthcare and says Oracle is on a mission to transform healthcare. He shared a lot of specifics on where digitization and AI can accelerate progress, advance clinical trials, and improve outcomes. Along the way, he mentions how passwords will be replaced by biometrics and voice will be the interface to new AI-enabled medical systems.

I remain optimistic and excited and highly motivated as we tackle these huge [Healthcare and Medical] problems. I can't think of a better way to live a life. - Larry Ellison

AI Changes Everything
