Caveats Needed
I was going to add this as an addendum to Monday’s original post on Claude 3, “Claude 3 Released - It's A New AI Model Leader,” but I am making it a separate article since it highlights new information for those who read the original. This update adds more detail on what Claude 3 can do and how it performs in real-world use.
There’s always a risk in jumping on a new AI model release announcement before the model has really been put through its paces. Our day-of-release article called Claude 3 “A New AI Model Leader,” suggesting it was a GPT-4 beater, with the caveat that real-world use would be the real test.
We stand by our overall assessment, but as Google’s Gemini announcement also showed, you need to read the fine print on announcement claims. While Anthropic’s benchmarks show Claude 3 Opus beating GPT-4 across the board, they compare against the original GPT-4 numbers, not today’s GPT-4 Turbo, which has improved since GPT-4’s March 2023 release. There was a bit of benchmark sleight-of-hand.
Reviews and Real-World Usage
As people have tried it in the real world, the reviews have been a mix of solid results and worse-than-expected outputs. Claude 3 Opus is a very good model, better than GPT-4 in some (but not all) cases, but not a clear GPT-4 beater. Claude 3 Sonnet is also a very strong AI model, which makes it very appealing since it’s free on Anthropic’s chat interface and its API is low-cost.
Claude 3 does well on the ‘needle in a haystack’ long-context retrieval test, in one case even recognizing that an inserted needle sentence about pizza was out of place in the technical documentation it was buried in.
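For those who want to try this themselves, here is a minimal sketch of a needle-in-a-haystack test, assuming the anthropic Python SDK with an ANTHROPIC_API_KEY set in the environment; the filler text and needle sentence are illustrative stand-ins, not the actual test material:

```python
# A minimal sketch of a "needle in a haystack" retrieval test.
# The filler and needle below are illustrative, not the original test.
import anthropic

FILLER = ("The service validates each request, writes it to the queue, "
          "and records the outcome in the audit log. ") * 500
NEEDLE = "The best pizza topping is fresh basil added right after baking."

# Bury the needle roughly in the middle of the filler "documentation".
midpoint = len(FILLER) // 2
haystack = FILLER[:midpoint] + NEEDLE + " " + FILLER[midpoint:]

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": haystack + "\n\nWhat does this document say about pizza?",
    }],
)
# A strong long-context model should quote the needle, and may even
# note that it looks out of place in the surrounding documentation.
print(response.content[0].text)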
Ralph Brooks on X gave Claude 3 Opus and ChatGPT a finance question: “Claude does math as part of the prompt MUCH better than what I have seen in GPT-4 (where the calc is outsourced to python), but the reasoning isn't 100%.” Advantage Claude 3.
He also showed Claude 3 performing best among AI models at summarizing an AI research paper: “It writes formulas. It gives PyTorch code.” GPT-4 produced code too, but not correctly.
Some users and reviewers found Claude 3 failing some logic tests that GPT-4 gets right.
Multiple users report that Claude 3 does well at summarizing long papers and texts.
Claude 3 still refuses some requests it shouldn't refuse. Ray Fernando found Claude 3 refused to review Andreessen’s Techno-Optimist manifesto. Justin Hart found it refusing to aid in political marketing that involved the border issue. Despite some odd refusals, it doesn’t seem to be crippled by political shyness or correctness. Claude 3 will write Trump and Obama poems on command, and Matt Wolfe found it would give ‘balanced’ commentary on politics.
If you have used Claude 3 and have some feedback to share on it, leave a comment.
Context Windows and Deep Results
Multiple reviews show that assisting research is a strength of the Claude 3 AI models, with some remarkable results. One is an amazing result in algorithmic research: “Claude 3 Opus just reinvented this quantum algorithm from scratch in just 2 prompts.”
Another is how superbly Claude 3 Opus handled translation from Russian to Circassian: “Claude not only provided a perfect translation but also broke down the grammar & morphology.”
The user who tried this found that Claude performed as well as the custom AI models he had built for this translation task. He tried the same process on GPT-4, and it failed. (Note: He originally thought Claude 3 got its Circassian knowledge solely from information in the context window, but it turns out Claude 3 did have prior knowledge of the language.)
Part of the ‘magic’ of handling complex queries well is having a large context window, so a lot of specific information can be loaded directly into a powerful AI model. Querying against a large context input enables use cases that require a grasp of a lot of detail: 200,000 tokens is on the order of a couple of novels, a whole textbook, or a sizable codebase.
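As a rough sanity check on those sizes (using common rules of thumb of about 4 characters and 0.75 words per token, not Claude’s actual tokenizer):

```python
# Back-of-the-envelope sizing for a 200K-token context window.
# The per-token ratios are common heuristics for English prose;
# actual counts depend on the model's tokenizer.
CONTEXT_TOKENS = 200_000
CHARS_PER_TOKEN = 4
WORDS_PER_TOKEN = 0.75
TYPICAL_NOVEL_WORDS = 90_000

approx_chars = CONTEXT_TOKENS * CHARS_PER_TOKEN   # ~800,000 characters
approx_words = CONTEXT_TOKENS * WORDS_PER_TOKEN   # ~150,000 words
print(f"~{approx_words:,.0f} words, roughly "
      f"{approx_words / TYPICAL_NOVEL_WORDS:.1f} typical novels")
```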
It will get even better. GPT-4 Turbo currently offers a 128K-token context window, and Anthropic says the Claude 3 models can accept inputs exceeding one million tokens for select customers, matching what Gemini 1.5 Pro can do.
Large context has profound implications for how deep AI models can go on problems. Large in-context memory could displace some uses of fine-tuned AI models, and it enables much more personalized AI assistants ‘on the fly’.
Claude 3 is Great at Programming
This last point also has implications for AI coding. Specifically, loading a large programming codebase into context for an AI model to utilize makes for a very effective AI coding assistant.
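A minimal sketch of what that looks like in practice, again assuming the anthropic Python SDK; the project path and the question are illustrative, and a real repository may need filtering to fit within the 200K-token window:

```python
# Concatenate a (small) codebase into the prompt and ask questions about it.
# "my_project" and the question are hypothetical examples.
from pathlib import Path
import anthropic

repo = Path("my_project")  # hypothetical project directory
sources = [f"### {path}\n{path.read_text()}"
           for path in sorted(repo.rglob("*.py"))]
codebase = "\n\n".join(sources)

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": (f"Here is a codebase:\n\n{codebase}\n\n"
                    "Where is the retry logic implemented, and how would you "
                    "add exponential backoff to it? Show complete code."),
    }],
)
print(response.content[0].text)
```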
An Anthropic engineer showed how Claude 3 Opus wrote a JavaScript D3 visualization of a self-portrait, then rendered the code into the visualization that became our Figure 1 cover art. She also made a Claude 3 code-to-visualization demo of creating “a visual scrolly to explain backprop and Kolmogorov’s complexity concepts for beginners in ML with simple three.js animations.” Claude 3 can assist with both data science and the code-generated visualization of it.
Reviewers such as Fireship and Matt Wolfe have gotten good results on programming tasks. It handles working with libraries well. Owen Colegrove found that using Claude 3 for programming has been ‘a step up’:
I've been programming w/ Claude-3 Opus today, it feels like a step up from gpt-4. It is often providing better code and w/out abbreviation so that I can immediately copy + paste the output.
Ben DeKraker asks of two code snippets (A and B) shared in the figure below: “Based on this snippet, which do you think is Opus and which do you think is GPT?”
His own reply: “B is Opus -- more complete and more thorough. Multiply this example x100 times and it really adds up.”
Other users chimed in about GPT-4’s “laziness” in not providing all code. Another user replied on X:
I've had Opus generate me 200+ lines of non-stop code and it *really* makes a difference. Unsubscribed from ChatGPT specifically because of its laziness.
The general reaction seems to be that Claude 3 Opus generates more complete and cleaner code, and that you don’t have to explicitly ask for a full solution. One user put it this way:
Claude 3 has been insanely good for code for me so far, significantly better than gpt4. Hopefully causes OpenAI to up their game.
Bottom line: for many users, Claude 3 Opus is providing better AI code generation than they can get from GPT-4, which makes Opus possibly the best coding assistant out there right now.
Summary - The Power of Three
Putting these takeaways, reviews, and user experiences together:
The Claude 3 release announcement made some bold claims that, like Gemini 1.0 Ultra’s, needed to be taken with a grain of salt. That said, Claude 3 is a strong AI model suite that in many respects lives up to the claims.
Claude 3 Opus is not beating the pants off GPT-4; it is better than GPT-4 in some cases and less preferred in others. On the LMSYS Chatbot Arena leaderboard, Claude 3 Opus (at 1233) is currently close behind the reigning leader GPT-4 (1251).
Claude 3 Opus really shines in some important use cases, such as deep research, large corpus summarization, and generating code. Thus, it will be uniquely useful for some users.
Claude 3 Sonnet is quite good as well, below but near GPT-4 performance (1180 on the Chatbot Arena leaderboard). Since accessing Claude 3 Opus requires a $20-a-month “Claude Pro” subscription, sticking with the free-tier Claude 3 Sonnet may work best for many. I’ll be trying it for coding and summarization tasks.
Sam Witteveen called Haiku the ‘star of the show’ because it’s better than GPT-3.5 yet much cheaper and faster. So, depending on your use cases, any or all three of the Claude 3 AI models could be of use.