Claude 3.7 Sonnet Unleashed

Anthropic releases a powerful Claude 3.7 Sonnet that’s great for coding, adds extended thinking, and releases Claude Code agent to further support AI coding.

Feb 26, 2025

A screenshot of a video game

AI-generated content may be incorrect. — Figure 1. Ozgrozer on X makes a virtual 3D city in one prompt: “OMG Claude Sonnet 3.7 is insane. I got this stunning 3D city with one shot. Look at the shadows and the way the day transitions. This is just awesome.”

Anthropic Releases Claude 3.7 Sonnet

Today, we’re announcing Claude 3.7 Sonnet, our most intelligent model to date and the first hybrid reasoning model on the market. - Anthropic

Anthropic released their latest and greatest model, Claude 3.7 Sonnet, a SOTA frontier AI model that boasts significant improvements over its predecessor and competitors. Claude 3.7 Sonnet excels particularly in coding, reasoning, and agentic tool use.

Key features, which we will cover further below:

Claude 3.7 Sonnet is a hybrid reasoning model that can answer queries either as a base model or using Extended Thinking for AI reasoning.
Claude 3.7 Sonnet excels in coding, achieving over 70% in SWE-bench.
Anthropic released a new command-line interface called Claude Code, Anthropic’s first agentic coding tool, in a limited research preview.

Benchmarks and The Base Model

Claude 3.7 Sonnet’s strongest area of performance is coding. Since its release last year, Claude 3.5 Sonnet has been the favorite AI model of many coders, used both for standalone code generation and in Cursor or other AI coding environments. Recent AI model releases such as o3-mini called that into question, but now Claude 3.7 Sonnet reclaimed that “Best Coding AI Model” title.

Claude 3.7 Sonnet shows a significant improvement over previous versions and other models on various benchmarks, including a 20% increase on SWE-bench, obtaining 70.3% score on SWE-bench with custom scaffolding.

A graph of different colored rectangular shapes

AI-generated content may be incorrect. — Figure 2. On SWE-bench verified, Claude 3.7 Sonnet performs at the SOTA for a base AI model and outperforms even o3-mini.

The ‘vibe check’ of initial reviews have been very positive, earning Claude 3.7 hype video titles like “CLINE + Claude Sonnet 3.7 Is Completely INSANE” and Claude 3.7 just dropped and it's insane (best code model ever), while All About AI tested Claude 3.7 on various coding, creative writing, and puzzle-solving tasks and found it impressive, expressing excitement about the new model.

The Claude 3.7 Sonnet has already been integrated in AI coding tools such as Cursor, representing a big step-up in capabilities. They also have improved the Claude.AI coding experience:

Our GitHub integration is now available on all Claude plans—enabling developers to connect their code repositories directly to Claude.

Another area where Claude 3.7 Sonnet excels is agentic tool use, outperforming others in real-world tasks involving API interaction and performing highly on TAU-bench, which tracks agentic use cases.

Claude 3.7 Sonnet is a solid AI model all around. Anthropic said they focused on making Claude 3.7 better on real-world coding than benchmarks, with some benchmarks only modestly improving over Claude 3.5, while other results are stellar.

A screenshot of a graph

AI-generated content may be incorrect. — Figure 3. Claude 3.7 Sonnet performance on a variety of benchmarks.

Extended Thinking

Every competitive frontier AI model in 2025 will have some form of AI reasoning and Claude 3.7 is no exception. Claude’s extended thinking provides chain-of-thought reasoning for improved accuracy and performance, using scaling of test-time compute.

The Claude extended thinking process is not hidden from users, but visible in raw form. Anthropic explained the benefits of this: Trust, alignment, and interest.

In the Claude.ai interface, there is a switch to turn on extended thinking. When using the API, users or developers can budget the amount of thinking used in a query by specifying a token limit.

How well does extended thinking perform? As with other evaluations, the more thinking tokens that are used, the more it improves results. In the GPQA benchmark, Claude 3.7 Sonnet obtained 68% without extended thinking, and improved it to a better-than-human 84% with extended thinking.

A graph with a line going up

AI-generated content may be incorrect. — Figure 4. Claude 3.7 Sonnet’s performance on AIME 2024 mathematics benchmark, scaling performance according to how many thinking tokens it uses per problem.

Claude’s extended thinking and agent training are a powerful combination in virtual tasks, which is shown by performing better on evaluations like OSWorld. They also give it a major boost in any interesting area: Video gameplay. They got Claude 3.7 Sonnet to play Pokémon, and it’s actually quite good.

Claude Code

Claude Code is an agentic coding tool that lives in your terminal and integrates directly with your development environment. The Claude Code environment allows you to efficiently generate and modify code through the command-line. As an interface, Claude Code reminds me of Aider (which is a great AI coding assistant):

Claude Code is an active collaborator that can search and read code, edit files, write and run tests, commit and push code to GitHub, and use command line tools—keeping you in the loop at every step.

A screenshot of a computer

AI-generated content may be incorrect. — Figure 6. Claude Coder is a simple command-line AI coding tool built on powerful Claude 3.7 Sonnet.

Claude Coder was released to a limited beta preview, and there is a waitlist to get on. It works on macOS, Ubuntu/Linux, or Windows WSL operating systems, so pretty much any developer could adopt it.

Claude AI models are currently used in most AI coding environments, such as Replit, Cursor, and Vercel; it’s an interesting decision for them to build their own interface and compete with AI coding assistants directly.

Pricing and Access

Access: Claude 3.7 Sonnet is now available on all Claude plans, including Free tier, but you need the Pro tier or other paid plans to access their Extended Thinking mode.

Pricing: Claude 3.7 via the API is not cheap, with pricing for Claude 3.7 at $3 / million token input and $15 / million token output. This is the same as Claude 3.5, and considerable higher than alternatives such as Gemini 2.0 Pro or even o3-mini.

A close-up of numbers

AI-generated content may be incorrect. — Figure 7. Pricing for Claude 3.7 Sonnet. As other AI models have come down in price, it remains relatively expensive.

Limitations: One feature Claude 3.7 currently lacks is access to live information from the web. It cannot compete on use cases requiring web access or knowledge of current events. Sonnet has a knowledge cut-off of October 2024.

Safety: Anthropic published a detailed system card for Claude 3.7 Sonnet, covering their safety evaluation results in several categories, showing progress in alignment, utility and safety. For example:

Claude 3.7 Sonnet makes more nuanced distinctions between harmful and benign requests, reducing unnecessary refusals by 45% compared to its predecessor.

Conclusion - Gen 3 Models

Overall, Claude 3.7 Sonnet is an excellent frontier AI model, state-of-the-art for reasoning, coding, and AI agent applications. Anthropic is likely focused on coding applications because that’s their bread-and-butter; their work on the Anthropic Economic Index shows that 37% of Claude conversations are computer programming-related.

With Grok-3 and Claude 3.7 Sonnet released, what are the next shoes to drop? Rumors abound that GPT-4.5 will be released this week, and Llama 4 will release soon.

Why are these major AI releases happening now? Multiple major AI labs spent 2024 building new AI compute capacity on the latest Nvidia GPUs and training AI models at a scale that was impossible before. Their efforts are now bearing fruit.

Ethan Mollick is calling these newly released AI models “Gen 3 models”, suggesting pre-training scaling beyond 10^26 FLOPs is yielding the next generation of base AI models beyond GPT-4. While we don’t know the dataset size or exact FLOPS used to train Grok-3 or Claude 3.7, they confirm one thing: Scaling still matters.

AI Changes Everything

Discussion about this post