Thinking as a Tool
Anthropic introduces the Think tool, improving AI agents by combining reasoning and tool use.

Anthropic Introduces the “Think” Tool
Anthropic has developed a novel way to combine tool use and thinking to further enhance AI agentic behaviors. In the blog article “The "think" tool: Enabling Claude to stop and think in complex tool use situations,” Anthropic announced and explained the new capability, along with the prompting techniques needed to invoke it.
The feature is the “Think” tool, which “creates dedicated space for structured thinking during complex tasks.” It allows Claude to pause mid-response via a tool call to check that it has all the necessary information, validate prior steps, and reason about key facts or actions.
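According to the blog post, the tool is defined like any other tool, with a single input: the thought itself. A minimal sketch of such a definition for the Anthropic Messages API might look like the following (the description text is paraphrased from the post, not quoted verbatim):

```python
# A minimal "think" tool definition for the Anthropic Messages API.
# The description text is paraphrased; the key point is that the tool's
# only input is the thought itself -- it has no side effects.
think_tool = {
    "name": "think",
    "description": (
        "Use this tool to think about something. It will not obtain new "
        "information or change anything; it only appends the thought to a "
        "log. Use it when complex reasoning or scratchpad memory is needed."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {
                "type": "string",
                "description": "A thought to think about.",
            }
        },
        "required": ["thought"],
    },
}
```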
Thinking in a tool call is related to but different from extended thinking in an AI reasoning model:
Extended thinking is all about what Claude does before it starts generating a response. With extended thinking, Claude deeply considers and iterates on its plan before taking action. The "think" tool in Claude, once it starts generating a response, adds a step to stop and think about whether it has all the information it needs to move forward.
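In API terms, the difference is visible in where each mechanism lives: extended thinking is a request-level setting, while the Think tool is just another entry in the tools list. A hedged sketch using Anthropic's Python SDK (the model name and token budgets are illustrative):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Extended thinking: a request-level budget spent *before* the response.
response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # illustrative model name
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Plan a multi-leg flight rebooking."}],
)

# The Think tool: just another tool the model may call *while* responding.
response = client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=2048,
    tools=[think_tool],  # the definition sketched above
    messages=[{"role": "user", "content": "Plan a multi-leg flight rebooking."}],
)
```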
The Think tool helps in “long chains of tool calls or multi-step conversations” by giving Claude room to validate earlier steps or process newly acquired information before completing the response.
How the Think Tool Works
The Think tool creates a dedicated space, or scratchpad, for structured thinking. When the tool is invoked, it adds structured thinking output to the scratchpad, which flows back into the context for further processing. These thoughts help Claude follow specific policies and make informed, consistent decisions, so it can more successfully complete complex agentic tasks.
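Because the tool has no real implementation, the agent loop only needs to acknowledge each think call and let the thought sit in the conversation context. A minimal sketch of such a loop (assuming the think_tool definition above; call_real_tool is a hypothetical dispatcher for whatever real tools the agent has):

```python
def run_agent(client, messages, tools):
    """Drive the conversation until the model stops requesting tools."""
    while True:
        response = client.messages.create(
            model="claude-3-7-sonnet-latest",  # illustrative model name
            max_tokens=2048,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response  # final answer
        # Echo the assistant turn (including its tool_use blocks) back first,
        # so the thought itself stays in context.
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            if block.name == "think":
                # No-op: the thought is already in context via the
                # tool_use block; just acknowledge it.
                output = "OK"
            else:
                output = call_real_tool(block.name, block.input)  # hypothetical
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": output,
            })
        messages.append({"role": "user", "content": results})
```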
Having the Think tool available by itself doesn’t do much. The AI model needs to invoke it in the right way, so it must be prompted to induce the right kinds of structured thinking steps. An example Anthropic shared is an airline reservation support use case, where a specific prompt directs the Think tool to check reservation policies against the actions the agent's tools are about to take.
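Anthropic's actual prompt is long and domain-specific; the fragment below is an illustrative sketch of the pattern only (the wording is an assumption, not the original prompt):

```python
# Illustrative system prompt fragment; the wording is an assumption,
# abbreviated from the pattern Anthropic describes for the airline domain.
SYSTEM_PROMPT = """Before taking any action or responding to the user
after receiving tool results, use the think tool to:
- List the specific airline policy rules that apply to the request
- Check that all required information (passenger, booking, dates) is collected
- Verify that the planned action complies with those rules

Example of thinking before modifying a reservation:
<think>
User wants to change flight X to date Y.
- Policy: changes allowed up to 24h before departure. Departure is in 3 days: OK.
- Fare class: basic economy cannot be changed. Need to check the fare class first.
</think>
"""
```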
Given that the Think tool has no implementation of its own, and the model is simply adding tokens to a scratchpad of thoughts, Simon Willison calls it a no-op:
The "think" tool is a no-op - there is no implementation, it just allows the model to use its existing training in terms of when-to-use-a-tool to stop and dump some additional thoughts into the context.
It is true that this is just more reasoning from the same AI model, but it interleaves thinking and tool-calling, which is not how current extended thinking works. Another way of looking at it is that it imbues tools with reasoning capabilities.
If the prompt is just a way to guide the AI model to reason more, it raises some questions, such as the one asked by Nick Dobos on X:
Is this basically a prompt hack to create extra-long reasoning chains?
He compares it to guidance on variable-length token chains. This prompt technique, like the “think step-by-step” prompt that invoked early versions of chain-of-thought in LLMs, encourages a new thinking process: having the right thought in the right place.
Benchmark Results
Giving Claude a simple 'think' tool dramatically improves instruction adherence and multi-step problem solving for agents. - Alex Albert, Anthropic
Anthropic shared benchmark use cases in airline booking and retail service support to show how the Think tool improves Claude's agentic tool use ability, including “following policies, making consistent decisions, and handling multi-step problems.”
They evaluated this on a benchmark called tau-bench (τ-bench). The evaluation metric used was pass^k, which measures the probability that all k independent trials of a given task succeed, averaged across all tasks. The pass^k metric rewards consistency and reliability, which are essential for customer service agents.
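Concretely, for a task with per-trial success probability p, pass^k is p^k, so an agent that succeeds only sometimes is penalized heavily as k grows. A hedged sketch of estimating the metric from recorded trials (the unbiased estimator mirrors the standard pass@k trick, requiring all k sampled trials to succeed; the (n, c) pairs are illustrative):

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased per-task estimate of pass^k: the probability that k trials
    drawn without replacement from n recorded trials (c of which succeeded)
    are all successes."""
    if k > n:
        raise ValueError("k cannot exceed the number of recorded trials")
    return comb(c, k) / comb(n, k)  # comb(c, k) is 0 when c < k

# Benchmark score: average the per-task estimates across all tasks.
tasks = [(8, 6), (8, 8), (8, 3)]  # illustrative (n, c) pairs per task
score = sum(pass_hat_k(n, c, k=4) for n, c in tasks) / len(tasks)
```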
Evaluations showed “significant performance improvements” when the Think tool was combined with optimized prompting in realistic customer service scenarios.
Specifically, on the airline domain use case, the Think tool with an optimized prompt achieved 0.570 on the pass^1 metric, compared to just 0.370 for the baseline: a relative improvement of (0.570 - 0.370) / 0.370 ≈ 54%. On the retail domain, the Think tool alone achieved 0.812, a modest improvement over the 0.783 baseline.

Use Cases
Anthropic states the Think tool is most beneficial for “tool output analysis, policy-heavy environments, and sequential decision-making.” It may not improve simpler instruction-following cases or non-sequential tool calls. Claude users have reported that it makes a positive difference.
The Anthropic article provides guidance on implementation, recommending strategic instructions in the system prompt that define domain-specific examples for Think tool use. The prompt is detailed and structured, which limits its usefulness to pre-defined tasks.
The following is an example of policy-adherence reasoning to ensure compliance with complex policies:
Human: I want to return this product I bought 45 days ago, but I lost the receipt. The package is opened. Can I still get a refund?
Claude: Let me check our return policy carefully. [Claude uses the think tool]
Think tool: I need to analyze this return request against our store policy:
- Item was purchased 45 days ago
- Customer has no receipt
- Package is opened
According to standard policy, returns require:
- Purchase within 30 days (this request is outside that window at 45 days)
- Original receipt (the customer doesn't have this)
- Unopened packaging for certain items (this package is opened)
Exceptions to consider:
- Store credit might be possible even without a receipt if the purchase can be verified in our system
- A manager override for the 45-day window might be possible
- Product category matters: electronics and clothing have different policies
- Loyalty program members may have extended return windows
I need to explain these policy details and suggest possible options.
[Claude responds with a policy-compliant answer]
Think Tool MCP Servers
Brandon Charleson mentions that the Think tool can be exposed as an MCP server, pointing to a Think tool MCP server implementation on GitHub by DannyMac180. Even before Anthropic shared the Think tool, there were reasoning MCP servers that let agents run complex reasoning as a sub-task.
With such a tool available via MCP, a system prompt is still needed to activate it. A user could prompt an AI agent to invoke it to validate steps in a workflow, for example, as a checking tool for policies.
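For reference, a think tool server is only a few lines with the official Python MCP SDK's FastMCP helper. This is a generic sketch, not DannyMac180's implementation:

```python
# A minimal think-tool MCP server using the official Python MCP SDK.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("think-tool")
thought_log: list[str] = []  # simple in-memory scratchpad

@mcp.tool()
def think(thought: str) -> str:
    """Append a thought to the scratchpad and acknowledge it.
    Obtains no new information and changes nothing else."""
    thought_log.append(thought)
    return "Thought recorded."

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```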
Conclusion
The Think tool, combined with optimized prompts that provide reasoning examples to guide it, significantly enhances Claude's performance in complex agentic tasks. Anthropic has shown that interleaving AI reasoning with tool-calling makes AI agents more intelligent and capable.
The method relies on prompting with domain-specific instructions, which is both limiting and raises some questions. What if this thinking between tool calls were native to the AI model? Can we train an AI reasoning model to think after it gets complex tool results?
The answer is yes: AI developers should go beyond prompting methods and train this behavior into future AI models. They might also wrap tools with a ‘reasoning’ layer that checks tool output and reasons about it.
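As a rough sketch of that idea, a wrapper could attach a reasoning nudge to every tool whose output needs checking (with_reasoning and search_flights below are hypothetical, not an Anthropic API feature):

```python
# Hypothetical sketch: wrap a tool so its output arrives with a nudge
# for the model to reason about it before acting.
def with_reasoning(tool_fn):
    def wrapped(*args, **kwargs):
        output = tool_fn(*args, **kwargs)
        return (
            f"{output}\n\n"
            "Before acting on this result, use the think tool to check it "
            "against the relevant policies and the user's request."
        )
    return wrapped

search_flights = with_reasoning(search_flights)  # hypothetical tool function
```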
This is not a Claude-specific feature. Reasoning can be added as a tool within any AI agent, and any AI reasoning model could be invoked in the Think tool. This is powerful enough that other AI models and agents will adopt it.
Future AI agents will interleave tool use and reasoning, combining them in new ways to ensure reliable and correct use in more complex use cases. Thinking as a tool is one more useful design pattern for AI agents.