Discussion about this post

Pawel Jozefiak:

The Opus 4.6 versus GPT-5.3-Codex framing misses what's actually happening beneath the model wars. I've been deploying both in production environments, and the interesting shift isn't in benchmark wins; it's in how enterprises make buying decisions. We're seeing 40% of enterprise apps integrating task-specific agents this year, up from basically zero meaningful deployment in 2024. That's not a model-quality story, that's an infrastructure-maturity story. The real competition isn't Anthropic vs OpenAI; it's Microsoft Copilot at $21-30/seat trying to justify that pricing against Salesforce Agentforce at $0.10/action.

I watched Klarna replace the work of 700 human agents with a multi-agent system handling 2.3M conversations, cutting resolution time from 11 minutes to 2. That's the metric that matters. Not which frontier model scores higher on coding benchmarks, but which stack lets you ship automation that actually works without a team of ML engineers babysitting it.
