GPT-4.5 - The “Vibe” Release
GPT-4.5 brings higher EQ, but not higher IQ, and a hefty price tag. Much effort for modest gains - is traditional pre-training scaling done?

GPT-4.5 Released
“our largest and best model for chat yet” - OpenAI on GPT-4.5
After long anticipation, OpenAI has released its next GPT model, GPT-4.5, its "best model yet" for chat purposes, emphasizing enhanced scaling of unsupervised learning that enables the model to better recognize patterns, draw connections, and generate creative insights.
The principal feature of GPT-4.5 is that it scaled up pre-training significantly from the previous GPT-4o model, and it has more parameters than any prior AI model. Massively expanding unsupervised learning gave GPT-4.5 a broader knowledge base, deeper world understanding, and better language fluidity than GPT-4o, its predecessor. As a result, it excels at general-purpose applications and has a more natural conversational feel with an improved personality.
GPT-4.5, like traditional LLMs, doesn't process information by breaking down complex problems step by step, so it is not a reasoning model like o1 or o3-mini. Instead, it responds like prior LLMs, based on language intuition drawn from its extensive training data.
GPT-4.5 is useful for tasks like refining writing and assisting with everyday problems, all while hallucinating (making factual errors) less frequently than prior models. While OpenAI portrays GPT-4.5 as a significant upgrade in knowledge and usability, with better language fluidity and nuance, it misses the mark on math and reasoning benchmarks. GPT-4.5 is also priced shockingly high, raising the question of whether this more polished conversational AI is worth it.
GPT-4.5 Benchmarks and Capabilities
"This model won't crush on benchmarks" - Sam Altman
GPT-4.5 has EQ (emotional intelligence) but not the IQ you might expect. It’s better at creative writing, more worldly, and packed with more innate knowledge, but when it comes to math and problem-solving, it lacks the native step-by-step reasoning of AI reasoning models. It trails not only reasoning models but also recent non-reasoning releases like DeepSeek’s V3 and Claude 3.7 Sonnet in non-thinking mode.
GPT-4.5’s benchmark performance is a mixed bag: it shows moderate gains over GPT-4o on academic benchmarks, trails AI reasoning models like o3-mini on math and reasoning, and shows solid but not leading coding capability compared with Claude 3.7 Sonnet or Grok-3:
On MMLU, GPT-4.5 scores 89.6% versus 88.7% for GPT-4o.
On GPQA, GPT-4.5 achieves 71.4%, a leap from GPT-4o’s 53.6% and beating Gemini 2.0 Pro (64.7%), but trailing Claude 3.7 Sonnet (78.2%) and o3-mini (77%).
On the AIME 2024 math benchmark, GPT-4.5 scores 36.7%, better than GPT-4o, but o3-mini scored 87.3% in “high” reasoning mode and Claude 3.7 Sonnet obtained 61% without thinking mode.
On the SWE-Bench Verified coding benchmark, GPT-4.5 scores 38.0%, better than GPT-4o (30.7%) but far behind Claude 3.7 Sonnet (70%). Notably, both the Grok-3 and Claude 3.7 Sonnet release announcements leaned into code generation use cases; GPT-4.5’s did not.
GPT-4.5 has enhanced multimodal understanding and can seamlessly handle combined text and image queries, scoring 74.4% on MMMU. Combining visual understanding with its vast knowledge marks a notable step up in multimodal AI capability.
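To make the idea of a combined text-and-image query concrete, here is a minimal sketch using the OpenAI Python SDK’s chat completions interface. The model identifier and the image URL are assumptions for illustration only; check OpenAI’s documentation for the exact model name available to your account.

```python
# Minimal sketch of a combined text + image query via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.5-preview",  # assumption: substitute the actual GPT-4.5 model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What architectural style is this building, and when was it popular?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/building.jpg"}},  # placeholder URL
            ],
        }
    ],
)

print(response.choices[0].message.content)
```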

If Claude 3.7 Sonnet is your smart CS major and technical writing assistant, GPT-4.5 is more like your liberal arts major: not as good at math or coding, but a worldly and articulate conversationalist.
This is why OpenAI leaned into marketing those ‘vibe’ strengths, pointing away from objective benchmarks and toward human preference for GPT-4.5's tone, clarity, and engagement over previous models. OpenAI stated that testers interacting with GPT-4.5 found it “feels more natural,” and that the model better follows user intent and demonstrates greater EQ (emotional intelligence) in its replies.
The Vibe Check
“The first model that feels like talking to a thoughtful person” - OpenAI CEO Sam Altman, on GPT-4.5
The AI community reactions, like the benchmarks, have been a mixed bag.
The positive reactions have centered on its improved language abilities. It’s been called a “Midjourney moment” for writing. AI commentator Andrew Curran said GPT-4.5 would “set new standards in writing and creative thought.” GPT-4.5’s conversational style is warmer and more intuitive, it has a nuanced understanding of context and user cues, and it shows improved visual understanding.
Andrej Karpathy said, “Everything is a little bit better and it's awesome, but also not exactly in ways that are trivial to point to,” and posted a thread of pairwise tone comparisons on X. But his own poll showed people often preferred GPT-4o over GPT-4.5.
Some negative reactions come from those upset with the model’s high price and the fact that it's only marginally better at most tasks, coding in particular. The knowledge cutoff is October 2023, which degrades its utility for coding with the latest APIs and libraries.
The main negative reaction is that GPT-4.5 is underwhelming compared with the expectations for the next generation GPT model. Live by the hype; die by the hype. Teknium is brutal:
Guys it’s been 2+ years and 1000s of times more capital has been deployed since GPT-4. What the hell happened?
GPT-4.5 Training
GPT-4.5 was trained by scaling up pre-training and following it with supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), similar to the training used for GPT-4o. OpenAI hasn’t shared exactly how big GPT-4.5 is, but it could be many trillions of parameters, given the cost they are charging for API access.
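The paragraph above names the standard three-stage recipe: large-scale pre-training, then SFT, then RLHF. The skeleton below is only a conceptual sketch of how those stages chain together; the function bodies are placeholders, and nothing here reflects OpenAI’s actual code, data, or hyperparameters.

```python
# Conceptual skeleton of the pipeline named above: pre-training -> SFT -> RLHF.
# Function bodies are placeholders; real systems train large transformers on vast datasets.

def pretrain(corpus: list[str]) -> dict:
    """Unsupervised learning: predict the next token over a huge text corpus."""
    return {"stage": "pretrained", "tokens_seen": sum(len(doc.split()) for doc in corpus)}

def supervised_finetune(model: dict, demonstrations: list[tuple[str, str]]) -> dict:
    """SFT: fine-tune on curated (prompt, ideal response) pairs."""
    return {**model, "stage": "sft", "demos": len(demonstrations)}

def rlhf(model: dict, preferences: list[tuple[str, str, str]]) -> dict:
    """RLHF: optimize against a reward model fit to human preference rankings."""
    return {**model, "stage": "rlhf", "preferences": len(preferences)}

base = pretrain(["large amounts of web text ..."])
sft_model = supervised_finetune(base, [("prompt", "ideal response")])
chat_model = rlhf(sft_model, [("prompt", "preferred reply", "rejected reply")])
print(chat_model)  # {'stage': 'rlhf', ...}
```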
OpenAI used enormous compute resources to train this model, even speaking of training runs spread across multiple data centers.
The release of GPT-4.5 explains the news articles last fall about OpenAI’s Orion model showing small gains over GPT-4o, because GPT-4.5 is the Orion model. Orion was a big AI model training effort started in early 2024 and originally intended to be the next-generation GPT model, GPT-5. But it fell short.
We now know why they were disappointed in Orion: As a stand-alone AI model, GPT-4.5 is only marginally better than GPT-4o; it’s not bad, but not a huge leap forward either. OpenAI didn’t release it earlier because it fell short of expectations, so OpenAI pursued the promising direction of AI reasoning models instead, while trying to figure out what to do with Orion.
However, the Claude 3.7 Sonnet and Grok-3 releases forced OpenAI to respond, so OpenAI released Orion / GPT-4.5 as a gap-filler until it releases GPT-5 later in the spring.
OpenAI says GPT-4.5 will serve as a solid foundation for future reasoning and tool-using agents once combined with advanced reasoning techniques. They will build GPT-5 on GPT-4.5 as a foundation.
Access and Pricing
GPT-4.5 is accessible to ChatGPT Pro users now, and OpenAI will roll it out to ChatGPT Plus and other paid plans next week.
The API pricing for GPT-4.5 is shockingly high compared to other AI models: $75 per million input tokens and $150 per million output tokens. This is 100 times the price of Gemini 2.0 Pro and 30 times what GPT-4o costs. While GPT-4.5 supports function calling and tool use, the API pricing makes any AI agent use prohibitively expensive. Don’t bother.
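To make that concrete, here is a back-of-the-envelope cost estimate using the GPT-4.5 prices quoted above. The workload numbers are hypothetical, and the GPT-4o figures ($2.50 input / $10 output per million tokens) are its approximate list prices at the time, included only for comparison.

```python
# Rough cost estimate using the GPT-4.5 API prices quoted above.
# Workload is hypothetical; GPT-4o rates are approximate list prices for comparison.

GPT45_INPUT_PER_M = 75.00    # USD per million input tokens
GPT45_OUTPUT_PER_M = 150.00  # USD per million output tokens
GPT4O_INPUT_PER_M = 2.50     # approximate GPT-4o list price
GPT4O_OUTPUT_PER_M = 10.00   # approximate GPT-4o list price

def cost(input_tokens: int, output_tokens: int, in_rate: float, out_rate: float) -> float:
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Hypothetical agent session: 50 calls, ~8k input and ~1k output tokens each.
in_tok, out_tok = 50 * 8_000, 50 * 1_000

print(f"GPT-4.5: ${cost(in_tok, out_tok, GPT45_INPUT_PER_M, GPT45_OUTPUT_PER_M):.2f}")  # ~$37.50
print(f"GPT-4o:  ${cost(in_tok, out_tok, GPT4O_INPUT_PER_M, GPT4O_OUTPUT_PER_M):.2f}")  # ~$1.50
```

At these rates, even a single modest agent session runs into tens of dollars, which is why agent use cases look impractical at this price point.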
OpenAI published the GPT-4.5 System Card, which shares results of AI safety testing but says little else about GPT-4.5’s training, architecture, and benchmarks.
Conclusion – The End of Unsupervised Scaling?
With every new order of magnitude of compute comes novel capabilities. GPT-4.5 is a model at the frontier of what is possible in unsupervised learning. – OpenAI
For ChatGPT users, GPT-4.5 is a nice upgrade from GPT-4o for some general-purpose AI tasks. Its more human-like EQ and creativity make it useful in coaching, education, and content creation tasks that benefit from a more understanding AI.
In terms of AI model competition, GPT-4.5 can be seen as OpenAI’s answer to Claude 3.7 Sonnet and Grok-3, reasserting its leadership in general-purpose AI. Each top-tier model has its strengths: Grok 3 with thinking stands out in advanced reasoning; Claude 3.7 Sonnet excels at coding; and GPT-4.5 delivers superior conversational skills and broad knowledge.
GPT-4.5 was trained by scaling pre-training significantly, but the payback from massively scaled AI training was modest, dealing a blow to the “scaling is all you need” thesis: if GPT-4.5 is bigger and was trained longer, why isn't it much better?
We know little about GPT-4.5’s training details, but its results indicate that bigger is not always much better; quality of data and quality of training also play a role. For example, it may be that in training GPT-4.5 to be a generalist model, OpenAI used insufficient high-quality coding-specific data, resulting in weaker coding abilities.
Built on pre-training scaling alone, GPT-4.5 is a transitional AI model, and OpenAI is pivoting in how its AI models will scale. OpenAI stated that “scaling unsupervised learning continues to drive AI progress,” but also said this is its last non-reasoning model.
Future frontier models will be AI reasoning models. They will scale on both pre-training and reasoning, integrating the knowledge and language faculty gains of unsupervised scaling with new reasoning abilities from reasoning alignment and test-time compute methods.
OpenAI will complete that transition with GPT-5, an AI model that scales both pre-training and reasoning, to be released later this spring. Memo to OpenAI: it had better be good.