The one bench nobody wants to hill-climb
welcome to benchnerfing era, sonnet 5 weaker than sonnet 4.6
Anthropic rolled out Claude Sonnet 5 as its most capable agentic model yet, touting stronger reasoning and tool use than Sonnet 4.6 at a lower price point, but independent checks quickly surfaced regressions on benchmarks that fall outside the usual optimization targets.
A vocal AI developer highlighted weaker scores versus the previous Sonnet on non-targeted tests, prompting others to ask whether the new model even clears the bar set by GLM-5.2.
Some commentators argue the incremental gains do not justify the 5.0 label and might better suit a point release like 4.8, keeping the conversation focused on what actually moved forward.
Users in the replies dismiss Anthropic's Claude Sonnet 5 release as a nerfed downgrade that performs worse than prior models like Opus 4.8, and attribute it to marketing and greed rather than genuine improvements.
I knew it was going to be an insane nothingburger, because there's currently a soft ban on frontier capabilities
but I genuinely don't understand why they didn't call it Sonnet 4.8 or Sonnet 4.9, because this artificially nerfed piece of shit is not worthy of the 5.0 naming
Claude 5 has so far been the worst launch by Anthropic
Fable 5 isn't available and Sonnet 5 was nerfed to death
like does it even beat GLM-5.2?
but we all know that Anthropic's .5 iterations are the real generational jumps. so it's fine
Sonnet 3.5 >>> Sonnet 3
Sonnet 4.5 >>> Sonnet 4
Sonnet 5.5 >>> Sonnet 5
(in terms of relative improvements to the previous models, in raw scores they are obviously better)

turns out it does (barely) at like 3x the price
it gets even worse:
tl;dr: Sonnet 5 is cheaper per token, but more expensive per solved problem – and still lags behind Opus 4.8 in overall intelligence.
That's honestly disappointing and not a good release.
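
The tl;dr above rests on simple arithmetic: a lower per-token price can still mean a higher bill per solved problem if the model burns more tokens per attempt or solves fewer tasks. Here is a rough sketch of that calculation. The function name `cost_per_solved` and every number in it are made up for illustration; none of them are measured prices, token counts, or solve rates for any real model.

```python
def cost_per_solved(price_per_mtok: float, tokens_per_task: float, solve_rate: float) -> float:
    """Expected dollars spent per solved task.

    price_per_mtok  -- price in dollars per million tokens (hypothetical)
    tokens_per_task -- average tokens consumed per attempt (hypothetical)
    solve_rate      -- fraction of attempts that succeed, in (0, 1]
    """
    if not 0 < solve_rate <= 1:
        raise ValueError("solve_rate must be in (0, 1]")
    cost_per_attempt = price_per_mtok * tokens_per_task / 1_000_000
    # On average 1 / solve_rate attempts are paid for per success.
    return cost_per_attempt / solve_rate


# Illustrative numbers only, not real benchmark or pricing data.
# Model A: half the per-token price, but four times the tokens per task.
model_a = cost_per_solved(price_per_mtok=10.0, tokens_per_task=400_000, solve_rate=0.5)
# Model B: twice the per-token price, but far fewer tokens per task.
model_b = cost_per_solved(price_per_mtok=20.0, tokens_per_task=100_000, solve_rate=0.5)

print(f"Model A: ${model_a:.2f} per solved task")
print(f"Model B: ${model_b:.2f} per solved task")
```

With these invented inputs, the model that is cheaper per token ends up costing twice as much per solved task, which is the shape of the complaint in the tl;dr. Whether it applies to Sonnet 5 depends on real token usage and solve rates, which this sketch does not supply.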

@0xVita @scaling01 One of these days, I need to watch these movies.

@scaling01 feels like haiku-5 instead, I think they should have gone with that
roughly matching sonnet-4.6 perf on medium for half the cost is cool, lot of enterprises are going to love that

@scaling01 Thank god the mahdi is here to tell us the truth. We (@Presidentlin) were worried you got hit by a bus when you didn’t break the news first

@kimmonismus It's truly a horrible release. What's the point?

@scaling01 i think you mean best launch?

@scaling01 We have entered the age of artificially-limited frontier model regression.
We can thank the gov for that.
US is no longer going to be a safe haven for rapid AI innovation, unfortunately.

@scaling01 I agree, it was an enormous letdown. It doesn't reach Opus capabilities and it's in the opposite Pareto frontier quadrant, the worst quadrant in token efficiency at least for the benchmarks

@scaling01 This is probably haiku size, Opus 4.6 was the original Sonnet 5. They are just greedy

@teortaxesTex Liang Wenfeng if you can hear me obliterate the evil west V4.1 PRO MAX.finalversion to put an end to this charade

@scaling01 then people would ask for sonnet 5… best they just pump out this filler model now and focus their attention on the big guns again, who cares about anything other than the SOTA

@scaling01 A theory (idk how evidenced) I’ve heard is that the integer versions are new pretrains and the .x versions are RL checkpoints
That’d make sense, the sweet spot would be after a significant amount of RL from usage data on the new base model
That’d make way too much sense tho

@scaling01 I opened up the system card, saw this, and stopped giving a fuck, lol

@scaling01 by a hair

@scaling01 What do you mean nerfed? Safeguards or just generally a weak model?