OpenAI internal optimization cuts inference costs by half, running logged-out ChatGPT traffic on a couple hundred GPUs · Digg

Posts from X

Chubby♨️ (@kimmonismus)

OpenAI reportedly found new inference optimizations that more than halved the cost of running its models!

According to The Information, engineers told colleagues this month that the techniques helped power ChatGPT for visitors without free or paid accounts using only a couple hundred Nvidia GPUs at one point.

The exact method is unclear. It could involve quantization, KV caching, batching, routing simpler queries to cheaper models, or some mix of all of those.

The business angle is bigger than the technical detail: OpenAI ended Q1 with a 39% gross margin and wants to reach 52% by year-end. Lower inference costs give it room to improve margins, raise ChatGPT usage limits, or ease API pricing pressure on developers.

OpenAI's moat is increasingly becoming inference and cost advantage, especially against Anthropic.
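The margin point above can be made concrete with back-of-the-envelope arithmetic. This is a rough sketch that holds revenue fixed and starts from the reported 39% figure. How much of OpenAI's cost of revenue is inference is not public, so `inference_share` is a hypothetical parameter, and `new_gross_margin` and `required_cost_cut` are illustrative helpers, not anything from the reporting.

```python
"""Back-of-the-envelope gross-margin arithmetic (illustrative only).

Assumptions (not from the reporting): revenue stays fixed, and
`inference_share` is a hypothetical fraction of cost of revenue that is inference.
"""


def new_gross_margin(margin: float, inference_share: float, inference_cut: float) -> float:
    """Gross margin after reducing inference cost by `inference_cut` (0.5 = halved)."""
    cost_ratio = 1.0 - margin  # cost of revenue as a fraction of revenue
    new_cost_ratio = cost_ratio * (1.0 - inference_share * inference_cut)
    return 1.0 - new_cost_ratio


def required_cost_cut(margin: float, target_margin: float) -> float:
    """Fractional cut in total cost of revenue needed to move margin to target."""
    return 1.0 - (1.0 - target_margin) / (1.0 - margin)


if __name__ == "__main__":
    for share in (1.0, 0.5):  # hypothetical inference shares of cost of revenue
        m = new_gross_margin(0.39, share, 0.5)
        print(f"inference = {share:.0%} of cost, halved -> margin {m:.1%}")
    print(f"cut needed for 39% -> 52%: {required_cost_cut(0.39, 0.52):.1%}")
```

Suppose all cost of revenue were inference and it were halved. A 39% margin would then rise to about 69.5%. If inference were only half the cost base, the margin would rise to about 54%. Reaching 52% needs roughly a 21% cut in total cost of revenue. The real effect also depends on how much traffic the optimization actually covers.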

Lisan al Gaib (@scaling01)

yet they can't cut costs for users in half

intelligence too cheap to meter died long ago, when they saw their first billion

Stephanie Palazzolo (@steph_palazzolo)

OpenAI engineers earlier this month developed an optimization that cut inference costs in half for models it was applied to.

After the optimization was applied to logged-out ChatGPT traffic, it reduced the number of GPUs needed to power that traffic to a couple hundred.

Stephanie Palazzolo (@steph_palazzolo)

More on the optimization + what it could mean for OpenAI's gross margins or usage limits here:

https://www.theinformation.com/newsletters/ai-agenda/openai-discovers-new-way-cut-inference-costs-half

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex)

wow did they discover speculative decoding or something? margins go up again!
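Teortaxes's guess refers to speculative decoding. In this technique, a small draft model proposes several tokens and the large target model verifies them together. The sketch below is the simple greedy variant. It uses toy integer "models", and `toy_target`, `toy_draft`, and the other names are hypothetical. Nothing in the reporting confirms OpenAI used this technique. The sketch shows the key property: the output is identical to plain greedy decoding with the target model, but the target runs fewer sequential verification passes.

```python
"""Greedy speculative decoding with toy models (illustrative sketch).

The models here are hypothetical stand-ins: functions mapping a token context
to the next token greedily. Real systems use neural networks and verify all
drafted positions in a single batched forward pass of the target model.
"""
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one sequential target step per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Return (generated tokens, number of target verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. One target pass checks every proposed position (looped here for clarity).
        passes += 1
        ctx = list(out)
        for tok in proposal:
            expected = target(ctx)
            if expected != tok:
                ctx.append(expected)  # first mismatch: keep the target's token, drop the rest
                break
            ctx.append(tok)  # draft token agrees with the target, accept it
        else:
            ctx.append(target(ctx))  # all draft tokens accepted: target adds one more
        out = ctx
    return out[len(prompt):len(prompt) + max_new], passes


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # toy "large" model: count upward mod 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10  # toy draft, wrong after a 5


if __name__ == "__main__":
    spec, passes = speculative_decode(toy_target, toy_draft, [0], max_new=12)
    assert spec == greedy_decode(toy_target, [0], max_new=12)
    print(spec, "in", passes, "target passes")
```

With these toy models, 12 tokens take 4 verification passes instead of 12 sequential target steps. Real savings depend on how often the draft agrees with the target. Sampled (non-greedy) decoding needs a rejection-sampling acceptance rule instead of exact matching.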

Nathan Lambert (@natolambert)

@AndrewCurran_ probably under a narrow-ish set of circumstances or something, and then it gets reported like this

Andrew Curran (@AndrewCurran_)

OpenAI has found a way to cut inference costs in half.

Rohan Paul (@rohanpaul_ai)

The Information reports that OpenAI has cut inference costs by more than half on some existing models, while logged-out ChatGPT traffic ran on only a couple hundred Nvidia GPUs.

The obvious guesses include quantization, KV-cache changes, batching, speculative decoding, and routing easy queries cheaper.

If true, it will be a huge competitive lever: lower costs can raise margins, expand usage limits, or reduce pressure on API pricing.

For some context, OpenAI’s adjusted gross margin fell to 33% in 2025 from 40% in 2024, after inference costs quadrupled.

Some reporting now puts Q1-2026 at 39%, with a 52% target by year-end.

Anthropic looks similar at roughly 44%, so frontier labs remain far below mature software economics.

---

theinformation .com/newsletters/ai-agenda/openai-discovers-new-way-cut-inference-costs-half
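One of Rohan Paul's guesses is routing easy queries to a cheaper model. Its cost effect reduces to a weighted average. The sketch below uses made-up relative costs and a hypothetical word-count heuristic (`route`, `max_easy_words`). Production routers are typically learned classifiers, and nothing here reflects how OpenAI actually routes traffic.

```python
"""Cost of routing easy queries to a cheaper model (illustrative sketch).

All costs are hypothetical, relative to the large model's cost per request.
"""


def route(prompt: str, max_easy_words: int = 30) -> str:
    """Toy router: short prompts go to the small model. Real routers are learned."""
    return "small" if len(prompt.split()) <= max_easy_words else "large"


def blended_cost(easy_fraction: float, small_cost: float, large_cost: float = 1.0) -> float:
    """Average cost per request when `easy_fraction` of traffic uses the small model."""
    return easy_fraction * small_cost + (1.0 - easy_fraction) * large_cost


def easy_fraction_for_savings(savings: float, small_cost: float) -> float:
    """Fraction of traffic that must be routed to cut average cost by `savings`.

    Solves f * small_cost + (1 - f) * 1.0 = 1 - savings for f.
    """
    return savings / (1.0 - small_cost)


if __name__ == "__main__":
    queries = ["what is the capital of france", "summarize this 40-page contract " * 10]
    print([route(q) for q in queries])
    print("60% easy traffic at 0.1x cost:", blended_cost(0.6, 0.1))
    print("easy share needed to halve cost:", easy_fraction_for_savings(0.5, 0.1))
```

Take a small model that costs one tenth as much per request. To halve the average cost, about 56% of traffic would have to be routed to it. This ignores router overhead and any quality loss on misrouted queries.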

Solipsnitsyn (@solipsnitsyn)

@steph_palazzolo ohh so that's why 5.5 became so catastrophically stupid

Michael (@MichaelStolarz)

@steph_palazzolo worded a bit poorly: "visitors who didn't have a free or paid account". what kind of account type is left then? i'm guessing grant/research/trial? would be better to know for certain instead of inferring

🍓🍓🍓 (@iruletheworldmo)

@AndrewCurran_ do you know if this is part of the reported price wars?

The Information (@theinformation)

OpenAI engineers figured out a way to more than halve the cost of inference, thanks to some newly-discovered optimizations.

Read more in today's AI Agenda: https://thein.fo/4gjQ42l

Loquitur Ponte Sublicio (@loquitur_ponte)

@AndrewCurran_ How we train and run AI is probably really, really inefficient, given the make-it-up-as-we-went process.

Lots of gains to come from running the existing physical tech better, or from algorithmic improvements alone...

cheaty (@cheatyyyy)

@AndrewCurran_ awfully convenient timing is all i'm going to say, not doubting OpenAI at all but this is a hilarious coincidence

what better way to cut inference costs in half than to double throughput
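The quip about doubling throughput holds literally at fixed hardware cost. Cost per token is GPU cost per hour divided by tokens served per hour. Doubling per-GPU throughput therefore halves cost per token, or equivalently halves the GPUs needed for a fixed load. The dollar and throughput figures below are hypothetical placeholders, not reported numbers.

```python
"""Serving-cost arithmetic under hypothetical prices and throughputs."""
import math


def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_sec_per_gpu: float) -> float:
    """USD per 1M output tokens for a fully utilized GPU."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


def gpus_needed(load_tokens_per_sec: float, tokens_per_sec_per_gpu: float) -> int:
    """GPUs required to serve a steady load, rounded up."""
    return math.ceil(load_tokens_per_sec / tokens_per_sec_per_gpu)


if __name__ == "__main__":
    price = 2.0  # hypothetical $/GPU-hour
    load = 50_000  # hypothetical aggregate tokens/sec
    for tput in (1_000, 2_000):  # hypothetical tokens/sec per GPU, before and after
        print(tput, round(cost_per_million_tokens(price, tput), 3), gpus_needed(load, tput))
```

Real deployments never run at full utilization, and throughput gains from batching trade off against per-user latency. Treat this as an idealized estimate.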

Lisan al Gaib (@scaling01)

@RitsFur you mean like charging $30 for a model that costs like $2 to serve?

The Hero of KVcache (@HeroOfKVcache)

@steph_palazzolo >it's quantization again
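The "it's quantization again" reply refers to storing weights in 8-bit integers instead of 16-bit floats. This halves weight memory and can let the same traffic fit on fewer GPUs, at the cost of some rounding error. Below is a minimal symmetric per-tensor int8 sketch in pure Python. The function names are hypothetical. Real systems use per-channel or per-group scales and fused low-precision kernels. The reporting gives no indication that this is what OpenAI did.

```python
"""Symmetric per-tensor int8 quantization (textbook sketch, pure Python)."""
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Approximate the original floats from int8 codes and the scale."""
    return [v * scale for v in q]


def storage_bytes(n_weights: int, bits: int, n_scales: int = 0) -> int:
    """Bytes for n weights at `bits` each, plus one fp32 value per scale."""
    return n_weights * bits // 8 + 4 * n_scales


if __name__ == "__main__":
    w = [0.5, -1.27, 0.003, 1.0, -0.25]
    q, scale = quantize_int8(w)
    w_hat = dequantize_int8(q, scale)
    print("codes:", q)
    print("max error:", max(abs(a - b) for a, b in zip(w, w_hat)))
    n = 7_000_000_000  # hypothetical 7B-parameter model
    print("fp16 GB:", storage_bytes(n, 16) / 1e9, "int8 GB:", storage_bytes(n, 8, 1) / 1e9)
```

The rounding error per weight is at most half the scale, which is why outlier values (which inflate the scale) are the usual weak spot of per-tensor schemes.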

Ed Zitron (@edzitron)

@Jessicalessin @theinformation I think the story suggests this is explicitly for logged out users vs across the stack, especially given how casual the discussion is (“some colleagues”)

StolenAngel (@MoisasADR)

@AndrewCurran_ Did they explain how? My concern is that they might label some form of system quantization as "optimization."

Paweł J Lisowski (@PawelJLisowski)

@kimmonismus Getting harder and harder for Anthropic to justify those prices. I don't think the AI industry altogether is a bubble, but Anthropic and OpenAI are both starting to feel like it.

theo (@crthpl_)

@MichaelStolarz @steph_palazzolo no, people who are not logged in at all who just go to the website

Jessica Lessin (@Jessicalessin)

OpenAI Halves Inference Costs???

What do you AI experts think about this latest development: more incremental gains or something bigger?

Discuss over at @theinformation

https://www.theinformation.com/forum/posts/1205

Jessica Lessin (@Jessicalessin)

Um this seems big. @steph_palazzolo

https://www.theinformation.com/articles/openai-discovers-new-way-cut-inference-costs-half?utm_source=ti_app&rc=hwneun
