The reasoning tax

We added extended thinking to one routing call recently, in the middle of building an AI assistant for SupplyIQ, an AI-powered analytics and surveillance platform. The assistant talks to EPCIS data: shipment events, location events, temperature readings, aggregation records across a distribution network. Ask it where a pallet is, it queries the database and answers in text. Ask it for cold chain compliance across thirty distribution centers, it queries, calculates, and picks the right output: a heatmap by region, a Sankey tracing product flow, a line chart over time, or a table of violations. The agent decides which. Ask it something ambiguous, and an agent has to figure out what kind of question it even is before it can answer anything.

Response time on that routing call jumped more than tenfold.

Streaming to reduce perceived wait time

We tried streaming to make the wait feel shorter. It helped much less than we expected, and the reason matters.

Streaming with reasoning models works at the protocol level. The API sends tokens as they’re generated. But during the thinking phase, those tokens are internal. The model is reasoning through the problem in a scratchpad that isn’t meant to be shown to users. That scratchpad runs for ten, fifteen, twenty seconds before any visible response starts appearing. So from a UX standpoint, you get a spinner for however long the thinking takes, then text begins. If thinking takes 18 seconds, streaming buys you nothing except making the last few seconds feel slightly smoother.
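To make the arithmetic concrete, here is a minimal sketch of a streamed response timeline. The `StreamEvent` type, the event kinds, and the 4 seconds of visible generation are assumptions for illustration; only the 18 seconds of thinking comes from the example above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StreamEvent:
    kind: str          # "thinking" (hidden scratchpad) or "text" (visible answer); hypothetical labels
    at_seconds: float  # seconds since the request started


def time_to_first_visible(events: List[StreamEvent]) -> Optional[float]:
    """Return when the first user-visible token arrives, or None if there is none."""
    for event in events:
        if event.kind == "text":
            return event.at_seconds
    return None


# Assumed timeline: 18 s of hidden thinking, then 4 s of visible text.
events = [StreamEvent("thinking", float(t)) for t in range(18)]
events += [StreamEvent("text", 18.0 + t) for t in range(5)]

first_visible = time_to_first_visible(events)  # wait with streaming
total = events[-1].at_seconds                  # wait without streaming
print(f"first visible token at {first_visible}s, full response at {total}s")
```

Under these assumptions streaming moves the first visible output from 22 seconds to 18. The thinking phase is untouched.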

The mistake is treating this as a streaming problem. It’s a latency-source problem. Streaming fixes generation latency. It doesn’t touch thinking latency. These are different things, and conflating them means you keep trying streaming solutions on a problem that streaming can’t solve. What we landed on was a progress indicator with specific status messages during the thinking phase (“searching events,” “aggregating by location,” “building chart”), so users knew the app was working. Not fast, but at least legible.
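A rough sketch of that status indicator, assuming the pipeline can report which stage it is in. The stage names and the fallback message are hypothetical; the three status strings are the ones quoted above.

```python
from typing import Iterable, Iterator

# Hypothetical stage names mapped to the status messages shown to users.
STATUS_MESSAGES = {
    "query": "searching events",
    "aggregate": "aggregating by location",
    "render": "building chart",
}


def status_updates(stages: Iterable[str]) -> Iterator[str]:
    """Yield one user-facing status line per pipeline stage."""
    for stage in stages:
        # Unknown stages fall back to a generic message rather than failing.
        yield STATUS_MESSAGES.get(stage, "working")


for message in status_updates(["query", "aggregate", "render"]):
    print(message)
```

The point of the mapping is that the messages describe real work in progress, which is what makes the wait legible rather than just animated.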

The routing decision is where reasoning earns its cost

Our assistant has an agent whose only job is to look at an incoming question and decide three things: what kind of question is this, what tool should answer it, and what should the output look like.

“Where is shipment XYZ right now?” is a lookup. Call the EPCIS query tool, return a location. Text response, maybe a map.

“How many pallets arrived at Chicago last week?” is a calculation. Query events, filter by location and time window, aggregate. Return a number, probably as a stat card.

“Show me cold chain compliance across all distribution centers for the last 30 days” is something else entirely. It needs a query, aggregation, business logic around what counts as a violation, and then a chart decision. A heatmap showing which facilities are struggling? A bar chart ranking sites by violation count? A line chart tracking compliance over time? The answer depends on what the data shows and what the user is actually trying to understand.

That last category is where we got burned. The routing agent kept misclassifying multi-step questions as lookups, because they were phrased like lookups. “What’s the temperature variance on batch 442?” sounds like a question with a single answer. It’s actually a multi-step aggregation with a chart at the end. A standard model got this wrong about 30% of the time. Extended thinking brought misclassification down to around 5%. The improvement was real, and for a supply chain application where a wrong answer could mean missing a cold chain breach, it mattered.
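The three things the routing agent decides can be sketched as a small data structure, along with the kind of check used to measure misclassification. All names here (`QuestionKind`, `RouteDecision`, the tool and output labels) are hypothetical, and the sample data is a toy set, not our evaluation results.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class QuestionKind(Enum):
    LOOKUP = "lookup"            # "Where is shipment XYZ right now?"
    CALCULATION = "calculation"  # "How many pallets arrived at Chicago last week?"
    ANALYSIS = "analysis"        # multi-step aggregation ending in a chart


@dataclass(frozen=True)
class RouteDecision:
    kind: QuestionKind  # what kind of question this is
    tool: str           # which tool should answer it (hypothetical names)
    output: str         # what the output should look like


def misclassification_rate(
    predicted: Sequence[QuestionKind], expected: Sequence[QuestionKind]
) -> float:
    """Fraction of questions whose predicted kind differs from the labeled kind."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty sequences of equal length")
    wrong = sum(p != e for p, e in zip(predicted, expected))
    return wrong / len(expected)


# The batch 442 question looks like a lookup but should route as an analysis.
decision = RouteDecision(QuestionKind.ANALYSIS, "epcis_aggregate", "line_chart")

# Toy labels for illustration only: one of four questions is misrouted.
expected = [QuestionKind.LOOKUP, QuestionKind.CALCULATION,
            QuestionKind.ANALYSIS, QuestionKind.ANALYSIS]
predicted = [QuestionKind.LOOKUP, QuestionKind.CALCULATION,
             QuestionKind.ANALYSIS, QuestionKind.LOOKUP]
print(misclassification_rate(predicted, expected))
```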

But we’d applied reasoning everywhere, not just to routing.

Where reasoning doesn’t belong

Once the routing agent decides “this is a compliance calculation, render a bar chart by facility,” the steps downstream are structured execution. Query the EPCIS events, apply the filter, aggregate violations, build the chart config. These are decisions with obvious right answers once the routing is done. Throwing extended thinking at them is paying for deep analysis on problems that have no ambiguity.

We split the pipeline. The routing and intent-classification steps use extended thinking. Everything downstream uses a standard model. Cost per query dropped by about 60% compared to our first version where reasoning ran everywhere. Latency dropped by a similar proportion on the steps that no longer ran with reasoning.
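One way to express that split is a per-step configuration that turns reasoning on only where it is wanted. This is a sketch under assumptions: the step names, the option keys, and the default budget are hypothetical and do not correspond to any particular provider's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class StepConfig:
    name: str
    use_reasoning: bool


# Hypothetical step names; only the first two carry real ambiguity.
PIPELINE: List[StepConfig] = [
    StepConfig("intent_classification", use_reasoning=True),
    StepConfig("routing", use_reasoning=True),
    StepConfig("epcis_query", use_reasoning=False),
    StepConfig("aggregation", use_reasoning=False),
    StepConfig("visualization", use_reasoning=False),
]


def request_options(
    step: StepConfig, thinking_budget_tokens: int = 4000
) -> Dict[str, Union[bool, int]]:
    """Build per-call options; the keys are placeholders for whatever the model API expects."""
    if step.use_reasoning:
        return {"reasoning": True, "thinking_budget_tokens": thinking_budget_tokens}
    return {"reasoning": False}


for step in PIPELINE:
    print(step.name, request_options(step))
```

Keeping the decision in configuration rather than scattered through the call sites makes it easy to see, and to audit, which steps pay for reasoning.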

The visualization agent specifically does not need reasoning. Its input is already structured: here’s the data, here’s the chart type the routing agent decided on, render it. The interesting decision was already made upstream. Reasoning there adds latency and cost with no quality benefit.

The rule we settled on: use reasoning where the cost of misclassification is high and where the correct answer isn’t obvious from the surface of the question. The routing decision in our application qualifies on both counts. The chart rendering step qualifies on neither.

That’s where the reasoning tax makes sense to pay. Everywhere else, don’t.