RAG doesn't do math

Sometimes a user asks something like: “We have all our sales data sitting in our warehouse. Can we use RAG so the team can ask things like what’s the average deal size in Europe last quarter?”

The short answer is no, not the way you’re thinking. Here is the longer answer.

What RAG does, and what it doesn’t

RAG is two things glued together. The retrieval part finds documents (or chunks, or rows, or whatever) that look relevant to the question. The generation part hands those documents to an LLM, which writes an answer based on them.

The LLM is the only thing producing the answer. Retrieval just decides what context the LLM gets to see.

This works fine when the answer is already in the retrieved content. Take “What does our refund policy say about damaged items?” The retrieved chunk contains the answer, the LLM rephrases it, and the model doesn’t have to compute anything.
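To make the two halves concrete, here is a minimal sketch of that flow. Everything in it is a stand-in: the three policy strings are made-up toy data, `retrieve` ranks by plain word overlap instead of embeddings, and `llm` is whatever callable you would plug a real model into.

```python
import re
from typing import Callable

# Hypothetical toy knowledge base; a real system would use a vector store.
DOCUMENTS = [
    "Refund policy: damaged items can be returned within 30 days for a full refund.",
    "Shipping policy: orders ship within 2 business days.",
    "Privacy policy: we never sell customer data.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the question.

    Word overlap is a crude stand-in for embedding similarity.
    """
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Put the retrieved chunks in front of the question."""
    context = "\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


def answer(question: str, llm: Callable[[str], str]) -> str:
    """Retrieval picks the context; the LLM alone produces the answer."""
    chunks = retrieve(question, DOCUMENTS)
    return llm(build_prompt(question, chunks))
```

Note that nothing in this loop calculates anything. Retrieval selects text, and the model writes text.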

When the answer requires arithmetic, this falls apart.

LLMs are bad at math, and worse at admitting it

If you hand an LLM a hundred rows of sales data and ask it for the average, it does not actually compute an average. It generates, token by token, a number that is plausible given the figures in its context: a number that “looks right.” The result might be close. It might be wildly wrong. The model has no way to tell, and neither does the user.

The failure mode is worse than just inaccuracy. The model returns a confident-sounding answer with a specific number. Users assume that number was calculated. They make decisions based on it. By the time someone checks the math, the wrong figure has already been quoted in a meeting.

I’ve seen this happen with totals (off by 30%), with averages (off by an order of magnitude), and with counts where the model just listed items from memory and reported the length. It happens with every frontier model. It happens after fine-tuning. The problem is structural: the model predicts text, it does not execute arithmetic.

What actually works

For numerical questions, the right pattern is not RAG. It’s tool use.

The user asks “what’s the average deal size in Europe last quarter.” The LLM does not try to compute the answer. Instead, it generates a SQL query (or calls a function, or writes a Python snippet) that would compute the answer. That query runs against the actual database. The result comes back. The LLM then writes a sentence around the result.

User question
   ↓
LLM (writes SQL)
   ↓
Database (executes the query)
   ↓
LLM (writes a sentence around the result)

The model does what it’s good at, which is writing English and writing SQL. The database does what it’s good at, which is computing. Nothing pretends to be doing something it can’t.
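Here is a rough sketch of that loop using Python's built-in `sqlite3` module. The `deals` table, its columns, and the sample rows are invented for illustration, and `fake_llm_write_sql` is a hard-coded stand-in for the model call. The final sentence is a plain template here, where a real system would have the LLM phrase it.

```python
import sqlite3
from typing import Callable


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical deals table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE deals (region TEXT, quarter TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO deals VALUES (?, ?, ?)",
        [
            ("Europe", "2024-Q1", 10000.0),
            ("Europe", "2024-Q1", 20000.0),
            ("Europe", "2024-Q1", 30000.0),
            ("Americas", "2024-Q1", 90000.0),
            ("Europe", "2023-Q4", 5000.0),
        ],
    )
    return conn


def fake_llm_write_sql(question: str) -> str:
    """Stand-in for the model: a real system would prompt an LLM with the schema."""
    return (
        "SELECT AVG(amount) FROM deals "
        "WHERE region = 'Europe' AND quarter = '2024-Q1'"
    )


def run_read_only(conn: sqlite3.Connection, sql: str) -> float:
    """Execute the generated SQL, refusing anything that is not a SELECT."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchone()[0]


def answer_numeric(
    question: str,
    conn: sqlite3.Connection,
    write_sql: Callable[[str], str],
) -> str:
    """The model writes the query; the database produces the number."""
    sql = write_sql(question)
    value = run_read_only(conn, sql)
    # Keep the query next to the number so the figure can be reproduced later.
    return f"The average deal size was {value:,.2f} (computed by: {sql})"
```

The `startswith("select")` check is only a token gesture at safety. In practice you would likely also run the query under a read-only database role.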

This pattern goes by several names: text-to-SQL, function calling with a math tool, code interpreter mode. They’re all variants of the same idea: get the LLM out of the calculation business.

When you genuinely need both

Plenty of real questions need retrieval and computation together. “What does our refund policy say, and how many refunds did we actually process last month?” That’s RAG for the first half and SQL for the second.

A production system needs to route. A small classifier (or the LLM itself in a planner step) decides whether the question is best answered by retrieving documents, running a query, or some combination. The orchestration is the actual product. The model is just one part of it.
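As a sketch of what the routing step might look like, here is a deliberately naive keyword router. The `Route` enum and both cue lists are assumptions made up for this example. A real system would more likely use a trained classifier or an LLM planner step, since substring matching misfires easily ("count" also matches "account").

```python
from enum import Enum


class Route(Enum):
    DOCUMENTS = "documents"  # answer from retrieved text (RAG)
    QUERY = "query"          # answer by running a database query
    BOTH = "both"            # split the question and do each half


# Hypothetical cue lists, chosen only to make the example work.
NUMERIC_CUES = ("how many", "average", "total", "sum", "count", "percent")
DOCUMENT_CUES = ("policy", "say", "explain", "describe", "according to")


def route(question: str) -> Route:
    """Pick a path based on which kinds of cue words appear in the question."""
    q = question.lower()
    needs_query = any(cue in q for cue in NUMERIC_CUES)
    needs_docs = any(cue in q for cue in DOCUMENT_CUES)
    if needs_query and needs_docs:
        return Route.BOTH
    if needs_query:
        return Route.QUERY
    # Default to retrieval when no numeric cue is present.
    return Route.DOCUMENTS
```

With these cue lists, the refund question above that mixes policy and counts routes to `Route.BOTH`, and the average-deal-size question routes to `Route.QUERY`.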

The takeaway

When someone asks whether RAG can do their analytics use case, the honest answer is usually that RAG is the wrong tool for analytics. Analytics is a database query problem. RAG is a documents problem.

If you don’t notice the difference, you can ship the wrong system. It looks like it works in demos, then produces wrong numbers in production, and you won’t catch which ones until somebody asks where the 12% revenue increase figure came from and nobody can reproduce it.