Per-token billing is lying to you: a 2026 guide to measuring LLM cost per task

Your provider invoice says you spent $400 on tokens last month, but it cannot tell you which feature ate the budget, which user is unprofitable, or which model is actually cheapest for your real workload. Per-token billing measures inputs; your business runs on tasks. Here is a source-checked guide to per-task LLM cost observability in 2026 — the metrics that matter, how to attribute cost across traces, and the tools (Langfuse, Helicone, Portkey, Datadog) that get you there without building a custom analytics team.

This is the third piece in a cluster on the 2026 LLM price war. The first, the strategic price-war analysis, argued that no single provider is dominant on price. The second, the API routing and fallback guide, argued that provider choice should be a reversible, config-driven decision. Both ended on the same operational claim: per-task cost, not per-token cost, is the only honest metric. This piece is the "how" of that claim — how to actually measure what a task costs you, so you can act on the price war instead of just reading about it.

My read, after going through the observability documentation and practitioner sources, is blunt: most teams cannot answer "what did summarizing this document actually cost us?" because their billing is token-level and their product is task-level. That gap is where the real money leaks. Per-task observability is the single highest-leverage thing you can build in 2026 if you spend more than a few hundred dollars a month on LLM APIs — more impactful than switching providers, more impactful than adding fallback, because it tells you whether either of those actually helped.

The core problem: your bill is in the wrong unit

LLM providers bill you per token: so many million input tokens, so many million output tokens, each at a published rate. That is the unit they invoice in. But your business does not ship tokens. It ships tasks: a summary, a classification, a code review, an agent run that looped three times. A single task can fan out into many token-charged calls — a prompt, a retrieval, a retry, a tool call, another model for verification — and your invoice flattens all of that into two numbers per model per month.

The consequence: the invoice tells you what you spent, but it cannot tell you why. It cannot tell you which feature is expensive, which user is unprofitable, which model looked cheap per token but ended up expensive per task because it retried constantly, or whether the fallback you added last week actually saved money or just shifted spend to a more expensive secondary. You are flying blind on the one decision the price war keeps forcing you to make: which provider to use for which work.

This is the gap per-task observability closes. Instead of aggregating tokens up to a monthly invoice, you aggregate tokens down to a task: every LLM call inside a task is tagged with the task id, and at the end you know what that specific task cost across every model, retry, and tool call it touched.

The metrics that actually matter

Forget vanity dashboards. The metrics that change decisions are:

Cost per completed task, by task type. What does a "summarize" task actually cost, end to end, including retries? This is the number that lets you compare models fairly: a model that is 2x cheaper per token but needs 3x as many calls to converge costs 1.5x more per task, and you can only see that at the task level.
Cost per task type, by model. Same task type, different models, ranked by what they actually cost you to complete the work — not by their price sheet. This is the input to cost-based routing.
P50 / P95 cost, not just average. Average cost hides the long tail. A task type with a cheap average but an expensive P95 — because some inputs blow up the context window or trigger retries — is a budget risk. Track percentiles, not just means.
Cost per user (or per feature, per tenant). This is the number that tells you which users or features are profitable. Without it, you cannot price a product, set quotas, or decide which feature to invest in.
Error and refusal rate, by model. A model that refuses or errors 15% of the time and triggers fallback is effectively more expensive than its per-token price suggests, because the fallback call is also billed.

Notice what is not on this list: total monthly token spend. That number belongs on your invoice, not on your decision dashboard. It tells you the score, not the play.
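
The percentile and fallback points above can be made concrete with a few lines of standard-library Python. This is a minimal sketch, assuming each completed task has already been reduced to a (task_type, model, cost) record. The record layout, the sample numbers, and the function names cost_stats and effective_cost are illustrative assumptions, not any tool's API. The fallback formula also assumes a failed primary call is still billed and is retried exactly once on the fallback.

```python
from collections import defaultdict
from statistics import mean, quantiles

# Hypothetical completed-task records: (task_type, model, total task cost in USD).
# 18 cheap runs plus 2 expensive ones that blew up the context or retried.
TASKS = [("summarize", "model-a", 0.01)] * 18 + [("summarize", "model-a", 0.20)] * 2


def cost_stats(records):
    """Group task costs by (task_type, model) and report mean, P50, and P95."""
    groups = defaultdict(list)
    for task_type, model, cost in records:
        groups[(task_type, model)].append(cost)
    stats = {}
    for key, costs in groups.items():
        # quantiles(n=100) returns 99 cut points: index 49 is P50, index 94 is P95.
        # It needs at least two data points per group.
        cuts = quantiles(costs, n=100, method="inclusive")
        stats[key] = {
            "n": len(costs),
            "mean": mean(costs),
            "p50": cuts[49],
            "p95": cuts[94],
        }
    return stats


def effective_cost(primary_cost, fallback_cost, failure_rate):
    """Expected cost per task when a failed primary call is billed,
    then retried once on a fallback model (a simplifying assumption)."""
    return primary_cost + failure_rate * fallback_cost


if __name__ == "__main__":
    for key, s in cost_stats(TASKS).items():
        print(key, s)
    print(effective_cost(0.010, 0.030, 0.15))
```

In this made-up sample the median task costs about $0.01 and the mean about $0.029, while P95 sits at $0.20. That gap is the long tail the average hides.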

How per-task attribution actually works

The mechanism is always the same, regardless of tool: a trace.

A trace is a tree of calls that share a root — the task the user or system initiated. When a user asks "summarize this document," you start a trace. Inside that trace, you log every LLM call, retrieval, tool call, and retry as a child span. Each span records its inputs, outputs, token counts, latency, model, and cost (computed from the provider's published price). When the trace completes, you sum the costs of all its spans to get the task's true cost.
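
The tree-and-sum idea is small enough to sketch directly. The example below is a hedged illustration rather than how any particular tool stores traces: the Span class, its fields, the span names, and the costs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One node in a trace: an LLM call, retrieval, tool call, or retry."""
    name: str
    cost_usd: float = 0.0
    children: List["Span"] = field(default_factory=list)


def total_cost(span: Span) -> float:
    """Sum a span's own cost and the cost of every descendant."""
    return span.cost_usd + sum(total_cost(child) for child in span.children)


# Hypothetical trace for one "summarize" task; the costs are made-up numbers.
trace = Span("summarize_document", children=[
    Span("retrieval", cost_usd=0.0002),
    Span("draft_llm_call", cost_usd=0.0040, children=[
        Span("retry_after_timeout", cost_usd=0.0040),
    ]),
    Span("verify_with_second_model", cost_usd=0.0010),
])

if __name__ == "__main__":
    print(f"task cost: ${total_cost(trace):.4f}")
```

Summing over the whole tree counts the retry and the verification call. Instrumenting only draft_llm_call would report less than half of what this task cost.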

This is exactly what Langfuse, Helicone, Portkey, Datadog LLM Observability, and the rest are doing under the hood. The differences are in ergonomics, hosting model, and how much of the instrumentation they automate. The shared principle — and the reason per-task attribution is even possible — is that cost is computed from token counts × published prices, then aggregated by trace, not read off a monthly invoice.
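
As a rough sketch of that shared principle, the snippet below prices each call from a price table and then groups the results by trace. The price table holds placeholder values, not real provider rates, and the function names and call-record fields are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical price table in USD per one million tokens. These are NOT real
# provider prices; replace them with current published rates.
PRICES_PER_MTOK = {
    "model-a": {"input": 1.00, "output": 4.00},
    "model-b": {"input": 0.25, "output": 1.00},
}


def call_cost(model, input_tokens, output_tokens, prices=PRICES_PER_MTOK):
    """Cost of one call: token counts times the per-million-token rates."""
    rate = prices[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000


def cost_by_trace(calls, prices=PRICES_PER_MTOK):
    """Aggregate per-call cost up to the trace (task) that each call belongs to."""
    totals = defaultdict(float)
    for call in calls:
        totals[call["trace_id"]] += call_cost(
            call["model"], call["input_tokens"], call["output_tokens"], prices
        )
    return dict(totals)


# Hypothetical logged calls: two in trace t1, one in trace t2.
CALLS = [
    {"trace_id": "t1", "model": "model-a", "input_tokens": 2000, "output_tokens": 500},
    {"trace_id": "t1", "model": "model-b", "input_tokens": 800, "output_tokens": 100},
    {"trace_id": "t2", "model": "model-b", "input_tokens": 4000, "output_tokens": 1000},
]

if __name__ == "__main__":
    print(cost_by_trace(CALLS))
```

Keeping the price table as explicit data makes it easy to see when it has gone stale, and to re-price old traces after a provider changes its rates.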

The 2026 tool landscape (with honest caveats)

The observability space is noisy. Here are the options worth knowing, with their actual sweet spots:

Tool	Type	Sweet spot	Source
Langfuse	Open-source, self-host or hosted	Per-trace cost attribution via OpenTelemetry spans; popular to self-host	Langfuse token & cost tracking docs
Helicone	Hosted proxy	Lightweight request logging, caching, cost tracking with low engineering overhead	Helicone vs Langfuse vs Cekura comparison
Portkey	Hosted gateway + observability	Cost observability plus the routing/fallback from the previous cluster piece	Portkey AI cost observability guide
Datadog LLM Observability	Hosted, in existing Datadog	Estimated cost per request inside dashboards teams already use for APM	Datadog LLM Observability cost docs
Braintrust	Hosted eval + observability	Combines cost tracking with evaluation; useful if you also run evals	Braintrust best tools for tracking LLM costs 2026

The honest caveat: most "best LLM observability tools 2026" listicles are vendor-affiliated or content-marketing driven. The capability claims above are verifiable on each vendor's own docs (Langfuse, Portkey, Datadog especially); the ranking orders in listicles should be read as marketing, not as neutral evaluations. A common practitioner pattern reported in the sources is pairing a gateway tool (Helicone or Portkey) for cost/routing with an evaluation tool (Phoenix, TruLens, or Braintrust) for quality — because cost without quality context is half the picture.

The sharp edges the marketing underplays

A few risks worth knowing before you adopt:

Cost estimates are estimates. Every tool computes cost as token count × a stored price. If the tool's price table is stale, or a provider changes pricing and the tool lags, your "cost" is wrong. Treat tool-reported cost as indicative, and reconcile against your real invoice monthly. Langfuse and Portkey document this explicitly; some tools do not.
Agent traces are where budgets actually blow up. A single user-facing task that loops an agent five times, calls retrieval twice, and verifies with a second model can spend 10x the naive "one LLM call" estimate. If you only instrument the outer call, you may see as little as a tenth of the real cost. Trace the whole tree.
Caching distorts cost-per-task in both directions. A cache hit makes a task look artificially cheap (good for your dashboard, bad for your capacity planning). A cold cache miss makes a task look artificially expensive. Track cache-hit rate alongside cost, or your averages will mislead you.
Attribution without quality is half a decision. "Model A is cheaper per task" is useless if Model A also produces worse outputs that require a human re-do. Always pair cost with a quality signal (eval score, user feedback, re-do rate) before you route on cost alone.
Self-hosting trades cloud cost for ops cost. Self-hosted Langfuse on one EC2 instance is a popular pattern, but you now operate a database, an ingestion path, and upgrades. For small teams the hosted tier is often cheaper in total cost of ownership than the "free" self-hosted option once you count ops time.
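
The caching point is easiest to see with numbers. Here is a minimal sketch, assuming each task record carries a total cost and a cache_hit flag and that the task list is non-empty. Both field names and the sample values are hypothetical. It reports the cache-hit rate next to the blended mean so neither number is read in isolation.

```python
from statistics import mean

# Hypothetical per-task records: total cost in USD and whether the task
# was served from cache. Values are made up for illustration.
TASKS = [
    {"cost_usd": 0.000, "cache_hit": True},
    {"cost_usd": 0.000, "cache_hit": True},
    {"cost_usd": 0.012, "cache_hit": False},
    {"cost_usd": 0.008, "cache_hit": False},
]


def cost_with_cache_context(tasks):
    """Report blended cost per task together with hit rate and hit/miss means."""
    hits = [t["cost_usd"] for t in tasks if t["cache_hit"]]
    misses = [t["cost_usd"] for t in tasks if not t["cache_hit"]]
    return {
        "cache_hit_rate": len(hits) / len(tasks),
        "blended_mean": mean(t["cost_usd"] for t in tasks),
        "mean_on_hit": mean(hits) if hits else None,
        "mean_on_miss": mean(misses) if misses else None,
    }


if __name__ == "__main__":
    print(cost_with_cache_context(TASKS))
```

With a 50% hit rate the blended mean here is half the cold-cache cost. If the hit rate drops, cost per task moves toward the miss figure, which is the number to plan capacity against.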

How to actually adopt this in 2026

The practical path I would give a team right now:

Instrument traces before you pick a tool. Even if you only log to a database you already have, start tagging every LLM call with trace_id, task_type, user_id, model, tokens in/out, and latency. This is the data every tool needs; collecting it now means adopting a tool later is a config change, not a refactor.
Compute cost from published prices, not from the invoice. Multiply token counts by the provider's current published price. This gives you per-call cost immediately, before any tooling, and it is the building block every tool uses anyway.
Pick the lightest tool that covers your stack. If you are already on Datadog for APM, its LLM Observability add-on is the option that introduces the least new tooling. If you want open-source and trace-level depth, self-host Langfuse. If you want it hosted and paired with routing, Portkey closes the loop with the fallback patterns from the previous piece.
Build the four decision dashboards, not ten vanity ones. Cost per task type (by model), cost per user/feature, P95 cost, and error/refusal rate by model. Anything that does not change a routing or pricing decision is noise.
Reconcile monthly. Compare your tool's estimated spend against your real invoice. If they drift more than a few percent, your price table or instrumentation is off.
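
The first and last steps above can be sketched without any vendor tooling. The record below carries the fields named in the first step. The class name, the JSON-lines output, and the 3% threshold are assumptions chosen for illustration ("a few percent" is not a precise figure), so treat this as a starting point rather than a reference schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class LLMCallRecord:
    """One logged LLM call, tagged with the trace (task) it belongs to."""
    trace_id: str
    task_type: str
    user_id: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float


def to_log_line(record: LLMCallRecord) -> str:
    """Serialize a record as one JSON line for whatever store you already have."""
    return json.dumps(asdict(record), sort_keys=True)


def invoice_drift(estimated_usd: float, invoiced_usd: float) -> float:
    """Relative gap between tool-estimated spend and the real invoice."""
    return (estimated_usd - invoiced_usd) / invoiced_usd


# Assumed reading of "a few percent"; tune to your own tolerance.
DRIFT_THRESHOLD = 0.03

if __name__ == "__main__":
    rec = LLMCallRecord("t1", "summarize", "user-42", "model-a", 2000, 500, 830.0)
    print(to_log_line(rec))
    drift = invoice_drift(estimated_usd=380.0, invoiced_usd=400.0)
    if abs(drift) > DRIFT_THRESHOLD:
        print(f"estimate is off by {drift:+.1%}; check price table and instrumentation")
```

A negative drift means the estimate is below the invoice, which usually points to uninstrumented calls or a price table that is out of date.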

My take

The 2026 price war keeps dropping per-token prices, but it also keeps making the per-token number less useful as a decision metric — because the cheaper tokens get, the more of them an agent will happily spend, and the harder it is to see what a task actually cost you. The teams that win the price war are not the ones who chase the cheapest provider. They are the ones who can see, in real time, what each task actually costs across every provider, retry, and tool call — and route accordingly.

If you only build one piece of LLM infrastructure in 2026, make it per-task cost observability. It is the prerequisite for everything else: cost-based routing, per-user pricing, feature profitability, and honest answers when the next provider price sheet drops and someone asks "should we switch?" Without per-task numbers, that question is a guess. With per-task numbers, it is a config change.

This piece closes a three-part cluster on the 2026 LLM price war. If you have not yet, read the strategic price-war analysis for the "why" and the API routing and fallback guide for the "how to move between providers." This piece is the "how to see what it actually costs." For a maintained reference of the providers involved, see our AI pricing data page.
