Just nowRead time 8 min

Stop hard-coding one LLM provider: a 2026 guide to API routing and fallback

In the 2026 LLM price war, the cheapest provider changes every month. The teams that win are not the ones who pick the right lab — they are the ones whose code can move when the leaderboard moves. Here is a practical, source-checked guide to multi-provider API routing: when to use a gateway (LiteLLM, OpenRouter, Portkey, Vercel AI Gateway), how to structure a fallback chain, and the four routing patterns that actually matter for production.

AI developer-tools Model news

LLM API routing and gateway cover

This is the practical follow-up to my earlier 2026 LLM API price war analysis. That piece argued the strategic point: equivalent intelligence is getting cheaper, no single provider is dominant on price, and the win is building vendor-portable workflows. This piece is the "how": how to actually route between OpenAI, Anthropic, Google, and challengers like DeepSeek or Z.AI without rewriting your code every time a price sheet changes.

My read, after going through the available gateway documentation and comparison sources, is blunt: most teams do not need a fancy setup. They need one thin abstraction, one fallback chain, and the discipline to track per-task cost instead of per-token cost. The gateway market (LiteLLM, OpenRouter, Portkey, Vercel AI Gateway, and newer entries like Bifrost) is real and useful, but a lot of teams reach for a gateway when a 50-line wrapper would do. Let me separate the two.

The core problem routing solves

Without routing, your code looks like this: you import the OpenAI SDK, hard-code a model id, and call it. When OpenAI raises prices, or Anthropic releases something cheaper for your workload, or your provider has an outage, you face a multi-week refactor across every call site. That refactor cost is the hidden tax of provider lock-in.

Routing solves three concrete problems:

Price mobility. When the leaderboard moves, you move traffic in a config file, not in source code.
Reliability. When one provider returns 429 (rate limit) or 5xx, your request automatically retries against a second provider instead of failing in front of a user.
Capability matching. Different tasks deserve different models — a cheap fast model for classification, a strong model for synthesis, a vision model for images — and routing lets you express that as policy instead of sprinkling if/else across your codebase.

The third one is underrated. Most teams only think about routing for failover, but task-based routing is where the real cost savings live once you have multiple providers wired in.

When you need a gateway vs. a thin wrapper

Two honest tiers:

Tier 1 — a thin wrapper is enough. If you have one main provider and one fallback, and your call volume is modest, write a 50-line client that exposes a single complete(prompt, { task, model_hint }) function and internally picks a provider. Store provider credentials and routing rules in env vars or a small config. This forces every call site to go through one boundary, which is the only structural commitment that matters. You get price mobility and basic fallback without a new dependency.

Tier 2 — you want a real gateway. The moment you need cross-provider observability (per-task token cost, latency, error rate by provider), automatic load balancing across more than two providers, protocol conversion (Anthropic Messages API ↔ OpenAI Chat Completions), caching, or a UI for non-engineers to change routing, a gateway pays for itself. The 2026 options worth knowing:

Gateway	Type	Sweet spot	Source
LiteLLM	Open-source, self-host or hosted	Broad provider coverage (100+), OpenAI-compatible unified API	BerriAI/litellm on GitHub
OpenRouter	Hosted marketplace	One API key, many models, pay-as-you-go across providers	OpenRouter
Portkey	Hosted + observability	Production fallback configs, prompt management, analytics	Portkey docs
Vercel AI Gateway	Hosted, edge-native	If you are already on Vercel and want routing close to your deploy	Vercel AI Gateway
Bifrost	Newer entrant	Automatic failover emphasized	Bifrost comparisons

The honest caveat: most "best LLM gateway 2026" listicles are vendor-affiliated or content-marketing driven. The provider-coverage and OpenAI-compatible-API claims for LiteLLM are verifiable on its GitHub repo and docs; the others I would verify against each vendor's own docs before committing. Treat ranking orders in listicles as marketing, not as neutral evals.

The four routing patterns that actually matter

Regardless of which tool you pick, the routing logic falls into four patterns. Most production setups combine several.

1. Fallback chain (reliability)

Ordered list of providers; if the first fails (429, 5xx, timeout), try the next. This is the minimum viable routing pattern.

fallback: [anthropic-claude, openai-gpt, deepseek-v4]

The key engineering detail: define "failure" precisely. A 429 is a clear retry signal. A 200 with a truncated or refused answer is not a retry signal unless you add output validation. Many teams set up fallback for HTTP errors but forget that a provider can return a "successful" empty or refusal response that still needs fallback to be useful.

2. Cost-based routing (price)

Pick the cheapest provider that clears your quality bar for this task. This is where per-task cost tracking pays off — you cannot do cost-based routing if you do not know which provider is actually cheapest per completed task on your real workload, retries included.

task: "summarize"     → cheapest model that passes summarize eval
task: "code-generate" → strong model, accept higher cost
task: "classify"      → smallest fast model

3. Latency-based routing (speed)

Route to whichever provider responds fastest for this request, often region-aware. Useful for user-facing chat where perceived latency matters more than marginal cost.

4. Weighted / load balancing (throughput)

Split traffic across providers by percentage to stay under rate limits or to spread spend. Useful at scale; overkill for small teams.

The practical pattern I would recommend for most teams in 2026: start with pattern 1 (fallback chain) for reliability, add pattern 2 (cost-based routing) once you have per-task cost data, and ignore patterns 3 and 4 until you have a concrete latency or throughput problem.

A concrete fallback config (Portkey-style)

To make this less abstract, here is the shape of a real fallback config, adapted from Portkey's documented pattern. The exact field names vary by vendor, but the structure is universal:

{
  "strategy": { "mode": "fallback" },
  "targets": [
    { "provider": "anthropic", "model": "claude-sonnet-4-6", "override_params": { "max_tokens": 1024 } },
    { "provider": "openai", "model": "gpt-5-4" },
    { "provider": "deepseek", "model": "deepseek-v4" }
  ]
}

What this buys you: if Anthropic returns 429 or 5xx, the gateway automatically retries against OpenAI, then DeepSeek. Your application code calls one endpoint; the gateway handles the rest. That is the entire value proposition in one paragraph. Source for this pattern: Portkey's Claude Code + Anthropic integration docs, which documents the fallback mode explicitly.

Protocol conversion: the hidden tax

One non-obvious cost: OpenAI and Anthropic do not share a request/response shape. OpenAI uses Chat Completions; Anthropic uses Messages. If you route between them, something has to translate. Three options:

A gateway that translates (LiteLLM, Portkey, API7, Vercel AI Gateway all do this). You write to one shape; the gateway converts.
Provider SDKs that share an interface (the Vercel AI SDK and LangChain both abstract this, at the cost of being limited to what their abstractions expose).
Manual translation in your wrapper. Works for simple cases, gets painful fast for tool calls, images, and streaming.

For most teams, option 1 (a translating gateway) or option 2 (an SDK abstraction) is the right call. Manual translation is a maintenance trap.

The sharp edges that are not in the marketing copy

A few risks the gateway vendors underplay:

A gateway is a new dependency and a new failure mode. If your gateway goes down, all your providers are unreachable behind it. For self-hosted LiteLLM this means you operate an additional service; for hosted gateways this means you inherit their outage surface. Treat gateway uptime as part of your reliability budget.
Protocol conversion is lossy at the edges. Tool calls, structured outputs, image inputs, and streaming semantics do not map perfectly between providers. Test your actual prompts through the conversion layer, not just "hello world."
Caching can quietly serve stale answers. Many gateways offer semantic caching to cut cost. A cache hit that returns a subtly outdated answer is worse than a fresh call, especially for anything time-sensitive or user-specific. Cache aggressively for classification and summarization; cache cautiously (or not at all) for personalized or factual-Q&A tasks.
Cheaper-on-paper can cost more in retries. A model that is 3x cheaper but needs 2x more calls to converge on a usable answer is not actually cheaper. Per-task cost, not per-token cost, is the only honest metric.
Region and account risk does not vanish with a gateway. If your gateway is hosted in one region and that region loses access to a provider, your fallback chain is only as good as the regions your gateway can reach. Diversity in providers and in deployment regions is the real reliability measure.

How to actually adopt this in 2026

The practical adoption path I would give a team right now:

First, build the boundary — even if you do not add a gateway yet. Refactor every model call to go through one internal function or module. This is the single highest-leverage change and costs nothing.
Add a second provider behind that boundary as a fallback. Even one fallback dramatically improves reliability for user-facing features.
Instrument per-task cost, latency, and error rate by provider. You cannot optimize what you cannot see.
Only then evaluate a gateway. If observability, protocol conversion, or multi-provider load balancing is now painful, adopt LiteLLM (self-host), Portkey (hosted), Vercel AI Gateway (if on Vercel), or OpenRouter (broadest marketplace). If it is not painful, the thin wrapper continues to win.
Re-evaluate routing quarterly. Prices move. New models ship. The leaderboard inverts. Your routing config should be a living document, reviewed when a major model launches or a provider changes pricing.

My take

The teams that come out of the 2026 price war in the best shape are not the ones who picked the "best" gateway. They are the ones who built the boundary early — one internal place where provider choice lives — and then filled in routing, fallback, and cost tracking behind that boundary as they learned what they actually needed. A gateway is a tool, not a strategy. The strategy is making provider choice a reversible, config-driven decision.

If you want a maintained reference for the providers in this article, including the ones tracked on our site, see our AI pricing data page. And if you have not yet, the companion piece on why this matters strategically is worth reading first — this guide is the "how," that one is the "why." For the third piece in this cluster — how to measure what each task actually costs you, so you can act on routing decisions instead of guessing — see Per-token billing is lying to you: a 2026 guide to measuring LLM cost per task.

Sources

RAG did not solve hallucinations — it moved them: a 2026 guide to diagnosing why your retrieval-augmented generation fails in production

Your RAG demo worked on three PDFs and broke on the real corpus. That is not a mystery; it is the predictable cost of treating retrieval as a default instead of an engineering decision. Industry analysis in 2026 finds that when RAG fails, the failure point is retrieval roughly seven times in ten — not generation. Here is a source-checked diagnostic guide to production RAG in 2026: where it actually breaks (chunking, embedding, retrieval, staleness), the metrics that locate the break, and why RAG did not eliminate hallucinations so much as relocate them somewhere harder to see.

Every 512 tokens is not a chunking strategy: a 2026 practical guide to choosing how to split your documents for RAG

Chunking is the single highest-leverage and most under-treated decision in a RAG pipeline, and most teams leave it on the default. Here is a source-checked 2026 guide to the five chunking strategies that actually matter — fixed, recursive, semantic, late, and proposition-based — when to use each, the retrieval-quality tradeoffs, and why the right answer is never 'whatever the tutorial used.'