
Multi-provider LLM failover — the patterns that actually work in production

7 min read
resilience · failover · architecture · multi-provider

The pitch is intuitive: if Claude is down, route to GPT. If GPT is down, route to Gemini. Treat LLM providers like cloud regions, build a load balancer in front of them, and your AI features stay up while individual vendors have bad hours. We have seen this pitch many times. We have seen many implementations of it that did not actually work in production.

This post is about which failover patterns work, which look like they work but fail in subtle ways, and what the real cost of multi-provider resilience is.

The four patterns

Most multi-provider failover architectures fall into one of four shapes:

  1. Hot-hot routing — every request gets routed to whichever provider is “best” right now.
  2. Hot-cold failover — primary provider always; fallback only if primary is unhealthy.
  3. Hot-cold with retry-and-shed — primary first; on failure, try fallback within the same user-visible window.
  4. Multi-provider quorum — call multiple providers in parallel, return the first or pick the best.

Each has trade-offs. Three of them work in production for some use cases. One of them is the source of most failures.

Hot-hot routing — the trap

The pitch: send each request to whichever provider currently has the best latency, lowest error rate, or most available capacity. A simple load balancer with health checks.

This pattern fails because prompts are not interchangeable across providers. A prompt tuned for Claude does not produce equivalent output from GPT or Gemini. Different models have different system-prompt conventions, different tool-use formats, different tokenizer cost shapes, different safety behaviors, different output formatting habits. Routing the same prompt to different providers based on liveness produces user-visible quality variance that is much more disruptive than the occasional outage you were trying to route around.

We have watched several teams build hot-hot routing for LLMs, deploy it, and quietly remove it three months later because the inconsistency in output quality across providers became a top user complaint. The outages it routed around were a small fraction of the experience; the inconsistency it introduced was a permanent property of every request.

Hot-hot routing works for cloud regions because regions of the same provider produce identical compute. It does not work for LLM providers because different providers produce non-identical models.

Use this pattern only if your prompts are simple enough that all providers produce indistinguishable outputs, which in practice is rare and shrinking — anything involving tool use, structured output, or stylistic consistency disqualifies it.

Hot-cold failover — the simple case

The pitch: use Claude as the primary provider. If Claude is unavailable, fall back to a secondary (a different Claude tier, or a different provider with a separately maintained prompt). On every request, try the primary first; on failure, try the fallback.

This works, with two important constraints:

The fallback prompt must be separately maintained. You cannot reuse the Claude prompt against a different provider and expect equivalent results. The fallback path needs its own prompt, its own validation, and its own quality bar — possibly lower than the primary’s, which is fine if you are honest with users about it.

The fallback’s quality may be visibly worse. Users who see a degraded response during a Claude outage are getting a degraded response. Communicating this — “We’re temporarily using a backup model” — is more important than hiding it.

The simple version of hot-cold is:

try:
    response = call_claude(prompt_v1)      # primary path, Claude-tuned prompt
    return response
except (Overloaded, ServiceUnavailable):   # 529 / 503: provider-side failure
    response = call_fallback(prompt_v2)    # separately maintained fallback prompt
    return mark_degraded(response)         # flag it so the UI can say so
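The post does not define `mark_degraded`; one minimal shape for it, assuming responses are small mutable objects (all names here are illustrative, not part of any real SDK):

```python
from dataclasses import dataclass

@dataclass
class LLMResponse:
    text: str
    degraded: bool = False   # True when served by the fallback path

def mark_degraded(response: LLMResponse) -> LLMResponse:
    # Flag the response so the UI can show "temporarily using a backup
    # model" instead of silently serving lower-quality output.
    response.degraded = True
    return response
```

The flag is the whole point: it is what makes the honest communication described above possible.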

The cost is the second prompt and everything attached to it: its own validation, its own quality bar, and a code path that is exercised only during outages and therefore easy to neglect.

The benefit is real availability gains during single-provider outages. We have seen teams measure 0.5–2% additional uptime year-over-year from a well-implemented hot-cold pattern, which is meaningful for products with strong uptime SLAs.

Hot-cold with retry-and-shed — the real production pattern

The pitch: try the primary, retry briefly within budget, then fall back.

In production, hot-cold without retries fails over too eagerly. A 529 from Claude is often a microspike; the same request five seconds later succeeds. Falling back on the first error trades the primary's imminent recovery for the fallback's worse output.

The realistic shape is:

deadline = time.monotonic() + 8.0    # latency budget for the primary
attempt = 0

while time.monotonic() < deadline:
    response = call_claude(prompt_v1)
    if response.ok:
        return response               # primary recovered within budget
    if 400 <= response.status < 500:
        break                         # client error, fallback won't fix it
    sleep(jittered_backoff(attempt))
    attempt += 1

# primary did not recover within budget — shed to fallback
response = call_fallback(prompt_v2)
return mark_degraded(response)

The 8-second budget is calibrated against typical user latency tolerance. For interactive UI, 8 seconds is roughly the upper bound of what users will wait before perceiving the product as broken. Below 4 seconds, the budget is too tight to absorb microspikes; above 12 seconds, the user has already given up.
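The loop above assumes a `jittered_backoff` helper. One common shape for it (an assumption, not something the post specifies) is capped exponential backoff with full jitter:

```python
import random

def jittered_backoff(attempt: int, base: float = 0.5, cap: float = 4.0) -> float:
    # Exponential backoff with "full jitter": sleep a random amount between
    # 0 and min(cap, base * 2^attempt), so many clients retrying at once
    # spread out instead of hammering the recovering provider in lockstep.
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

With these defaults and an 8-second budget, a request typically gets a handful of attempts before shedding, which is usually enough to ride out a microspike.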

This pattern handles the three common failure shapes well: microspikes (the backoff absorbs them and the primary still answers), sustained outages (the budget expires and the request sheds to the fallback), and client errors (a 4xx breaks out immediately instead of burning the retry budget on an error the fallback cannot fix).

The cost is somewhat higher latency for the requests that hit the backoff, which is acceptable because the alternative (failing fast and falling back) produces worse output more often.

Multi-provider quorum — the expensive option

The pitch: call multiple providers in parallel, return the first response, or pick the best by some quality measure.

This works for high-stakes use cases but is genuinely expensive. You are paying for multiple providers’ inference on every request, regardless of whether you use the result. For consumer-scale use cases, the math is rarely justifiable.

A common variant: call providers in parallel only for queries where quality matters more than cost (e.g., a customer-facing answer where wrong is worse than expensive), and fall back to single-provider for cheaper queries (e.g., internal tools, batch processing). This works, but the routing decision adds complexity to your code that has to be maintained over time.

The most successful production version of quorum we have seen uses it only for a small fraction of high-value queries — single digits of percent — and uses hot-cold-with-retry for everything else. The pattern combines the cost discipline of hot-cold with the resilience of quorum where it actually matters.
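A hedged sketch of the "return the first response" variant, using a thread pool. `first_response` and the provider callables are illustrative names, not any real SDK, and in practice each provider would get its own separately maintained prompt rather than a shared one:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def first_response(providers, prompt, timeout=10.0):
    # Fan the same request out to every provider and return whichever
    # answers first; errors from slower or failed providers are ignored.
    # Note the cost property: you pay for every call you started,
    # whether or not you use the result.
    pool = ThreadPoolExecutor(max_workers=len(providers))
    futures = [pool.submit(p, prompt) for p in providers]
    try:
        for future in as_completed(futures, timeout=timeout):
            try:
                return future.result()
            except Exception:
                continue  # this provider failed; wait for the others
        raise RuntimeError("all providers failed")
    finally:
        # Don't block on the losers (cancel_futures needs Python 3.9+;
        # already-running calls still complete in the background).
        pool.shutdown(wait=False, cancel_futures=True)
```

"Pick the best by some quality measure" replaces the first `return` with collecting all results and scoring them, which doubles as a reminder of why this pattern is expensive: nothing is saved by finishing first.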

What to monitor

If you implement multi-provider failover, monitor: how often the fallback path actually fires, the latency your retries add during normal operation, the quality gap between primary and degraded responses, and the availability you measurably gain during provider outages.

Without this monitoring, you have no idea whether your failover is helping. We have seen teams with elaborate failover that, after measurement, turned out to make user latency strictly worse during normal operation in exchange for marginal availability gains during rare outages.
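This monitoring does not need to be heavyweight. A minimal sketch of the counters and derived rates (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class FailoverStats:
    # Counters the failover layer bumps on every request; the derived
    # ratios are the signals that tell you whether failover pays off.
    total: int = 0
    fell_back: int = 0   # requests served by the fallback path
    retried: int = 0     # requests that hit at least one backoff

    @property
    def fallback_rate(self) -> float:
        return self.fell_back / self.total if self.total else 0.0

    @property
    def retry_rate(self) -> float:
        return self.retried / self.total if self.total else 0.0
```

If `retry_rate` is high while `fallback_rate` is near zero, you are paying latency every day for resilience you almost never use, which is exactly the failure mode described above.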

When not to do this at all

Multi-provider failover is not free. Concrete costs: a second prompt maintained and validated in parallel, routing and health-check logic that must stay correct as providers change, monitoring for both paths, and retry latency paid during normal operation.

For most products, especially those with relaxed availability requirements (B2B SaaS with monthly SLAs, internal tools, batch pipelines), a simpler retry-and-fail strategy is more cost-effective than multi-provider. Tell users “AI features are temporarily unavailable, try again in a few minutes” during the small number of hours per quarter that any one provider is down. Most users accept this; very few accept “AI features are sometimes inconsistent because we’re routing across providers.”

Multi-provider is the right choice when:

  1. Your availability requirements are strict enough that “try again in a few minutes” is unacceptable.
  2. You have the engineering capacity to maintain and validate a separate prompt per provider.
  3. Your users prefer a visibly degraded answer over no answer at all.

When two of those three are not true, single-provider with good retry discipline produces a better product.

Using this dashboard in failover decisions

If you are building multi-provider failover with Claude as one of the providers, this dashboard is one of the cleanest signals available for “is Claude having a bad day.”

The integration shape we recommend: poll the public Statuspage summary.json (which feeds this dashboard) on a 30-second cache. When the API component’s indicator is major or critical, your failover layer can preemptively shift more traffic to the fallback rather than waiting to hit individual 529s. This converts an opaque “the system seems slow” signal into a deterministic “the system is publicly degraded, route to fallback” signal — and saves you the budget of 529 retries that would just fail anyway.
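A hedged sketch of that polling shape. The Statuspage URL is an assumption, the check uses the page-level indicator (extending it to a specific API component's status is straightforward), and the parsing is split out so it can be exercised without the network:

```python
import json
import time
import urllib.request

# Assumed URL; substitute the real Statuspage summary endpoint.
STATUS_URL = "https://status.anthropic.com/api/v2/summary.json"
DEGRADED = {"major", "critical"}

def is_degraded(summary: dict) -> bool:
    # Statuspage summaries carry a page-level indicator:
    # "none" / "minor" / "major" / "critical".
    return summary.get("status", {}).get("indicator") in DEGRADED

_cache = {"at": 0.0, "value": False}

def should_prefer_fallback(ttl: float = 30.0) -> bool:
    # Poll on a 30-second cache so the failover layer can shift traffic
    # preemptively instead of waiting to hit individual 529s.
    now = time.monotonic()
    if now - _cache["at"] >= ttl:
        with urllib.request.urlopen(STATUS_URL, timeout=5) as resp:
            _cache["value"] = is_degraded(json.load(resp))
        _cache["at"] = now
    return _cache["value"]
```

The failover layer then consults `should_prefer_fallback()` before the retry loop, skipping the primary's budget entirely when the provider is publicly degraded.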

Cross-reference with the same approach to OpenAI and Gemini status pages and you have the lightweight version of multi-provider routing without paying for parallel inference. Most of the value of multi-provider resilience comes from this kind of preemptive routing, not from runtime quorum. The status pages are the cheapest input to make those decisions well.
