The pitch is intuitive: if Claude is down, route to GPT. If GPT is down, route to Gemini. Treat LLM providers like cloud regions, build a load balancer in front of them, and your AI features stay up while individual vendors have bad hours. We have seen this pitch many times. We have seen many implementations of it that did not actually work in production.
This post is about which failover patterns work, which look like they work but fail in subtle ways, and what the real cost of multi-provider resilience is.
The four patterns
Most multi-provider failover architectures fall into one of four shapes:
- Hot-hot routing — every request gets routed to whichever provider is “best” right now.
- Hot-cold failover — primary provider always; fallback only if primary is unhealthy.
- Hot-cold with retry-and-shed — primary first; on failure, try fallback within the same user-visible window.
- Multi-provider quorum — call multiple providers in parallel, return the first or pick the best.
Each has trade-offs. Three of them work in production for some use cases. One of them is the source of most failures.
Hot-hot routing — the trap
The pitch: send each request to whichever provider currently has the best latency, lowest error rate, or most available capacity. A simple load balancer with health checks.
This pattern fails because prompts are not interchangeable across providers. A prompt tuned for Claude does not produce equivalent output from GPT or Gemini. Different models have different system-prompt conventions, different tool-use formats, different tokenizer cost shapes, different safety behaviors, different output formatting habits. Routing the same prompt to different providers based on liveness produces user-visible quality variance that is much more disruptive than the occasional outage you were trying to route around.
We have watched several teams build hot-hot routing for LLMs, deploy it, and quietly remove it three months later because the inconsistency in output quality across providers became a top user complaint. The outages it routed around were a small fraction of the experience; the inconsistency it introduced was a permanent property of every request.
Hot-hot routing works for cloud regions because regions of the same provider produce identical compute. It does not work for LLM providers because different providers produce non-identical models.
Use this pattern only if your prompts are simple enough that all providers produce indistinguishable outputs, which in practice is rare and shrinking — anything involving tool use, structured output, or stylistic consistency disqualifies it.
Hot-cold failover — the simple case
The pitch: use Claude as the primary provider. If Claude is unavailable, fall back to a secondary (a different Claude tier, or a different provider with a separately maintained prompt). On every request, try the primary first; on failure, try the fallback.
This works, with two important constraints:
The fallback prompt must be separately maintained. You cannot reuse the Claude prompt against a different provider and expect equivalent results. The fallback path needs its own prompt, its own validation, and its own quality bar — possibly lower than the primary’s, which is fine if you are honest with users about it.
The fallback’s quality may be visibly worse. Users who see a degraded response during a Claude outage are getting a degraded response. Communicating this — “We’re temporarily using a backup model” — is more important than hiding it.
The simple version of hot-cold is:
try:
    # Primary path: Claude, with the prompt tuned for Claude.
    response = call_claude(prompt_v1)
    return response
except (Overloaded, ServiceUnavailable):
    # Fallback path: a separately maintained prompt for the secondary provider.
    response = call_fallback(prompt_v2)
    return mark_degraded(response)
The cost is:
- Two prompts to maintain.
- One contract with a second provider (with its own quotas, billing, and integration).
- Quality variance during fallback periods.
The benefit is real availability gains during single-provider outages. We have seen teams measure 0.5–2% additional uptime year-over-year from a well-implemented hot-cold pattern, which is meaningful for products with strong uptime SLAs.
Hot-cold with retry-and-shed — the real production pattern
The pitch: try the primary, retry briefly within budget, then fall back.
In production, hot-cold without retries fails too eagerly. A 529 from Claude is often a microspike; the next request five seconds later succeeds. Falling back on the first error gives up on a primary that is about to recover and serves the fallback's worse output instead.
The realistic shape is:
deadline = now() + 8                 # seconds of budget for the primary
attempt = 0
while now() < deadline:
    try:
        response = call_claude(prompt_v1)
        return response
    except ClientError:
        break                        # 4xx: client error, fallback won't fix it
    except (Overloaded, ServiceUnavailable):
        sleep(jittered_backoff(attempt))
        attempt += 1

# primary did not recover within budget — shed to fallback
response = call_fallback(prompt_v2)
return mark_degraded(response)
The 8-second budget is calibrated against typical user latency tolerance. For interactive UI, 8 seconds is roughly the upper bound of what users will wait before perceiving the product as broken. Below 4 seconds, the budget is too tight to absorb microspikes; above 12 seconds, the user has already given up.
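For completeness, one plausible shape for the jittered_backoff helper used in the sketch above: full jitter over an exponential ceiling, with illustrative base and cap values chosen so a handful of attempts fits inside the 8-second budget.

import random

def jittered_backoff(attempt, base=0.5, cap=4.0):
    # Full jitter: sleep somewhere between 0 and the exponential ceiling,
    # so retries from many clients do not re-synchronize into the same spike.
    return random.uniform(0, min(cap, base * (2 ** attempt)))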
This pattern handles the three common failure shapes well:
- Microspike: retries succeed within budget, primary serves the request, no fallback needed.
- Sustained outage: budget elapses, fallback serves with degraded marker.
- Client error (4xx): break out immediately, do not waste budget on retries that cannot succeed.
The cost is somewhat higher latency for the requests that hit the backoff, which is acceptable because the alternative (failing fast and falling back) produces worse output more often.
Multi-provider quorum — the expensive option
The pitch: call multiple providers in parallel, return the first response, or pick the best by some quality measure.
This works for high-stakes use cases but is genuinely expensive. You are paying for multiple providers' inference on every request, regardless of whether you use the result. For consumer-scale use cases, the cost is rarely justifiable.
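A minimal asyncio sketch of the return-the-first-response variant; the provider coroutines and their failure types are placeholders here, and each would wrap its own separately maintained prompt, per the constraints above.

import asyncio

async def first_successful(provider_calls, timeout=10.0):
    # provider_calls: one argument-free coroutine function per provider,
    # each already bound to its own separately maintained prompt.
    tasks = [asyncio.create_task(call()) for call in provider_calls]
    try:
        for next_done in asyncio.as_completed(tasks, timeout=timeout):
            try:
                return await next_done   # first provider to answer wins
            except Exception:            # a failed or timed-out provider drops out
                continue
        raise RuntimeError("no provider answered within the timeout")
    finally:
        for task in tasks:
            task.cancel()                # stop paying for the losers

The pick-the-best variant waits for every task and scores the results instead of returning on the first, which is exactly where the cost multiplies.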
A common variant: call providers in parallel only for queries where quality matters more than cost (e.g., a customer-facing answer where wrong is worse than expensive), and fall back to single-provider for cheaper queries (e.g., internal tools, batch processing). This works, but the routing decision adds complexity to your code that has to be maintained over time.
The most successful production version of quorum we have seen uses it only for a small fraction of high-value queries — single digits of percent — and uses hot-cold-with-retry for everything else. The pattern combines the cost discipline of hot-cold with the resilience of quorum where it actually matters.
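As a sketch, that hybrid reduces to a small routing decision on top of the two patterns above; is_high_value, quorum_answer, and hot_cold_with_retry are hypothetical stand-ins for your own query classifier and the two paths already described.

def answer(query):
    # Reserve the expensive quorum path for the small, high-value slice;
    # everything else takes the hot-cold-with-retry default.
    if is_high_value(query):
        return quorum_answer(query)
    return hot_cold_with_retry(query)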
What to monitor
If you implement multi-provider failover, monitor:
- Failover rate — percentage of requests that fall back to the secondary. During normal operation it should be very low (under 1%). Spikes indicate primary issues.
- Quality delta — some metric of output quality on the fallback path vs. primary. If you cannot measure this, you cannot tell whether your fallback is actually serving users or making things worse.
- Cost per request, broken down by path — primary, primary-with-retries, fallback, fallback-with-retries. The retry path is often the most expensive and the most overlooked.
- End-to-end latency, broken down by path — fallback paths are often slower than primary paths even on success.
Without this monitoring, you have no idea whether your failover is helping. We have seen teams with elaborate failover that, after measurement, turned out to make user latency strictly worse during normal operation in exchange for marginal availability gains during rare outages.
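A minimal sketch of the per-path accounting behind three of those four signals (the quality delta needs its own evaluation harness), assuming a simple in-process tally; in practice you would emit the same path labels to whatever metrics system you already run.

from collections import defaultdict

# Path labels: "primary", "primary_retry", "fallback", "fallback_retry".
request_count = defaultdict(int)    # failover rate comes from these counts
cost_usd = defaultdict(float)       # spend per path
latency_s = defaultdict(list)       # end-to-end latency samples per path

def record(path, cost, latency):
    request_count[path] += 1
    cost_usd[path] += cost
    latency_s[path].append(latency)

def failover_rate():
    total = sum(request_count.values())
    fell_back = request_count["fallback"] + request_count["fallback_retry"]
    return fell_back / total if total else 0.0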
When not to do this at all
Multi-provider failover is not free. Concrete costs:
- Two API contracts and two billing relationships.
- Two prompt corpora to maintain in lockstep.
- Two safety reviews if your use case is sensitive.
- Two sets of monitoring and alerting.
- A failover code path that is, by definition, the least-tested code path in your application.
For most products, especially those with relaxed availability requirements (B2B SaaS with monthly SLAs, internal tools, batch pipelines), a simpler retry-and-fail strategy is more cost-effective than multi-provider. Tell users “AI features are temporarily unavailable, try again in a few minutes” during the small number of hours per quarter that any one provider is down. Most users accept this; very few accept “AI features are sometimes inconsistent because we’re routing across providers.”
Multi-provider is the right choice when:
- Your contracted availability target is high enough that single-provider outages would breach it.
- Your use case can tolerate quality variance on the fallback path.
- You have the engineering bandwidth to maintain two prompt paths.
When two of those three are not true, single-provider with good retry discipline produces a better product.
Watching this dashboard during failover decisions
If you are building multi-provider failover with Claude as one of the providers, this dashboard is one of the cleanest signals available for “is Claude having a bad day.”
The integration shape we recommend: poll the public Statuspage summary.json (which feeds this dashboard) on a 30-second cache. When the API component’s indicator is major or critical, your failover layer can preemptively shift more traffic to the fallback rather than waiting to hit individual 529s. This converts an opaque “the system seems slow” signal into a deterministic “the system is publicly degraded, route to fallback” signal — and saves you the budget of 529 retries that would just fail anyway.
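A sketch of that poll, assuming the standard public Statuspage endpoint at status.anthropic.com/api/v2/summary.json; the exact component naming is an assumption worth checking against the live payload, so the check below also falls back to the page-level indicator.

import time
import requests

STATUS_URL = "https://status.anthropic.com/api/v2/summary.json"  # standard Statuspage path
_cache = {"checked_at": 0.0, "degraded": False}

def claude_publicly_degraded(max_age_s=30):
    # Poll the public summary on a 30-second cache; report True when the page
    # indicator is major/critical or an API-named component reports an outage.
    now = time.monotonic()
    if now - _cache["checked_at"] > max_age_s:
        try:
            summary = requests.get(STATUS_URL, timeout=2).json()
        except requests.RequestException:
            return _cache["degraded"]   # keep the last known answer if the status page is unreachable
        page_bad = summary["status"]["indicator"] in ("major", "critical")
        api_bad = any(
            "api" in c["name"].lower()
            and c["status"] in ("partial_outage", "major_outage")
            for c in summary.get("components", [])
        )
        _cache.update(checked_at=now, degraded=page_bad or api_bad)
    return _cache["degraded"]

Your failover layer then checks claude_publicly_degraded() before spending its retry budget, which is the preemptive shift described above.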
Cross-reference with the same approach to OpenAI and Gemini status pages and you have the lightweight version of multi-provider routing without paying for parallel inference. Most of the value of multi-provider resilience comes from this kind of preemptive routing, not from runtime quorum. The status pages are the cheapest input to make those decisions well.