
What 99.9% uptime actually buys you — three weeks of math

· 5 min read

Uptime percentages are tiny numbers. The difference between 99.5% and 99.9% looks like rounding error. Converted to time, the gap between those two numbers is the difference between a service that has 3.6 hours of downtime per month and one that has 43 minutes — a real, operational difference that changes how you architect on top of it.

This is the conversion table we wish more dashboards published, with notes on what each level actually buys you in practice.

The conversion

Per month (30 days = 43,200 minutes):

| Uptime | Allowed downtime per month | Per quarter | Per year |
|--------|----------------------------|-------------|----------|
| 99% | 7.2 hours | 21.6 hours | 3.65 days |
| 99.5% | 3.6 hours | 10.8 hours | 1.83 days |
| 99.9% | 43 minutes | 2.16 hours | 8.76 hours |
| 99.95% | 21 minutes | 1.08 hours | 4.38 hours |
| 99.99% | 4.3 minutes | 13 minutes | 52.6 minutes |
| 99.999% | 26 seconds | 78 seconds | 5.26 minutes |

The shape that matters: each “9” you add roughly cuts the allowed downtime by 10×. The cost of building toward each additional “9” goes up much faster than 10×.
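The whole table is arithmetic on the downtime fraction. A minimal sketch in Python (the function name is ours) that reproduces the monthly column:

```python
def allowed_downtime_minutes(uptime_pct: float, period_minutes: int = 43_200) -> float:
    """Downtime budget implied by an uptime percentage, over a 30-day month by default."""
    return period_minutes * (1 - uptime_pct / 100)

# Reproduce the monthly column of the table above.
for pct in (99, 99.5, 99.9, 99.95, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.1f} min/month")
```

Swap `period_minutes` for 129,600 or 525,600 to get the quarterly and yearly columns.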

Reading the table

A few practical translations:

99% — “we will have a bad day every month.” 7.2 hours of monthly downtime is roughly one bad workday per month. Quite a lot of consumer software, including many SaaS products in their first year, operates around this level. It is not catastrophic, but it is visible to users.

99.5% — “we will have a bad afternoon every month.” 3.6 hours per month is one extended outage during business hours, or a few shorter ones spread out. A reasonable target for early-stage products with limited operational maturity.

99.9% — “three nines.” 43 minutes per month. The conventional target for production B2B services. Achievable with careful engineering, on-call rotations, and deliberate operational practice. Most modern cloud services operate around this level.

99.95% — “three and a half nines.” 21 minutes per month. The level at which most public uptime SLAs from major cloud vendors are set, with service credits kicking in below it. Genuinely hard to hit; requires multi-region deployments, mature incident response, and careful change management.

99.99% — “four nines.” 4.3 minutes per month. The target for critical infrastructure — payment processors, financial trading systems, telecom. Requires global redundancy, regional failover, and significant investment in operational practice. Few organizations operate any service at this level for extended periods.

99.999% — “five nines.” 26 seconds per month. The aspiration for a small set of mission-critical systems where outages translate to lost lives or hundreds of millions in losses. Reaching this level reliably is a research problem, not an engineering project.

What this means for AI provider availability

Most public AI services, including Anthropic’s API, operate in the high 99.9% range over long windows, with occasional bad months that drop them lower. Looking at the 30-day uptime numbers we publish, recent figures have ranged roughly from 99.7% to 99.95%, depending on the month and the specific component.

If you are building on top of an AI service whose track record is roughly 99.9%, the implication is simple: plan for 30–60 minutes of downtime per month. That is the realistic operational envelope. Architect your application to either degrade gracefully during those windows, or make peace with users seeing failures during them.
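One way to make “degrade gracefully” concrete is a thin wrapper that falls back to a cheaper answer (a cached response, a canned message) when the upstream call fails. A hedged sketch — `flaky_upstream` and `cached_answer` are hypothetical stand-ins for whatever your application actually calls:

```python
def with_fallback(primary, fallback):
    """Return a callable that tries `primary` and degrades to `fallback` on any failure."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Upstream outage window: serve the degraded answer instead of an error.
            return fallback(*args, **kwargs)
    return wrapped

def flaky_upstream(prompt):
    raise TimeoutError("provider down")  # simulate an outage window

def cached_answer(prompt):
    return "(cached) service degraded, showing last known answer"

ask = with_fallback(flaky_upstream, cached_answer)
print(ask("hello"))  # degraded answer instead of a user-visible failure
```

Whether the fallback is a cache, a smaller model, or an apology banner is a product decision; the point is that the failure path is designed rather than accidental.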

The math compounds across multiple dependencies. A service that depends on three independent 99.9% upstreams has an effective availability of 0.999³ = 99.7%, which converts to 2.16 hours per month rather than 43 minutes. Each additional dependency further reduces the achievable ceiling. This is one reason multi-provider redundancy can pay off — it converts an AND of vendor uptimes into an OR.
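The dependency math above is a one-liner in either direction. A sketch (function names are ours), assuming failures are independent:

```python
from math import prod

def serial_availability(deps):
    """All dependencies must be up at once (AND): availabilities multiply."""
    return prod(deps)

def parallel_availability(providers):
    """At least one provider must be up (OR): unavailabilities multiply."""
    return 1 - prod(1 - a for a in providers)

print(serial_availability([0.999] * 3))    # ~0.997, i.e. ~2.16 h/month of exposure
print(parallel_availability([0.999] * 2))  # ~0.999999, assuming independent failures
```

The independence assumption is the catch in practice: two providers sharing a cloud region or a DNS vendor fail together, and the OR math stops applying.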

The cost curve, qualitatively

Going from 99% to 99.5% is mostly about not shipping obvious bugs. It costs a couple of engineers paying attention.

Going from 99.5% to 99.9% is about deploying carefully and having on-call. It costs a small operations function and a working incident-response practice.

Going from 99.9% to 99.95% is about multi-region failover, careful change management, and a culture of pre-mortem thinking. It costs significant engineering time on resilience features that do not directly produce user-visible value.

Going from 99.95% to 99.99% is about geographic redundancy, fault isolation, mature dependency management, and an organization-wide commitment to availability over feature velocity. The cost crosses into “this dominates engineering bandwidth” territory.

Going from 99.99% to 99.999% is mostly research. The number of organizations operating production systems at this level for sustained periods is small.

The cost curve is famously not linear. Each additional “9” is dramatically more expensive than the last. This is why most engineering teams aim for one specific tier and budget against it, rather than trying to push toward higher tiers indefinitely.

What error budgets do

Error budgets — the SRE practice popularized by Google — turn the uptime conversion table into a real operational practice. The premise is simple: if your target is 99.9% and the month so far has been clean, you have remaining budget for some downtime later in the month. If the month has already burned its budget, you should freeze risky changes for the rest of the month and focus on stability.

The error-budget framing makes the abstract uptime number actionable. Instead of “we should be more reliable,” teams can say “we have used 38 minutes of our 43-minute monthly budget; let’s hold the next deploy until next month.” The budget makes trade-offs explicit.
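A minimal budget tracker for the 99.9% example above — the class, the 85% freeze threshold, and the method names are our illustration, not a prescribed SRE API:

```python
class ErrorBudget:
    """Track monthly downtime against a target uptime percentage."""

    def __init__(self, target_pct: float, period_minutes: int = 43_200):
        self.budget_min = period_minutes * (1 - target_pct / 100)
        self.used_min = 0.0

    def record_outage(self, minutes: float) -> None:
        self.used_min += minutes

    @property
    def remaining_min(self) -> float:
        return self.budget_min - self.used_min

    def should_freeze(self, threshold: float = 0.85) -> bool:
        # Hold risky deploys once most of the month's budget is burned.
        return self.used_min >= threshold * self.budget_min

budget = ErrorBudget(99.9)
budget.record_outage(38)       # the 38-minute scenario from the text
print(budget.should_freeze())  # True: 38 of ~43 minutes is gone
```

The useful part is not the arithmetic but the policy hook: `should_freeze` gives the team a mechanical answer to “can we ship this week?”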

For consumers of an upstream provider, the same framing applies. If your application’s target is 99.9% and your upstream is 99.9%, your error budget is shared with the upstream — every minute of provider downtime that hits your users is a minute of your budget gone, regardless of whether it was your code’s fault. This is one reason architects sometimes push to inflate the dependency target (“we want to use a 99.99% upstream”) to preserve budget for their own internal failures.

The honest version of marketing claims

When a company says “99.99% uptime SLA” in marketing materials, the relevant questions are what counts as an outage, over what window the number is measured, and which incidents are excluded from the calculation.

We have written elsewhere about why our uptime number is lower than the marketing figure. The short version: we do not exclude partial outages or minor degradation. Marketing math typically does. Both are defensible; readers should know the difference exists.

The pragmatic recommendation

For most products building on top of AI providers, the right assumptions follow from the numbers above: treat the provider as a roughly 99.9% dependency, budget 30–60 minutes of downtime per month, and design the application to degrade gracefully inside that envelope.

The uptime number is more useful as a budget than as a target. The budget tells you what the realistic envelope is, and the realistic envelope tells you how to architect.
