
Why our status page does not lag — the cache strategy in two layers

7 min read
caching · performance · engineering · kv

The single worst failure mode for a status dashboard is being slow during the exact moment users need it. If claude.ai is having a bad day and a user types “is claude down” into Google and lands on this page, the page absolutely cannot also be slow. A status page that lags during outages is worse than no status page — it confirms the user’s worst suspicion (“everything is broken, even the meta-thing”) at the moment they were hoping for clarity.

We hit that bar by being aggressive about caching, and disciplined about which layer the cache lives in. This post walks through the two layers, the trade-offs, and the specific TTL choices that turned out to matter.

The constraint

Every page view is rendered on the server, on demand. We made this choice deliberately. Static generation would be slightly faster for most reads, but it would lag the underlying data — every render needs to reflect Statuspage state from at most two minutes ago. ISR (incremental static regeneration) would compromise on freshness across edge regions in unpredictable ways. SSR with aggressive caching gives us tight bounds on both freshness and latency.

The constraint, then: every render must either hit a fresh cache (sub-100ms) or fall through to upstream APIs (a few hundred ms but rare). The shape of the system has to make the cache hit the common case.

Layer 1 — Cloudflare KV, in front of every upstream

Cloudflare KV sits in front of every external API call the dashboard makes. Specifically:

- Statuspage data (summary, incidents, unresolved), cached for 120 seconds
- Latency probe results from check-host.net, cached for 30 minutes
- The RSS feed, cached for 5 minutes

Each of these is read on the path of a page render. Misses fall through to the upstream. Hits return immediately.

The choice of Cloudflare KV over Vercel KV (which is now Redis-backed) was driven by three things:

The TTL choices are not arbitrary. 120 seconds for Statuspage data is calibrated against the upstream’s documented CDN cache (~10 seconds) and the user’s tolerance for staleness (most users will not notice a 2-minute-old status indicator during a stable period, and during an active incident the on-page “Refreshed: HH:MM:SS UTC” timestamp tells them exactly how stale). 30 minutes for latency probes matches the cron cadence (5 minutes) plus a buffer. 5 minutes for RSS matches typical RSS reader poll intervals.
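
In code, each of those reads follows the same read-through shape. A minimal sketch in TypeScript, assuming a small kv wrapper over the store; the wrapper, helper, and key names are illustrative rather than the actual implementation:

// Hypothetical wrapper over the KV store.
interface KV {
  get<T>(key: string): Promise<T | null>;
  put(key: string, value: unknown, opts: { expirationTtl: number }): Promise<void>;
}

// The TTLs from the paragraph above, in seconds.
const TTL_SECONDS = {
  statuspage: 120,  // upstream CDN caches ~10s; users tolerate ~2 minutes of staleness
  latency: 1800,    // 5-minute cron cadence plus buffer
  rss: 300,         // typical RSS reader poll interval
};

// Read-through: hits return immediately, misses fall through to the upstream
// and write back without blocking the response.
async function readThrough<T>(
  kv: KV,
  key: string,
  ttlSeconds: number,
  fetchUpstream: () => Promise<T>,
): Promise<T> {
  const cached = await kv.get<T>(key);
  if (cached !== null) return cached;

  const fresh = await fetchUpstream();
  void kv.put(key, fresh, { expirationTtl: ttlSeconds }).catch(() => {});
  return fresh;
}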

Layer 2 — single combined KV key for Statuspage data

The most impactful single change we made was collapsing three separate KV keys (summary, incidents, unresolved) into one combined key, written and read as a unit.

Before:

page render → 3 parallel KV reads → render
on miss     → 3 parallel API calls → 3 parallel KV writes → render

After:

page render → 1 KV read → render
on miss     → 3 parallel API calls → 1 KV write → render

The optimization is small in raw latency terms. Three parallel KV reads at ~30ms each are bound by the slowest of the three, so call it 30–40ms total; a single combined read sits at the low end of that range because there is no slowest-of-three to wait for. Marginally faster, nothing more.

What it actually bought us is operational simplicity during partial cache misses. With three keys, you can have one expire and the others not, leaving you in a state where the summary thinks there is one incident and the incident list disagrees. Single-key writes mean the data is either entirely fresh or entirely stale, never half of each.

That property — atomic freshness — turned out to be the difference between a dashboard you can trust during fast-moving incidents and one that occasionally shows internally-inconsistent state. We learned this the hard way during a multi-component incident where the summary key was 90 seconds newer than the incident list key, and the page rendered with a major incident’s title but the wrong status indicator.
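
In code, the combined refresh is three parallel fetches and one write. A sketch only, reusing the hypothetical kv wrapper and TTL_SECONDS from the earlier snippet; the endpoint paths are the standard Statuspage public API, and the bundle shape and key name are illustrative:

// The three Statuspage resources, stored and expired as a single unit.
interface StatuspageBundle {
  summary: unknown;
  incidents: unknown;
  unresolved: unknown;
  fetchedAt: string;  // our write time; the page header shows Anthropic's updated_at instead
}

const statuspageJson = (path: string) =>
  fetch(`https://status.anthropic.com/api/v2/${path}`).then((r) => r.json());

// Three parallel upstream calls, one KV write: the bundle is fresh or stale as a whole.
async function refreshStatuspage(kv: KV): Promise<StatuspageBundle> {
  const [summary, incidents, unresolved] = await Promise.all([
    statuspageJson("summary.json"),
    statuspageJson("incidents.json"),
    statuspageJson("incidents/unresolved.json"),
  ]);

  const bundle: StatuspageBundle = {
    summary,
    incidents,
    unresolved,
    fetchedAt: new Date().toISOString(),
  };
  void kv.put("statuspage:combined", bundle, { expirationTtl: TTL_SECONDS.statuspage }).catch(() => {});
  return bundle;
}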

Non-blocking writes

Every write to KV is non-blocking — the call site does not await the write, and the write’s promise is not surfaced. If KV is unavailable, the write fails silently and the next request will trigger a refresh upstream.

This removes an entire failure mode: KV is not on the user’s critical path, ever. A KV outage during a Claude outage would be the worst possible compounding event — user lands on the dashboard during an Anthropic incident, KV is also down, and the page either errors or hangs. Non-blocking writes mean the worst case is “every render hits upstream Statuspage directly, latency is a few hundred ms instead of sub-100, but the page still renders.”
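
The write path is the same few lines everywhere. A sketch, again assuming the hypothetical kv wrapper:

// Fire-and-forget. The render path never awaits this, and a failure is silent:
// the next request simply falls through to the upstream and tries the write again.
function writeBack(kv: KV, key: string, value: unknown, ttlSeconds: number): void {
  kv.put(key, value, { expirationTtl: ttlSeconds }).catch(() => {
    // KV being down must never block or fail a render.
  });
}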

The trade is that during sustained high traffic, a KV outage degrades to “every request hits upstream.” Statuspage’s public API is documented as having no rate limit, which makes this trade acceptable. Most providers’ public APIs have rate limits that would not survive the fall-through pattern; for them, this design would not be safe.

Why not “stale-while-revalidate” everywhere

A reasonable question. SWR is the standard pattern for this kind of dashboard: serve the cached value immediately, then refresh in the background, so readers never wait and the cache stays close to fresh.

We use SWR-like patterns in two places (RSS, latency cache) but not for the main page render. The reason is the freshness contract on the page header: when the dashboard says “Refreshed: 14:32:18 UTC,” that timestamp is the summary.page.updated_at from Anthropic, not our cache write time. With pure SWR, an expired entry would still be served while the background refresh ran, so the page could render state older than our 120-second bound while the header presented it as current. The user would see a recent-looking page and stale data.

The current model — block on the freshness check, fetch upstream if expired, then render — gives the user a strong guarantee that the timestamp they are reading is no more than 120 seconds behind reality. It costs us a few hundred ms on cache miss, which is acceptable because the cache miss is rare under any realistic traffic.
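
Put together, the main page’s data fetch is a blocking freshness check rather than SWR. A sketch, reusing the hypothetical kv wrapper and refreshStatuspage from above:

// Render path: either the cached bundle (guaranteed under 120s old by the KV TTL)
// or a blocking upstream refresh. Never a stale bundle presented as current.
async function getStatuspageForRender(kv: KV): Promise<StatuspageBundle> {
  const cached = await kv.get<StatuspageBundle>("statuspage:combined");
  if (cached !== null) return cached;
  return refreshStatuspage(kv);  // rare miss: a few hundred ms, then render
}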

Cron-driven cache warming

The latency probe runs on a 5-minute Vercel Cron trigger. This is independent of user requests — even if no one visits the page, the data in KV is at most five minutes old, because the cron keeps writing it. The on-demand “refresh” button on the latency widget is a courtesy, not a primary path.
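
The wiring is ordinary Vercel Cron: a schedule in vercel.json plus a route handler that runs the probe and writes KV. A sketch under assumptions; the route path, runLatencyProbe, and kv are illustrative names, and the CRON_SECRET check follows Vercel’s documented pattern for protecting cron routes:

// vercel.json (schedule only; the path is illustrative)
// { "crons": [{ "path": "/api/cron/latency", "schedule": "*/5 * * * *" }] }

// app/api/cron/latency/route.ts
// Assumed to exist elsewhere in the codebase; names are illustrative.
declare const kv: KV;
declare function runLatencyProbe(): Promise<unknown>;

export async function GET(request: Request): Promise<Response> {
  // Only Vercel Cron (which sends CRON_SECRET as a bearer token) may trigger the probe.
  if (request.headers.get("authorization") !== `Bearer ${process.env.CRON_SECRET}`) {
    return new Response("Unauthorized", { status: 401 });
  }
  const probe = await runLatencyProbe();  // one outbound call to check-host.net
  // The cron is not on any user's critical path, so it can afford to await the write.
  await kv.put("latency:latest", probe, { expirationTtl: TTL_SECONDS.latency });
  return Response.json({ ok: true });
}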

This pattern is the right shape for any data that has to be globally consistent and updated on a schedule. It moves the work off the user-request path entirely. The user’s GET to /api/latency is always reading from KV; it does no upstream work, regardless of how stale the data is. The cron handles freshness asynchronously.
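
The user-facing read is correspondingly boring. A sketch of a hypothetical app/api/latency/route.ts:

// Always a KV read, never upstream work; `kv` is the same hypothetical wrapper as above.
declare const kv: KV;

export async function GET(): Promise<Response> {
  const latest = await kv.get<unknown>("latency:latest");
  // If the cron has not written recently (or ever), disclose that rather than probing inline.
  return Response.json(latest ?? { status: "unavailable" });
}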

The trade is cost: we pay for the cron run every 5 minutes whether anyone is watching or not. This is fine because the cron itself is cheap (one outbound HTTP call to check-host.net plus one KV write), and the predictable cost is easier to budget than user-driven cost.

What this strategy cannot do

Two limitations worth flagging.

The cache cannot hide upstream silence. If Anthropic stops updating their Statuspage during an active incident, our page will display whatever the most recent state was, with the timestamp disclosing how stale it is. The user has to read the timestamp.

The cache also cannot fix incorrectness — only freshness. If Anthropic posts an incorrect incident classification, we will faithfully cache and serve that incorrect classification. We have no signal that lets us second-guess the upstream classification, and we are not trying to build one.

The cache strategy is, at its core, a discipline of being honest about what is fresh. Every TTL is documented on the Methodology page, the timestamp on every cached read is exposed somewhere in the UI, and the failure modes degrade gracefully (slow rather than wrong). That is most of what a status dashboard owes its readers.
