The latency widget on the homepage shows one number per country. Behind that number, we run up to three independent probe nodes per country, every five minutes. Three measurements, one display. The reduction is not a default — it is a deliberate choice of summary statistic, and different choices would tell you very different things.
This is a short post on which statistic we picked, why, and what the alternatives would have meant.
The three numbers, briefly
For each probe cycle, each country produces up to three results: one per node. A result is either a valid latency in milliseconds, or a timeout / connect-error.
The candidate summary statistics:
- Best (minimum) of valid results. What we picked.
- Median of valid results. The middle number, ignoring failures.
- Mean (arithmetic average) of valid results. The average of whatever came back.
- Worst (maximum) of valid results. The slowest probe.
- All three displayed separately. No reduction at all.
Each tells the user a different thing. Each has different failure modes.
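To make the candidates concrete, here is a minimal Python sketch of these reductions over a single probe cycle. The function name and the sample numbers are illustrative, not our production code; a result is a latency in milliseconds or None for a timeout / connect-error.

```python
from statistics import mean, median
from typing import Optional

def candidate_summaries(results: list[Optional[float]]) -> dict:
    """One probe cycle for one country: up to three results, failures as None."""
    valid = [r for r in results if r is not None]   # failures treated as missing data
    assert valid, "the all-fail case is handled separately (displayed as 'timeout')"
    return {
        "best":   min(valid),     # what the widget shows
        "median": median(valid),
        "mean":   mean(valid),
        "worst":  max(valid),
        "all":    results,        # "no reduction at all"
    }

# One node timing out while the other two are healthy:
print(candidate_summaries([203.4, 371.9, None]))
# best 203.4, median and mean 287.65, worst 371.9; the timeout only changes
# the displayed best-of number if every node fails.
```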
Why we picked best-of-three
The user-experience question we are trying to answer is: “if a person in this country tries to reach claude.ai right now, will they succeed, and how fast?”
The most useful answer is the one at the best-routed path. Internet routing is heterogeneous: a user in Singapore is not constrained to the worst route; they get whatever route their ISP picked, which is shaped by peering arrangements the user had no hand in. The best-of-three result approximates “what does this country look like when routing is working,” and that is the experience most users will actually have.
The alternative interpretations:
- Median would tell you “the middle of the three measurements.” It is a robust estimator of central tendency, but for our purpose it behaves more like worst-case-minus-one than like a typical user experience. Robustness to outliers is also not the virtue here: a single bad node is precisely what we want to filter out, not what we want the summary to track, and the minimum filters it out already.
- Mean would average the three. If one node is timing out and the other two are at 200ms, the mean either has to fold the timeout in as a huge penalty value or, with failures excluded, is computed over just two readings, where mean and median collapse to the same number anyway.
- Worst would tell you the slowest path. Useful if you wanted to model a pessimistic user. Not useful for “is the front door open.”
Best-of-three matches the question.
The cost of best-of-three
Two costs, both real.
It hides probe-side problems. If two of three nodes are dying and one is fine, our display shows the one healthy node’s number, and the reader gets no signal that the probes are degrading. We mitigate this by alerting internally on multi-node failure rates, but the public dashboard does not surface it.
It can mask gradual route degradation. If route quality is degrading on two of three paths but the third is still healthy, our display shows the third’s healthy number. This is appropriate for the user-experience question (the third path is still routable) but it is a less sensitive signal for “is something gradually getting worse” than median or mean would be.
We accept these costs because the alternative — surfacing pessimistic numbers that do not match what users experience — would generate confusion (“you say 800ms but I am seeing 200ms here”) that erodes trust in the dashboard. A pessimistic number that contradicts the user’s reality is worse than no number.
What we discard
Failures (timeouts, connect-errors) are excluded from the best-of selection. If two nodes return a valid latency and one times out, we pick the best of the two valid latencies. The timeout is treated as missing data, not as “infinitely high latency.”
This is an explicit judgment. Treating a timeout as a maximum-value reading (something like 30,000ms, the connect timeout) rather than as missing data would not usually change a minimum, but it would change what the number claims, and it would swamp any mean or median computed over the same results. It would also be the wrong claim about the country: most users there are not seeing 30-second timeouts; they are seeing 200ms via the working path.
We display “timeout” only when every node fails. That is the threshold at which the user-experience claim “the front door is open” is no longer defensible.
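In code, the display rule is small. A minimal sketch under the same assumptions as above (illustrative names, not the production implementation):

```python
from typing import Optional, Union

def displayed_latency(results: list[Optional[float]]) -> Union[float, str]:
    valid = [r for r in results if r is not None]   # drop timeouts / connect-errors
    if not valid:
        return "timeout"   # every node failed: the front-door claim no longer holds
    return min(valid)      # best of the valid results

assert displayed_latency([812.0, 201.5, None]) == 201.5
assert displayed_latency([None, None, None]) == "timeout"
```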
Multi-country aggregation
Within a single country, the rules above apply. Across countries, no aggregation: the table shows 17 separate rows, one per country, each with its own best-of-three.
Resisting the urge to aggregate across countries was a real choice. A “global average latency” number is tempting because it makes the dashboard feel like it has a single big number. We decided against it because the average would be heavily weighted by which countries we sampled, and the country list is not weighted by traffic — sampling Helsinki and São Paulo at equal weight when the population using Claude in each is wildly different would produce a number with no obvious meaning.
Better to show the rows, sort them by latency, and let the reader pick out their own country.
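A sketch of what that layout amounts to, with invented countries and numbers; each row is reduced independently, and nothing is averaged across rows:

```python
# One cycle of raw results per country (numbers invented).
cycle = {
    "Singapore": [23.1, 25.7, None],
    "Brazil":    [118.2, 121.0, 130.4],
    "Finland":   [None, None, None],
}

# Reduce each country independently, same rule as above: best of the valid
# results, or "timeout" when every node failed.
rows = {}
for country, results in cycle.items():
    valid = [r for r in results if r is not None]
    rows[country] = min(valid) if valid else "timeout"

# Valid rows first, sorted ascending by latency; timed-out rows at the bottom.
for country, value in sorted(rows.items(),
                             key=lambda kv: (kv[1] == "timeout",
                                             kv[1] if kv[1] != "timeout" else 0.0)):
    print(f"{country:<12} {value}")
```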
Multi-cycle smoothing
A separate question: should we smooth across time as well as across nodes?
We do not. The displayed number is always the most recent cycle’s best-of-three, with no rolling average over the last hour or last several cycles. This means the number flickers more than a smoothed version would — if a country’s best path briefly degrades, the table shows the degradation in the next 5-minute cycle.
The choice is responsiveness over smoothness. During a fast-moving incident, smoothing would lag the user’s experience. A user in Mumbai who is currently seeing 1500ms wants the table to say 1500ms now, not a smoothed 350ms carried over from five healthy cycles.
The cost is some visual jitter on quiet days when one cycle gets a bad routing draw. We treat that jitter as honest noise rather than something to hide.
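A toy illustration of the trade-off, using the invented Mumbai numbers from the paragraph above:

```python
# Five healthy cycles followed by an incident cycle, one best-of-three per cycle.
history_ms = [118.0, 122.0, 119.0, 121.0, 120.0, 1500.0]

latest = history_ms[-1]                        # what the table shows: 1500.0
smoothed = sum(history_ms) / len(history_ms)   # a six-cycle rolling mean: 350.0

print(f"latest={latest:.0f}ms  smoothed={smoothed:.0f}ms")
# The smoothed number lags the incident by several cycles; the latest number
# flickers more on quiet days but matches what the user is seeing right now.
```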
The general principle
Pick the summary statistic that matches the question you want the dashboard to answer. The latency widget is answering “is the front door open and fast for users in this country right now,” and best-of-three is the closest single number to that question.
If we were building a different widget — say, an SLA-tracking widget for an enterprise customer who needed worst-case bounds — we would pick a different statistic. Worst-of-three or 95th-percentile-over-time would be appropriate for that use case. Same data, different summary, different question being answered.
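A sketch of that alternative framing, with hypothetical helpers: the same raw results, reduced toward a pessimistic bound instead of a best case.

```python
import math
from typing import Optional

def worst_of_cycle(results: list[Optional[float]], timeout_ms: float = 30_000.0) -> float:
    # Pessimistic view: a failed probe counts as the full connect timeout.
    return max(timeout_ms if r is None else r for r in results)

def percentile(values: list[float], p: float) -> float:
    # Nearest-rank percentile; good enough for a sketch.
    ordered = sorted(values)
    return ordered[max(0, math.ceil(p * len(ordered)) - 1)]

# A handful of cycles for one country (numbers invented):
window = [[201.0, 215.0, None], [198.0, 221.0, 230.0], [205.0, 220.0, 260.0]]
worsts = [worst_of_cycle(c) for c in window]
print(worsts, percentile(worsts, 0.95))
# [30000.0, 230.0, 260.0] 30000.0; the single timeout dominates the SLA view,
# while the best-of view for the same cycles never rises above 205ms.
```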
The mistake is to pick a summary because it is the conventional one (medians are popular, averages feel safe, etc.) without checking whether it matches what the dashboard is for. Most “default” choices in metric dashboards are the wrong choice for at least some of the panels. Asking the question explicitly (what is this number for, and which statistic best answers it?) produces dashboards that are more honest and more useful.
For latency to claude.ai, on this dashboard, the answer is best-of-three. Everywhere else on the site, the same exercise sits behind the choices we made. Nothing on the dashboard is the result of “we just picked the default.”