A flat datacenter network has no hierarchy to blame
James Hamilton’s June post on flat datacenter networks reports numbers that should end the fat-tree era: 69% fewer routers, 33% higher throughput, 40% less power, 27% lower operating cost, and a failure curve where losing 1% of routers costs roughly 1% of capacity. The arXiv paper behind it — Bernardi, Mahajan, Seshadhri and colleagues, “RNG: Flat Datacenter Networks at Scale” (arXiv:2604.15261) — reports the design as the new default network for most workloads at Amazon. Then the post and the paper both name the price of all that: paths through a random graph are less predictable than paths through a tree, which makes troubleshooting harder.
That last clause is a fault-localization problem. I had already built a fault localizer — for GPU clusters, not networks — and the question was whether it stayed relevant pointed at a network it was not designed for.
Why the topology stops helping you
In a fat-tree, the topology is the diagnostic. Hop distance and layer position tell you where a fault lives: a sick spine switch takes down a predictable cone of traffic, and the shape of the damage points back up the hierarchy at the cause. That works because the tree forces traffic through a small number of known choke points.
RNG removes the choke points on purpose. The paper’s structure is a quasi-random expander of degree 64, maximum path length 5, with more than 50 edge-disjoint paths between any endpoint pair, and a forwarding scheme called Spraypoint that sprays each flow across all of them. The property that makes the network cheap and survivable — every pair reachable many ways, no spine left to lose — is the same property that erases the topological signal you would use to find a fault. Lose 1% of routers and you lose 1% of capacity; the network will not tell you which 1%. The paper says so directly, and I confirmed it by reading the construction: hop distance, the thing a tree-shaped monitor leans on, is structurally dead.
So the cheap network ships with one operational bill attached. When something degrades, the topology that used to answer “where” has nothing to say.
The part that transferred for free
Tessera was built for GPU clusters: shard observability across thousands of correlated shards. I took the layer that does the localizing — designed for cluster shards — and pointed it at a network. That layer never relied on hop distance or hierarchy, the cluster’s or the network’s. It localizes to a shared-resource fault domain by watching many monitored entities at once, finding where they stop agreeing, and attributing the disagreement to the physical resource they have in common — a tomographic solver over a fault-domain incidence hypergraph, not a walk over the wiring. None of that asks how many hops apart two things are.
That made the bet cheap to state. Repoint the leaf from “cluster shard” to “network path-class,” repoint the fault domain from “host or rack” to “shuffler, panel, or room,” and the same engine should localize a network fault on a topology where hop distance means nothing. The reusable idea is not the GPU plumbing. It is a bank of statistical detectors that share one alpha budget and hold the false-positive rate at a guaranteed level across however many entities they watch, and that idea does not care whether the entities are shards or path-classes. What Tessera-RNG had to add was the part specific to this problem: a model of the RNG fabric, the per-path-class signal contract, and a tomographic solver that swaps the engine’s hop-distance search for a noisy-OR set-cover over the incidence hypergraph. Whether that combination actually localized a fault was the open question.
It transferred. On a synthetic fabric built from the paper’s published numbers at full scale — 960 ToRs, about 1,456 monitored path-class leaves, roughly 514,000 weighted incidence edges — a single shuffler driving a common-mode shift ranks first among suspected fault domains, a clean fabric selects nothing under FDR control, and the whole pipeline is byte-identical on replay. The engine’s own hop-distance localizer would have returned noise; the set-cover over physical incidence returns the shuffler.
The detector fires before a threshold would
A path-class emits a handful of signals per tick, and you could read them as a row of dashboards with a red line on each. That reading misses the whole mechanism. A static threshold is a fixed line on one metric, checked each tick on its own: p99 latency crosses forty milliseconds, page someone. It has no memory. A signal drifting steadily toward trouble while every tick stays nominally green never trips it, and two signals that each look fine alone but have stopped moving together never trip it either. At fleet scale the static threshold has no good setting: a line tight enough to catch real shifts pages constantly, and a line loose enough to stay quiet catches nothing.
What ported over from Tessera is a different object. Each signal on each path-class is watched by a betting e-process — an anytime-valid sequential test that bets against the hypothesis that nothing has changed and compounds its winnings tick over tick, accumulating evidence instead of judging each tick in isolation. A slow drift that never crosses a single-tick line still compounds wealth in the process until it fires, and that is the literal content of catching a trend before it crosses a threshold. Because the test is anytime-valid — Ville’s inequality bounds the false-alarm probability no matter how often you look — you can evaluate it every tick. Re-check a static threshold that often and its false-alarm rate climbs with every look; the e-process carries no such penalty.
Two of the detectors see shifts that no per-metric line can. One watches the joint distribution across the signals through a covariance learned from healthy traffic, and fires when they stop co-varying the way the clean fabric taught them to — a correlation flip with every marginal mean and variance unchanged, invisible by construction to any threshold on any single metric. The other watches for periodicity, an oscillation that develops with no change in level or spread at all.
“Normal,” meanwhile, is not one number. The calibration substrate keys a separate baseline to each (hour-of-day, day-of-week, traffic-class) cell and pre-whitens temporal autocorrelation with a per-signal AR(p) model, so a detector compares against the normal for that hour and that traffic rather than a single global line. The fleet layer then combines the per-leaf e-values hierarchically and applies e-BH to control the false-discovery rate at a target q across all 10³ to 10⁶ leaves at once, under arbitrary dependence. The guarantee is on the rate of false discoveries across the whole fleet, which is what makes watching a million correlated entities tractable instead of a pager that never stops.
The assumptions the paper never published
The transfer rests on two things I should state plainly, because neither appears in the published work. The first is the fabric itself. RNG documents the topology, the routing, the ShuffleBox optics, and the deployment scale, so I rebuilt a fabric from those numbers — 960 ToRs, degree 64, the Spraypoint edge-disjoint path counts — as a synthetic stand-in. It is a simulacrum, not a copy: it reproduces the structure the paper describes, and it may still differ from Amazon’s real network in ways the paper never exposed.
The second is the telemetry, and this is the load-bearing assumption. The paper treats operations as out of scope, so there is no published signal contract — nothing that says which per-path-class measurements a real fabric emits. I chose five, one vector per path-class per tick: p99 latency, retransmit rate, loss rate, ECMP imbalance, and path completion — conventional network-health measurements. The method’s validity does not rest on the signals being exactly these five. It rests on the correlation structure underneath them: when a shared resource degrades, the path-classes that traverse it move together, and the FDR control plus the set-cover invert that co-movement back to the resource. Any telemetry carrying the same co-movement would serve.
So the honest status is narrow: the math is FDR-controlled and the fault domain is identifiable on a simulacrum of the published fabric, under a telemetry contract that is plausible but unconfirmed. That is unfalsified, not validated. The one thing that would move it in either direction is a trace from a real fabric — which is exactly what the paper did not publish.
Where the data contradicted me
The parts worth writing down are the places the synthetic fabric refused to agree with my design.
I started multi-fault localization with a binary set-cover that picks the minimal set of fault domains explaining the selected leaves. The first end-to-end test with two simultaneous faults falsified it immediately: the cover claimed a single ToR leaf, reached through a 0.1-weight incidence membership, as the sole culprit for both faults. That is the kind of answer that looks decisive and is wrong. I rebuilt the cover to score candidates by marginal log-likelihood-ratio against the already-picked set, which recovers both faults and reduces exactly to the old scorer at the first pick. The test caught the defect; I did not.
I also drafted the detection and attribution floors before measuring them, and the predictions were wrong twice. A room-level fault that “should” be attributable at the smallest injected magnitude turned out to detect on every seed yet attribute on none of them — the real boundary sits around magnitude 1.5 to 2, not 1. So the published floor reads as a reliable alarm with an unreliable culprit, because that is what the measurement said. I replaced the guesses with the observed numbers and let the uncomfortable one stand.
The third was a fuller exposure model I was sure was more correct. I built the full-support variant first and measured it, and it collapsed cross-kind localization: a binary fire-or-quiet scorer drowns when a fault domain touches 63 heavily diluted leaves at weight 1/63. I reverted it and recorded the condition under which it would be worth revisiting (a magnitude-aware member model).
What is left
The localizer cost almost nothing to move from GPUs to a network, because it never depended on which machine it was watching — only that many redundant entities should agree, and a fault domain is wherever they stop. The redundancy that makes a random-graph network cheap is the same redundancy that localization feeds on; the property the paper sells as graceful degradation is, read from the operator’s side, a dense correlation signal waiting to be inverted. The one thing I cannot buy without a live fabric is the five-signal contract the paper left out — the single measurement that would turn “unfalsified” into “validated,” and the only one that would settle whether any of this holds against a real network.