“If the lottery is an intensification of chance, a periodic infusion of chaos into the cosmos, would it not be desirable for chance to intervene at all stages of the lottery and not merely in the drawing?” — Jorge Luis Borges, The Lottery in Babylon
Open any recent image-generation paper and the central claim usually rests on a single number: the Fréchet Inception Distance (FID). FID is the closest thing image generation has to an arbiter: a half-unit shift reorders the leaderboard; a decade of recipes has been justified by gains of a single FID unit; and budgets in the low millions of GPU-hours hinge on which architecture lands a few decimals lower. But behind every reported FID number sits a chain of pseudo-random draws (parameter initialisation, minibatch order, per-step Gaussian noise injected by the training loss, hardware stochasticity, and the initial noise drawn at sampling time), any of which could have produced a different score under a different seed. Conventional wisdom treats this variance in FID as negligible, especially for well-trained models. In this paper we show that it is not: the FID reproducibility gap is real and a serious concern.
Each time one trains a generative model and reports its FID, one is playing two lotteries. The training lottery runs once during training: one draws an initialisation, a data ordering, and the per-step noise that the loss injects at every gradient step, and what comes out is one trained network among many that could have been produced by different seeds. The generation lottery runs on the trained network: one draws an initial noise x_T ∼ 𝒩(0, I) to seed the sampler, generates a sample set, and scores it. Practitioners have learned to mitigate the second lottery by redrawing the initial noise across several seeds and reporting an error bar or an averaged FID score, but no amount of resampling on a single trained network says anything about where a retrained network would have landed. The training lottery stays hidden behind the one ticket we actually drew. Diffusion makes the problem worse: the flow-matching or score-matching loss redraws a fresh Gaussian ε ∼ 𝒩(0, I) at every gradient step, so the training noise never settles. It is a permanent random injection that an independent training run would resolve differently, not transient noise that longer training averages out. Scale offers no automatic remedy: neural scaling laws characterise how the mean loss falls with parameters and tokens, leaving the seed-induced spread around that mean unspecified. Zhang et al. recently showed that independently-trained diffusion networks converge to nearly the same noise-to-image mapping. But does this also hold for the FID metric computed over a set of generated images?
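To make the separate random sources concrete, here is a minimal sketch of how one might give each source its own random stream, so that one seed can be changed while the others stay bit-identical. The source names, the `make_streams` and `draw_ticket` helpers, and the toy array sizes are all hypothetical illustrations; they are not taken from the paper's code.

```python
import numpy as np

# Hypothetical names for the random sources described in the text.
SOURCES = ("init", "data_order", "loss_noise", "sampling")

def make_streams(seeds):
    """Build one independent generator per random source.

    The source index is mixed into the seed so that two sources given the
    same integer seed still produce different streams.
    """
    return {name: np.random.default_rng([i, seeds[name]])
            for i, name in enumerate(SOURCES)}

def draw_ticket(seeds, dim=4, n_data=8):
    """Toy stand-ins for what each source would draw in a real run."""
    s = make_streams(seeds)
    return {
        "init": s["init"].normal(size=dim),                 # initial weights
        "data_order": s["data_order"].permutation(n_data),  # minibatch order
        "loss_noise": s["loss_noise"].normal(size=dim),     # per-step epsilon
        "sampling": s["sampling"].normal(size=dim),         # initial noise x_T
    }

base = {"init": 0, "data_order": 0, "loss_noise": 0, "sampling": 0}
a = draw_ticket(base)
b = draw_ticket({**base, "loss_noise": 1})  # vary only the loss noise

print(np.array_equal(a["init"], b["init"]))              # same initialisation
print(np.array_equal(a["loss_noise"], b["loss_noise"]))  # different loss noise
```

Keeping one generator per source is a design choice: with a single shared generator, changing how many numbers one source consumes would silently shift every draw that follows it.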
Each lottery defines an axis of FID variance: a training axis of N independent training runs and a generation axis of K sampling seeds per run. To measure both, we score the resulting N×K panel of FID evaluations. The panel below renders a converged SiT-B/2: every column is one trained network, every dot is one FID evaluation, and the column-to-column spread already overshoots the within-column spread at a glance. On this panel we first decompose the training lottery into its independent random sources, then sweep four practitioner-controlled axes (classifier-free guidance, compute, model size, and learning rate, transferred across widths by hyperparameter transfer) to test whether any of them tightens it. Across several hundred SiT networks from S through XL on ImageNet 256×256, the training axis dominates the generation axis at every scale we probe and none of the four knobs closes the gap; a single-seed Inception FID therefore sits on a noise floor that recipe-level gains regularly fall below. A closer look at this floor turns up two practical handles nonetheless: per-cell classifier-free-guidance tuning (GS-FID) halves the relative noise floor but reshuffles which seeds win (Spearman ρ = 0.73), and a lucky training seed reaches the same FID with up to 2× less compute than an unlucky one.
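One standard way to split an N×K panel into a training component and a generation component is a one-way random-effects decomposition. The sketch below illustrates that textbook estimator; it is not necessarily the estimator used in the paper. The `decompose_panel` function is a hypothetical helper, and the panel is synthetic, with made-up spreads chosen only to exercise the code.

```python
import numpy as np

def decompose_panel(fid):
    """Split an (N runs x K sampling seeds) FID panel into variance components."""
    fid = np.asarray(fid, dtype=float)
    n_runs, n_seeds = fid.shape
    grand_mean = fid.mean()
    # Generation axis: within-run variance across sampling seeds, pooled over runs.
    var_gen = fid.var(axis=1, ddof=1).mean()
    # Run means still carry var_gen / K of sampling noise, so subtract it
    # to isolate the run-to-run (training) variance; clip at zero.
    var_of_means = fid.mean(axis=1).var(ddof=1)
    var_train = max(var_of_means - var_gen / n_seeds, 0.0)
    sd_train = float(np.sqrt(var_train))
    return {
        "mean": float(grand_mean),
        "sd_train": sd_train,
        "sd_gen": float(np.sqrt(var_gen)),
        "cov_train": sd_train / float(grand_mean),  # relative training spread
    }

# Synthetic panel: made-up numbers, NOT results from the paper.
rng = np.random.default_rng(0)
N, K = 200, 20
run_effect = rng.normal(0.0, 0.2, size=(N, 1))   # one offset per trained network
eval_noise = rng.normal(0.0, 0.05, size=(N, K))  # one offset per FID evaluation
panel = 10.0 + run_effect + eval_noise

parts = decompose_panel(panel)
print(parts)
```

The subtraction of `var_gen / n_seeds` matters most when K is small: without it, sampling noise left in the run means would be counted as training variance.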
Every reported FID is the payout of two spins: pull the training reels to draw a network, then spin the generation reels to score it. Same recipe each pull. A different number every time.
Which moves your FID more: the model you happened to train, or the samples you drew from it? The two are nowhere near equal, and the larger source of variance is the one error bars rarely capture.
A training run draws on three independent random sources: the initialisation, the data order, and the per-step noise of the flow-matching loss. We freeze two of them and vary the third to see how much each one moves the FID on its own.
Train the same recipe many times and keep only the luckiest seed. That alone reaches a target FID up to 2× sooner, with no change to the model, the data, or the code.
Does more compute or a bigger model tighten the spread? Across every checkpoint and every size, the relative noise floor stays inside a narrow band.
Tuning classifier-free guidance per seed pair (GS-FID) cuts the relative spread in half. But the seeds it favours are not the ones unguided FID picked.
Transferred across model sizes (μP), the best learning rate is a flat-bottomed valley, not a point: a band of nearby rates that all reach essentially the same guided FID.
Is the seed lottery an FID quirk? We re-scored the same networks under six common generative-model metrics. Every one of them moves with the training seed.
Enter a reported FID and how many independently-trained seeds stand behind it. The house returns the 95% error bar the seed lottery hides, at the paper’s ≈1.3% floor. Add seeds; watch it tighten as 1/√N.
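The arithmetic behind such an error bar can be sketched in a few lines. The sketch assumes that the reported FID is the mean over N independent training seeds, that seed-to-seed FID is roughly normal, and that the relative spread is the ≈1.3% floor quoted above. The function name `seed_error_bar` is hypothetical, and the page's own calculator may differ in detail.

```python
import math

def seed_error_bar(fid, n_seeds, cov=0.013, z=1.96):
    """Half-width of an approximate 95% interval on a mean FID over n_seeds runs.

    Assumes independent seeds, a normal approximation, and a known
    coefficient of variation `cov` (default: the ~1.3% floor from the text).
    """
    if n_seeds < 1:
        raise ValueError("n_seeds must be at least 1")
    return z * cov * fid / math.sqrt(n_seeds)

one_seed = seed_error_bar(2.0, 1)    # 1.96 * 0.013 * 2.0 = 0.05096
four_seeds = seed_error_bar(2.0, 4)  # half of the single-seed width

print(f"FID 2.00 with 1 seed:  +/- {one_seed:.3f}")
print(f"FID 2.00 with 4 seeds: +/- {four_seeds:.3f}")
```

Under these assumptions, a single-seed FID of 2.00 carries a half-width of about 0.05, so two recipes that differ by less than that are hard to separate without more seeds.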
If FID is a game of chance, here is how to play it honestly.
Resampling a fixed network shrinks evaluation noise but leaves the dominant training variance untouched; only multi-seed training reaches below the ≈1.3% coefficient-of-variation (CoV) floor.
The relative spread (coefficient of variation) stays inside a 1–2% band across SiT-S/B/L/XL from 200k to 2M steps. Gaps below the band may just be seed noise, which makes the band a cheap first-pass check before you commit to multiple seeds.
GS-FID is more reliable, but its winners differ from those of unguided FID (ρ ≈ 0.73). Evaluate and tune with the same FID you intend to report.
Transfer the learning rate across model sizes and GS-FID returns a window of adjacent rates that all give similar FID; seed variance blurs the optimum into a plateau.
GS-FID finds the per-cell optimal guidance scale in a logarithmic number of evaluations and gives the most reliable comparisons under CFG.
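The text does not spell out how the logarithmic evaluation count is achieved. One classical method with that property is golden-section search, which shrinks the bracket by a constant factor per evaluation, provided FID is unimodal in the guidance scale. The sketch below illustrates that technique under that assumption; whether GS-FID uses it is not confirmed by the text. The `toy_fid` objective and the search range are hypothetical stand-ins for a real FID evaluation.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # about 0.618, the golden-ratio conjugate

def golden_section_min(f, lo, hi, tol=1e-2):
    """Minimise a unimodal f on [lo, hi].

    Returns (estimated argmin, number of f evaluations). Each loop iteration
    costs one new evaluation and shrinks the bracket by a factor of INV_PHI.
    """
    c = hi - INV_PHI * (hi - lo)
    d = lo + INV_PHI * (hi - lo)
    fc, fd = f(c), f(d)
    evals = 2
    while hi - lo > tol:
        if fc < fd:
            # Minimum lies in [lo, d]; the old c becomes the new d.
            hi, d, fd = d, c, fc
            c = hi - INV_PHI * (hi - lo)
            fc = f(c)
        else:
            # Minimum lies in [c, hi]; the old d becomes the new c.
            lo, c, fc = c, d, fd
            d = lo + INV_PHI * (hi - lo)
            fd = f(d)
        evals += 1
    return (lo + hi) / 2, evals

def toy_fid(w):
    """Hypothetical smooth FID-vs-guidance curve with its minimum at w = 1.5."""
    return 2.0 + (w - 1.5) ** 2

best_w, n_evals = golden_section_min(toy_fid, 1.0, 4.0, tol=1e-2)
print(f"best guidance scale ~ {best_w:.3f} after {n_evals} evaluations")
```

A real FID curve is noisy, so unimodality holds only approximately; in practice the tolerance should not be set tighter than the evaluation noise can resolve.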