The FID Lottery — Quantifying Hidden Randomness in Generative Model Evaluation

Introduction

“If the lottery is an intensification of chance, a periodic infusion of chaos into the cosmos, would it not be desirable for chance to intervene at all stages of the lottery and not merely in the drawing?” — Jorge Luis Borges, The Lottery in Babylon

Open any recent image-generation paper and the central claim usually rests on a single number: the Fréchet Inception Distance (FID). FID is the closest thing image generation has to an arbiter: a half-unit shift reorders the leaderboard; a decade of recipes has been justified by gains of a single unit; and budgets in the low millions of GPU-hours hinge on which architecture lands a few decimals lower. But behind every reported FID sits a chain of pseudo-random draws (parameter initialisation, minibatch order, per-step Gaussian noise injected by the training loss, hardware stochasticity, and the initial noise drawn at sampling time), and a different seed at any link in that chain could have produced a different score. Conventional wisdom treats this variance in FID as negligible, especially for well-trained models. In this paper we show that the FID reproducibility gap is real and is a serious concern.

Each time one trains a generative model and reports its FID, one is playing two lotteries. The training lottery runs once during training: one draws an initialisation, a data ordering, and the per-step noise that the loss injects at every gradient step, and what comes out is one trained network among many that different seeds could have produced. The generation lottery runs on the trained network: one draws an initial noise x_T ∼ 𝒩(0, I) to seed the sampler, generates a sample set, and scores it. Practitioners have learned to mitigate the second lottery by redrawing the initial noise across several seeds and reporting an error bar or an averaged FID score, but no amount of resampling on a single trained network says anything about where an independently re-trained network would have landed. The training lottery stays hidden behind the one ticket we actually drew.

Diffusion makes the problem worse: the flow-matching or score-matching loss redraws a fresh Gaussian ε ∼ 𝒩(0, I) at every gradient step, so the training noise never settles. It is a permanent random injection that an independent training run would resolve differently, not transient noise that longer training averages out. Scale offers no automatic remedy: neural scaling laws characterise how the mean loss falls with parameters and tokens, leaving the seed-induced spread around that mean unspecified. Zhang et al. recently showed that independently-trained diffusion networks converge to nearly the same noise-to-image mapping. But does this also hold for the FID metric computed over a set of generated images?

The Five Sources of Randomness

Training then sampling a generative model is a chain of pseudo-random draws. The training lottery fixes one network from four coupled sources (① initialisation, ② data order, ③ per-step noise and ④ hardware drift), and the generation lottery then redraws ⑤ the initial noise for every image. Common practice accounts for only ⑤; we study all five.

Each lottery defines an axis of FID variance: a training axis of N independent training runs and a generation axis of K sampling seeds per run. To measure both, we score the resulting N×K panel of FID evaluations. The panel below renders a converged SiT-B/2: every column is one trained network, every dot is one FID evaluation, and the column-to-column spread visibly exceeds the within-column spread. On this panel we first decompose the training lottery into its independent random sources, then sweep four practitioner-controlled axes (classifier-free guidance, compute, model size, and learning rate, the last carried across widths by hyperparameter transfer) to test whether any of them tightens it. Across several hundred SiT networks from S through XL on ImageNet 256×256, the training axis dominates the generation axis at every scale we probe and none of the four knobs closes the gap; a single-seed Inception FID therefore sits on a noise floor that recipe-level gains regularly fall below. A closer look at this floor nonetheless turns up two practical handles: per-cell classifier-free-guidance tuning (GS-FID) halves the relative noise floor but reshuffles which seeds win (Spearman ρ = 0.73 between the unguided and GS-FID seed rankings), and a lucky training seed reaches the same FID with up to 2× less compute than an unlucky one.

Act I — The Two Lotteries

Training Variability Dwarfs Evaluation

Which moves your FID more: the model you happened to train, or the samples you drew from it? The two are nowhere near equal, and the larger source of variance is the one error bars rarely capture.

The Two-Axis Panel · SiT-B/2, 400k steps

25 networks (columns) × 10 sampling seeds (dots). The gold band is the training-lottery ±1σ envelope.
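
As a rough illustration of how such a panel could be summarised, the sketch below splits an N×K array of FID scores into a between-network spread and a pooled within-network spread, and derives a ±1σ envelope from the former. The function name `panel_spread` and the demo numbers are assumptions made for illustration; the demo panel is synthetic and is not data from the paper.

```python
import numpy as np

def panel_spread(panel):
    """Summarise an (N, K) FID panel: N trained networks x K sampling seeds."""
    panel = np.asarray(panel, dtype=float)
    run_means = panel.mean(axis=1)                 # one mean FID per trained network
    train_sigma = run_means.std(ddof=1)            # spread across training runs
    # Pooled spread across sampling seeds within a single network.
    eval_sigma = np.sqrt(panel.var(axis=1, ddof=1).mean())
    grand_mean = panel.mean()
    return {
        "mean": grand_mean,
        "train_sigma": train_sigma,
        "eval_sigma": eval_sigma,
        "envelope": (grand_mean - train_sigma, grand_mean + train_sigma),
    }

# Synthetic placeholder panel (NOT results from the paper): 25 networks x 10 seeds.
rng = np.random.default_rng(0)
network_offsets = rng.normal(0.0, 0.5, size=(25, 1))   # assumed training-lottery spread
sampling_jitter = rng.normal(0.0, 0.05, size=(25, 10))  # assumed generation-lottery spread
demo_panel = 30.0 + network_offsets + sampling_jitter

stats = panel_spread(demo_panel)
print(stats)
```

Note that the spread of the per-network means still contains a small share of sampling noise (roughly the within-network variance divided by K), so this simple estimate slightly overstates the pure training component.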

Act II — Three Sources

Which Source Drives the Variance?

A training run draws on three independent random sources: the initialisation, the data order, and the per-step noise of the flow-matching loss. We freeze two of them and vary the third to see how much each one moves the FID on its own.
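
One way to set up this kind of ablation is to give each random source its own seed and then sweep exactly one of them. The sketch below is a minimal version of that bookkeeping; `SeedConfig`, its field names, and `single_source_sweep` are hypothetical names, and how each seed is wired into the initialiser, data loader, and loss noise depends on the training code.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SeedConfig:
    init_seed: int    # parameter initialisation
    data_seed: int    # minibatch order
    noise_seed: int   # per-step noise drawn by the loss

SOURCES = ("init_seed", "data_seed", "noise_seed")

def single_source_sweep(base, source, n_runs):
    """Return n_runs configs that differ from `base` only in `source`."""
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    start = getattr(base, source)
    # Every other field is copied unchanged, so two sources stay frozen.
    return [replace(base, **{source: start + i}) for i in range(n_runs)]

base = SeedConfig(init_seed=0, data_seed=0, noise_seed=0)
for cfg in single_source_sweep(base, "noise_seed", 3):
    print(cfg)
```

Comparing the FID spread of each single-source sweep with the spread of a sweep that varies all three seeds together then shows how much of the full variance each source reproduces on its own.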

Between-Seed σ by Single Source

How much of the full spread does each random source reproduce on its own?

Act III — Lucky Seeds

A Lucky Seed Is Worth Compute

Train the same recipe many times and keep only the luckiest seed. That alone reaches a target FID up to 2× sooner, with no change to the model, the data, or the code.

Lucky vs. Unlucky Seed · FID over training

Every faint line is one training seed. When does the luckiest first reach what the unluckiest only attains at 2M steps?
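
The question in the caption can be answered mechanically from a set of per-seed FID curves. The sketch below assumes that "lucky" and "unlucky" are defined by the FID at the last checkpoint, which is one reasonable reading rather than a confirmed definition; `lucky_seed_speedup` is a hypothetical helper name.

```python
import numpy as np

def lucky_seed_speedup(steps, curves):
    """steps: (T,) increasing training steps; curves: (S, T) FID per seed and checkpoint.

    Returns (speedup, first_step): the step at which the luckiest seed first
    matches the unluckiest seed's final FID, and the implied compute ratio.
    """
    steps = np.asarray(steps)
    curves = np.asarray(curves, dtype=float)
    final = curves[:, -1]
    lucky = int(np.argmin(final))      # lowest final FID (assumed definition)
    unlucky = int(np.argmax(final))    # highest final FID (assumed definition)
    target = final[unlucky]
    # Non-empty by construction: the lucky seed's final FID is <= target.
    reached = np.nonzero(curves[lucky] <= target)[0]
    first_step = steps[reached[0]]
    return steps[-1] / first_step, first_step

# Tiny made-up curves, for illustration only.
speedup, first_step = lucky_seed_speedup(
    steps=[1, 2, 4],
    curves=[[10.0, 6.0, 5.0],
            [12.0, 9.0, 6.0]],
)
print(speedup, first_step)
```

The result is only as fine-grained as the checkpoint schedule: with sparse checkpoints the true crossing point lies somewhere before the reported step.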

Act IV — The Noise Floor

The 1–2% Floor That Won’t Move

Does more compute or a bigger model tighten the spread? Across every checkpoint and every size, the relative noise floor stays inside a narrow band.

Coefficient of Variation across Scale & Compute
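
For reference, the coefficient of variation is the seed-to-seed standard deviation divided by the mean, computed separately for each (model size, checkpoint) cell. A minimal sketch follows; the dictionary layout and the function names are assumptions, and the example values are made up.

```python
import numpy as np

def seed_cov(fids):
    """Coefficient of variation of FID across training seeds (sample std / mean)."""
    fids = np.asarray(fids, dtype=float)
    return fids.std(ddof=1) / fids.mean()

def cov_table(cells):
    """cells: dict mapping (model, steps) -> list of per-seed FIDs."""
    return {key: seed_cov(values) for key, values in cells.items()}

# Made-up per-seed FIDs, for illustration only.
example = {
    ("model-a", 200_000): [9.0, 10.0, 11.0],
    ("model-b", 400_000): [19.8, 20.0, 20.2],
}
for key, cov in cov_table(example).items():
    print(key, f"{100 * cov:.2f}%")
```

Because the spread is normalised by the mean, cells with very different absolute FID can be compared on the same axis.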

Act V — Guided FID

Guidance Halves the Floor, but Reshuffles the Winners

Tuning classifier-free guidance per seed pair (GS-FID) cuts the relative spread in half. But the seeds it favours are not the ones unguided FID picked.

Who Wins Changes · Unguided → GS-FID seed ranking

25 training seeds re-ranked. 8 of them move at least five places.
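
A re-ranking like this can be summarised with Spearman's rank correlation plus a count of seeds that shift by some minimum number of places. The sketch below uses the textbook no-ties formula ρ = 1 − 6Σd²/(n(n²−1)); it assumes no two seeds have exactly equal FID, and the helper names are hypothetical.

```python
import numpy as np

def ranks(scores):
    """Rank 0 = lowest (best) FID. Assumes no exact ties."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)
    r = np.empty(len(scores), dtype=int)
    r[order] = np.arange(len(scores))
    return r

def rank_shift_summary(unguided, guided, min_shift=5):
    """Return (Spearman rho, number of seeds moving at least min_shift places)."""
    ru, rg = ranks(unguided), ranks(guided)
    n = len(ru)
    d = ru - rg
    rho = 1.0 - 6.0 * float(np.sum(d ** 2)) / (n * (n ** 2 - 1))
    moved = int(np.sum(np.abs(d) >= min_shift))
    return rho, moved

# Made-up scores for four seeds, for illustration only.
rho, moved = rank_shift_summary([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], min_shift=3)
print(rho, moved)
```

If ties are possible, a library routine that handles them (for example `scipy.stats.spearmanr`) would be the safer choice.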

Act VI — Hyperparameter Transfer

The Optimum Is a Window, Not a Point

Transferred across model sizes (μP), the best learning rate is a flat-bottomed valley, not a point: a band of nearby rates that all reach essentially the same guided FID.

Learning-Rate Sweep · GS-FID

Mean GS-FID ± seed σ; overlapping error bars mark rates that are statistically tied (the window).
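
The overlap criterion in this caption can be written down directly. The sketch below treats a learning rate as tied with the best one when its ±1σ error bar overlaps the best rate's bar; this is a reading of the caption, not a confirmed test from the paper, and `tied_window` and the sweep values are illustrative.

```python
import numpy as np

def tied_window(lrs, means, sigmas):
    """Return the learning rates whose mean +/- sigma bar overlaps the best rate's bar."""
    lrs = np.asarray(lrs, dtype=float)
    means = np.asarray(means, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    best = int(np.argmin(means))
    # The best rate has the lowest mean, so two bars overlap exactly when
    # the candidate's lower end does not exceed the best rate's upper end.
    overlap = (means - sigmas) <= (means[best] + sigmas[best])
    return lrs[overlap]

# Made-up sweep, for illustration only.
window = tied_window(
    lrs=[1e-4, 2e-4, 4e-4, 8e-4],
    means=[12.0, 10.0, 10.1, 13.0],
    sigmas=[0.2, 0.2, 0.2, 0.2],
)
print(window)
```

Overlapping bars are an informal heuristic rather than a significance test, and nothing here forces the returned rates to be contiguous; a stricter version could keep only the run of adjacent rates around the best one.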

Act VII — Beyond FID

The Lottery Is Not Just FID

Is the seed lottery an FID quirk? We re-scored the same networks under six common generative-model metrics. Every one of them moves with the training seed.

Seed CoV across six metrics

Dot = median; bar = p10–p90; thin line = min–max (76 cells each).

The Cashier’s Window

Price Your Own FID

Enter a reported FID and the number of independently trained seeds that stand behind it. The house returns the 95% error bar the seed lottery hides, at the paper’s ≈1.3% floor. Add seeds and watch the bar shrink as 1/√N.
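
A back-of-the-envelope version of this calculation is sketched below. It assumes the seed-to-seed spread is roughly Gaussian with a relative standard deviation equal to the ≈1.3% floor, and that the reported FID is the mean over N independent training seeds; the function name, the default arguments, and the exact formula behind the interactive widget are assumptions.

```python
import math

def seed_error_bar(fid, n_seeds, cov=0.013, z=1.96):
    """Approximate 95% interval for a FID averaged over n_seeds training runs.

    cov: assumed relative seed-to-seed standard deviation (the ~1.3% floor).
    z:   normal quantile for a two-sided 95% interval.
    """
    if n_seeds < 1:
        raise ValueError("n_seeds must be at least 1")
    half_width = z * cov * fid / math.sqrt(n_seeds)
    return fid - half_width, fid + half_width

for n in (1, 4, 16):
    low, high = seed_error_bar(10.0, n)
    print(n, round(low, 4), round(high, 4))
```

Quadrupling the number of training seeds halves the width of the interval, which is why a single extra seed helps far less than it might seem.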

The House Rules

Advice for Practitioners

If FID is a game of chance, here is how to play it honestly.

More evaluations can’t substitute for more training runs.

Resampling a fixed network shrinks evaluation noise but leaves the dominant training variance untouched; only multi-seed training reaches below the ≈1.3% CoV floor.

Treat any FID gap below ≈2% as inconclusive.

The seed-to-seed coefficient of variation stays inside a 1–2% band across SiT-S/B/L/XL from 200k to 2M steps. Gaps below the band may just be seed noise, so comparing a gap against the band is a cheap first-pass check before you commit to multiple seeds.

Guided and unguided FID disagree on the best seeds and hyper-parameters.

GS-FID is more reliable, but its winners differ from unguided’s (ρ ≈ 0.73). Evaluate and tune with the same FID you intend to report.

Under guided FID, the best learning rate is a flat region, not a single value.

Transfer the learning rate across model sizes and GS-FID returns a window of adjacent rates that all give similar FID; seed variance blurs the optimum into a plateau.

Use golden-section search to pick the CFG scale.

GS-FID finds the per-cell optimal guidance scale in a number of evaluations that grows only logarithmically with the required precision, and gives the most reliable comparisons under CFG.
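
For readers unfamiliar with golden-section search, a generic sketch is given below. It minimises a one-dimensional function that is assumed to be unimodal on the search interval and reuses one of the two interior evaluations at each step, so every iteration costs a single new FID evaluation. The objective here is a synthetic quadratic stand-in; in practice `f` would be a hypothetical `fid_at_scale(w)` that samples with guidance scale `w` and scores the result. This is not the paper's implementation.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0  # 1/phi, about 0.618

def golden_section_min(f, lo, hi, tol=1e-2):
    """Minimise a unimodal f on [lo, hi] to within tol of the optimum."""
    c = hi - INV_PHI * (hi - lo)
    d = lo + INV_PHI * (hi - lo)
    fc, fd = f(c), f(d)
    while (hi - lo) > tol:
        if fc < fd:
            # Minimum lies in [lo, d]; the old c becomes the new d.
            hi, d, fd = d, c, fc
            c = hi - INV_PHI * (hi - lo)
            fc = f(c)
        else:
            # Minimum lies in [c, hi]; the old d becomes the new c.
            lo, c, fc = c, d, fd
            d = lo + INV_PHI * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2.0

calls = 0

def synthetic_fid(w):
    """Stand-in objective with a minimum at w = 1.7 (illustrative only)."""
    global calls
    calls += 1
    return 10.0 + (w - 1.7) ** 2

best_scale = golden_section_min(synthetic_fid, 1.0, 4.0, tol=1e-3)
print(best_scale, calls)
```

Each iteration shrinks the interval by a factor of about 0.618, so the evaluation count grows with the logarithm of (interval width / tolerance). Since FID itself is noisy at the 1–2% level, a tolerance much finer than that noise is unlikely to be meaningful.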
