We systematically study masked diffusion models versus autoregressive (AR) models in data-constrained settings—where training involves repeated passes over limited data. While AR models initially outperform diffusion models at low compute (particularly near the Chinchilla-optimal point), this advantage disappears as training continues beyond this regime. When data must be reused across multiple epochs, diffusion models significantly outperform AR models, achieving lower validation loss and superior downstream performance. We attribute this to implicit data augmentation: masked diffusion exposes the model to diverse token orderings and prediction tasks, unlike AR's fixed left-to-right factorization.
Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings—where training involves repeated passes over limited data—and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR's fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm.
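The "implicit data augmentation" argument can be made concrete with a toy sketch. This is an illustrative simplification, not the paper's training code: `ar_training_pairs` and `masked_diffusion_example` are hypothetical helpers showing that AR yields the same fixed prediction tasks on every pass over a sequence, while masked diffusion samples a fresh masking ratio and mask pattern each time, so repeated epochs see genuinely different tasks.

```python
import random

def ar_training_pairs(tokens):
    """AR objective: a fixed set of (prefix -> next token) tasks.
    Identical on every epoch, so a repeated pass adds no task diversity."""
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

def masked_diffusion_example(tokens, rng):
    """Masked diffusion objective (simplified): sample a masking ratio
    uniformly, mask that fraction of positions, and predict the masked
    tokens from the unmasked context. Each epoch draws a fresh mask,
    acting as implicit data augmentation."""
    ratio = rng.uniform(0.0, 1.0)
    n_mask = max(1, round(ratio * len(tokens)))
    masked_pos = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = ["[MASK]" if i in masked_pos else t
                 for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_pos}
    return corrupted, targets

rng = random.Random(0)
seq = ["the", "cat", "sat", "on", "the", "mat"]
# The AR task set never changes between epochs:
assert ar_training_pairs(seq) == ar_training_pairs(seq)
# The diffusion corruption is resampled on every pass:
epoch1, _ = masked_diffusion_example(seq, rng)
epoch2, _ = masked_diffusion_example(seq, rng)
```

Under this view, each epoch of masked diffusion training draws from a distribution over conditioning sets, whereas AR training revisits the identical left-to-right factorization.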
1. Diffusion models surpass autoregressive models given sufficient compute. Across a wide range of unique token budgets, AR models initially outperform diffusion models at low compute, but quickly saturate. Beyond a critical compute threshold, diffusion models continue improving and ultimately achieve better performance.
2. Diffusion models benefit far more from repeated data. While AR models use repeated data effectively for only about 4 epochs, diffusion models can train on the same data for up to 100 epochs, with repeats remaining almost as effective as fresh data.
3. Diffusion models have a much higher effective epoch count. We find \(R_D^* \approx 500\) for diffusion models compared to \(R_D^* \approx 15\) for AR models, where \(R_D^*\) is the decay constant governing how quickly the value of repeated data diminishes; diffusion models can therefore benefit from repeated data over far more epochs without major degradation.
4. Critical compute point follows a power law with dataset size. We derive a closed-form expression \(C_{\text{crit}}(U) = 2.12 \times 10^{15} \cdot U^{2.174}\) that predicts when diffusion becomes the favorable modeling choice for any given dataset size.
5. Diffusion models yield better downstream performance. The validation loss improvements translate to consistent gains across diverse downstream language tasks.
We analyze the trade-off between model parameters and training epochs. The contour plots below show validation loss as a function of both axes for 100M unique tokens. Diffusion models achieve their best loss at 500 epochs, while AR models reach their best at just 50 epochs. AR models begin to overfit at high epoch counts, while diffusion models show no signs of overfitting.
We evaluate how the utility of unique data decays with increased repetition across different compute budgets. Diffusion models consistently exhibit a substantially slower decay rate than AR models, suggesting they are better able to extract value from repeated data.
Training curves for different epoch counts using the same total compute show that AR models overfit with increased repetition (diverging loss curves), while diffusion models exhibit overlapping curves, indicating much greater robustness to data repetition.
Predicted validation loss for AR models (left) and Diffusion models (right) under compute-optimal settings, extrapolated to larger compute budgets. Dotted lines indicate the hypothetical case where repeated data is as valuable as new data. For AR, this holds up to about 4 epochs; for diffusion, up to 100 epochs, showing that diffusion models are much more robust to data repetition.
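The \(R_D^*\) contrast can be sketched numerically. The decay form below is an assumption borrowed from the standard data-constrained scaling-law parameterization (Muennighoff et al.), \(D' = U\,(1 + R_D^*\,(1 - e^{-R_D/R_D^*}))\) with \(R_D\) repeat passes; the paper's exact fit may differ, but the \(R_D^*\) values are those reported above.

```python
import math

def effective_data(unique_tokens, epochs, r_d_star):
    """Effective unique-data equivalent of repeated training, using the
    Muennighoff-style decay form (assumed here, not the paper's exact fit):
        D' = U * (1 + R_D* * (1 - exp(-R_D / R_D*))),
    where R_D = epochs - 1 counts repeat passes beyond the first.
    When R_D << R_D*, repeated data is nearly as valuable as fresh data."""
    r_d = epochs - 1
    return unique_tokens * (1 + r_d_star * (1 - math.exp(-r_d / r_d_star)))

U = 100e6  # 100M unique tokens
# Diffusion (R_D* ~ 500): 100 epochs retain most of their nominal value.
diff_eff = effective_data(U, 100, 500)
# AR (R_D* ~ 15): value saturates after a handful of epochs.
ar_eff = effective_data(U, 100, 15)
```

With these constants, 100 epochs yield roughly 9.1B effective tokens for diffusion (about 91% of the 10B a fresh-data run would see) versus roughly 1.6B for AR, which matches the qualitative gap in the dotted-line comparison above.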
We derive a critical compute frontier that follows a power law: \(C_{\text{crit}}(U) \propto U^{2.174}\). This gives practitioners a clear guideline for when diffusion models should be preferred over autoregressive models.
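Plugging in the fitted constants from the closed-form expression above gives a one-line decision rule. Note the unit convention for \(U\) (raw tokens vs. millions of tokens) is an assumption here and should be checked against the fit:

```python
def critical_compute(unique_tokens):
    """Predicted crossover compute (FLOPs) beyond which masked diffusion
    is expected to outperform AR for a dataset of U unique tokens, using
    the fitted power law C_crit(U) = 2.12e15 * U**2.174.
    NOTE: the unit of U (tokens vs. millions of tokens) is an assumption;
    verify against the paper's fit before relying on absolute values."""
    return 2.12e15 * unique_tokens ** 2.174

def prefer_diffusion(compute_budget, unique_tokens):
    """True if the available compute exceeds the critical threshold,
    i.e. the data-constrained regime where diffusion wins."""
    return compute_budget > critical_compute(unique_tokens)
```

Because the exponent exceeds 2, the crossover compute grows faster than quadratically in dataset size: doubling the unique data raises the threshold by about \(2^{2.174} \approx 4.5\times\).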
We evaluate the best-performing models on diverse downstream benchmarks. Across tasks and data scales, diffusion models achieve higher accuracy than their AR counterparts on nearly every benchmark, validating that the data-efficiency gains in validation loss translate into stronger downstream performance.
Downstream accuracy (%) for models trained on 100M and 500M unique tokens:

| Benchmark | Random baseline | AR (100M) | Diffusion (100M) | AR (500M) | Diffusion (500M) |
|---|---|---|---|---|---|
| ARC-Easy | 25.00 | 35.63 | 37.84 | 43.79 | 45.95 |
| BoolQ | 50.00 | 46.00 | 49.38 | 51.87 | 55.26 |
| COPA | 50.00 | 56.33 | 59.00 | 67.00 | 64.83 |
| HellaSwag | 25.00 | 27.37 | 30.24 | 32.28 | 35.33 |
| PiQA | 50.00 | 60.94 | 60.72 | 65.71 | 65.61 |
| RACE | 25.00 | 25.28 | 28.96 | 28.28 | 31.44 |
| WinoGrande XL | 50.00 | 48.87 | 50.97 | 50.61 | 51.51 |
| SciQ | 25.00 | 58.05 | 68.67 | 67.82 | 79.13 |
| LAMBADA | 0.00 | 10.91 | 15.19 | 15.07 | 22.30 |
For practitioners, our takeaway is simple:
If you are compute-constrained, use autoregressive models;
if you are data-constrained, use diffusion models.