Diffusion Beats Autoregressive
in Data-Constrained Settings

Carnegie Mellon University · Lambda

Diffusion vs AR Pareto frontier in data-constrained settings.

We systematically study masked diffusion models versus autoregressive (AR) models in data-constrained settings—where training involves repeated passes over limited data. While AR models initially outperform diffusion models at low compute (particularly near the Chinchilla-optimal point), this advantage disappears as training continues beyond this regime. When data must be reused across multiple epochs, diffusion models significantly outperform AR models, achieving lower validation loss and superior downstream performance. We attribute this to implicit data augmentation: masked diffusion exposes the model to diverse token orderings and prediction tasks, unlike AR's fixed left-to-right factorization.

Abstract

Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings—where training involves repeated passes over limited data—and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR's fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm.
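The "implicit data augmentation" claim can be made concrete with a minimal sketch (ours, not the paper's code): an AR pass produces the same (prefix, next-token) pairs on every epoch, whereas each masked-diffusion step samples a fresh masking ratio and mask pattern, so repeated passes over the same sequence yield different prediction tasks.

```python
import random

def ar_targets(tokens):
    """AR objective: predict each token from the fixed left-to-right
    prefix. The (input, target) pairs are identical on every pass
    over the data."""
    return tokens[:-1], tokens[1:]

def masked_diffusion_targets(tokens, mask_id="<MASK>"):
    """Masked-diffusion objective (sketch): sample a masking ratio
    uniformly, mask a random subset of positions, and train the model
    to recover them. Every pass over the same sequence produces a
    different corruption, hence a different prediction task."""
    ratio = random.random()                        # fresh ratio each step
    mask = [random.random() < ratio for _ in tokens]
    corrupted = [mask_id if m else t for t, m in zip(tokens, mask)]
    return corrupted, tokens, mask

tokens = list(range(64))
# AR supervision never changes across epochs:
assert ar_targets(tokens) == ar_targets(tokens)
# Diffusion supervision does (identical masks are possible in
# principle but vanishingly unlikely at this length):
corrupted, targets, mask = masked_diffusion_targets(tokens)
```

This is why repetition hurts diffusion less: each epoch effectively presents an augmented view of the same data rather than an exact repeat.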

Key Findings

1. Diffusion models surpass autoregressive models given sufficient compute. Across a wide range of unique token budgets, AR models initially outperform diffusion models at low compute, but quickly saturate. Beyond a critical compute threshold, diffusion models continue improving and ultimately achieve better performance.

2. Diffusion models benefit far more from repeated data. For AR models, repeated data is nearly as effective as fresh data for only about 4 epochs; for diffusion models, this holds for up to 100 epochs.

3. Diffusion models have a much higher effective epoch count. We find \(R_D^* \approx 500\) for diffusion models versus \(R_D^* \approx 15\) for AR models, where \(R_D^*\) is the characteristic number of epochs over which repeated data retains its value; that is, diffusion models benefit from repetition over far more epochs without major degradation.

4. The critical compute point follows a power law in dataset size. We derive a closed-form expression, \(C_{\text{crit}}(U) = 2.12 \times 10^{15} \cdot U^{2.174}\), that predicts the compute budget beyond which diffusion becomes the favorable modeling choice for a dataset of \(U\) unique tokens.

5. Diffusion models yield better downstream performance. The validation loss improvements translate to consistent gains across diverse downstream language tasks.
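Findings 2 and 3 above can be tied together numerically. Under the exponential-decay parameterization common in data-constrained scaling laws (an assumed functional form here, not the paper's exact fit), the fraction of "fresh-data value" retained after repetition follows directly from \(R_D^*\):

```python
import math

def effective_data(unique_tokens, epochs, r_star):
    """Effective unique data after `epochs` passes, using the
    exponential-decay parameterization common in data-constrained
    scaling laws (an assumed functional form, not the paper's exact
    fit). The first pass is fresh; the remaining epochs - 1 repeats
    decay in value with characteristic constant r_star."""
    repeats = epochs - 1
    return unique_tokens * (1 + r_star * (1 - math.exp(-repeats / r_star)))

U = 100e6  # 100M unique tokens

# Fraction of fresh-data value retained under repetition:
ar_4     = effective_data(U, 4, 15) / (U * 4)       # ~0.93: early AR repeats still cheap
ar_100   = effective_data(U, 100, 15) / (U * 100)   # ~0.16: AR repeats nearly worthless
diff_100 = effective_data(U, 100, 500) / (U * 100)  # ~0.91: diffusion repeats stay valuable
```

With \(R_D^* \approx 15\), AR models keep roughly 93% of fresh-data value over 4 epochs but only about 16% over 100; with \(R_D^* \approx 500\), diffusion retains about 91% even at 100 epochs, consistent with the findings above.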

Validation Loss Contours: AR vs Diffusion

We analyze the trade-off between model parameters and training epochs. The contour plots below show validation loss as a function of both, for a budget of 100M unique tokens. Diffusion models achieve their best loss at 500 epochs, while AR models reach theirs at just 50: AR models begin to overfit at high epoch counts, whereas diffusion models show no signs of overfitting.

AR validation loss contour
Autoregressive: Best loss at 50 epochs (3.71)
Diffusion validation loss contour
Diffusion: Best loss at 500 epochs (3.55)

Data Value Decay with Repetition

We evaluate how the utility of unique data decays with increased repetition across different compute budgets. Diffusion models consistently exhibit a substantially slower decay rate than AR models, suggesting they are better able to extract value from repeated data.

Data value decay analysis
Decay rate of data value under repetition. Diffusion models show much slower decay, reflecting greater robustness to data repetition.

Training Curves: Robustness to Data Repetition

Training curves for different epoch counts using the same total compute show that AR models overfit with increased repetition (diverging loss curves), while diffusion models exhibit overlapping curves, indicating much greater robustness to data repetition.

AR training curves
AR: Validation loss rises with more epochs (overfitting)
Diffusion training curves
Diffusion: Curves nearly unchanged, robust to repetition

Extrapolated Training Curves

Predicted validation loss for AR models (left) and Diffusion models (right) under compute-optimal settings, extrapolated to larger compute budgets. Dotted lines indicate the hypothetical case where repeated data is as valuable as new data. For AR, this holds up to about 4 epochs; for diffusion, up to 100 epochs, showing that diffusion models are much more robust to data repetition.

Extrapolated validation loss curves for AR and Diffusion models

When to Use Diffusion over AR?

We derive a critical compute frontier that follows a power law: \(C_{\text{crit}}(U) \propto U^{2.174}\). This gives practitioners a clear guideline for when diffusion models should be preferred over autoregressive models.
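As a quick sketch of how a practitioner might apply this frontier (the function names are ours, and the units of \(U\) follow the paper's fit, which this page does not pin down):

```python
def critical_compute(U):
    """Fitted power law C_crit(U) = 2.12e15 * U**2.174: the compute
    (in FLOPs) beyond which diffusion is predicted to outperform AR
    for a dataset of U unique tokens (U in the units of the paper's
    fit, unspecified in this excerpt)."""
    return 2.12e15 * U ** 2.174

def prefer_diffusion(compute_flops, U):
    """Decision rule implied by the critical compute frontier."""
    return compute_flops > critical_compute(U)

# The superlinear exponent means doubling the dataset raises the
# crossover compute by a factor of 2**2.174, roughly 4.5x.
ratio = critical_compute(2.0) / critical_compute(1.0)
```

Because the exponent exceeds 2, the crossover compute grows faster than the square of the dataset size: modest increases in unique data push the break-even point out considerably.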

Loss gap heatmap
Loss Gap Heatmap: Red regions show where diffusion outperforms AR
Critical compute curve
Critical Compute Curve: Power law relationship \(C_{\text{crit}}(U) \propto U^{2.174}\)

Downstream Performance

We evaluate the best-performing models on diverse downstream benchmarks. Across tasks and data scales, diffusion models consistently achieve higher accuracy than their AR counterparts, validating that the data efficiency gains in validation loss translate into stronger downstream performance.

Downstream Results: Best AR vs Diffusion models in data-constrained settings
Benchmark          Random     100M unique tokens      500M unique tokens
                   Baseline   AR       Diffusion      AR       Diffusion
ARC-Easy           25.00      35.63    37.84          43.79    45.95
BoolQ              50.00      46.00    49.38          51.87    55.26
COPA               50.00      56.33    59.00          67.00    64.83
HellaSwag          25.00      27.37    30.24          32.28    35.33
PiQA               50.00      60.94    60.72          65.71    65.61
RACE               25.00      25.28    28.96          28.28    31.44
WinoGrande XL      50.00      48.87    50.97          50.61    51.51
SciQ               25.00      58.05    68.67          67.82    79.13
LAMBADA             0.00      10.91    15.19          15.07    22.30
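To see the aggregate picture, the per-benchmark numbers can be averaged with a few lines (the accuracies below are copied from the table; the averaging is our own summary, not a figure from the paper):

```python
# Accuracies from the table: (100M-AR, 100M-Diffusion, 500M-AR, 500M-Diffusion)
results = {
    "ARC-Easy":      (35.63, 37.84, 43.79, 45.95),
    "BoolQ":         (46.00, 49.38, 51.87, 55.26),
    "COPA":          (56.33, 59.00, 67.00, 64.83),
    "HellaSwag":     (27.37, 30.24, 32.28, 35.33),
    "PiQA":          (60.94, 60.72, 65.71, 65.61),
    "RACE":          (25.28, 28.96, 28.28, 31.44),
    "WinoGrande XL": (48.87, 50.97, 50.61, 51.51),
    "SciQ":          (58.05, 68.67, 67.82, 79.13),
    "LAMBADA":       (10.91, 15.19, 15.07, 22.30),
}

def mean(xs):
    return sum(xs) / len(xs)

avg_100m_ar   = mean([v[0] for v in results.values()])
avg_100m_diff = mean([v[1] for v in results.values()])
avg_500m_ar   = mean([v[2] for v in results.values()])
avg_500m_diff = mean([v[3] for v in results.values()])
# Diffusion leads on average at both unique-token budgets.
```

Diffusion trails AR on only a couple of individual tasks (COPA at 500M, PiQA marginally), while the average accuracy favors diffusion at both data scales.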

Key Takeaway

For practitioners, our takeaway is simple:

If you are compute-constrained, use autoregressive models;
if you are data-constrained, use diffusion models.