We systematically study masked diffusion models versus autoregressive (AR) models in data-constrained settings—where training involves repeated passes over limited data. While AR models initially outperform diffusion models at low compute (particularly near the Chinchilla-optimal point), this advantage disappears as training continues beyond this regime. When data must be reused across multiple epochs, diffusion models significantly outperform AR models, achieving lower validation loss and superior downstream performance. We attribute this to implicit data augmentation: masked diffusion exposes the model to diverse token orderings and prediction tasks, unlike AR's fixed left-to-right factorization.
Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings—where training involves repeated passes over limited data—and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR's fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm.
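The "implicit data augmentation" argument can be made concrete with a toy sketch. This is an illustrative simplification, not the paper's training code: `ar_training_pairs` and `masked_diffusion_example` are hypothetical helpers showing that AR yields the same fixed prediction tasks on every pass over a sequence, while masked diffusion samples a fresh masking ratio and mask pattern each time, so repeated epochs see genuinely different tasks.

```python
import random

def ar_training_pairs(tokens):
    """AR objective: a fixed set of (prefix -> next token) tasks.
    Identical on every epoch, so a repeated pass adds no task diversity."""
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

def masked_diffusion_example(tokens, rng):
    """Masked diffusion objective (simplified): sample a masking ratio
    uniformly, mask that fraction of positions, and predict the masked
    tokens from the unmasked context. Each epoch draws a fresh mask,
    acting as implicit data augmentation."""
    ratio = rng.uniform(0.0, 1.0)
    n_mask = max(1, round(ratio * len(tokens)))
    masked_pos = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = ["[MASK]" if i in masked_pos else t
                 for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_pos}
    return corrupted, targets

rng = random.Random(0)
seq = ["the", "cat", "sat", "on", "the", "mat"]
# The AR task set never changes between epochs:
assert ar_training_pairs(seq) == ar_training_pairs(seq)
# The diffusion corruption is resampled on every pass:
epoch1, _ = masked_diffusion_example(seq, rng)
epoch2, _ = masked_diffusion_example(seq, rng)
```

Under this view, each epoch of masked diffusion training draws from a distribution over conditioning sets, whereas AR training revisits the identical left-to-right factorization.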
1. Diffusion models surpass autoregressive models given sufficient compute. Across a wide range of unique token budgets, AR models initially outperform diffusion models at low compute, but quickly saturate. Beyond a critical compute threshold, diffusion models continue improving and ultimately achieve better performance.
2. Diffusion models benefit far more from repeated data. While AR models use repeated data effectively for only about 4 epochs, diffusion models can train on the same data for up to 100 epochs, with repeats remaining almost as effective as fresh data.
3. Diffusion models have a much higher effective epoch count. We find \(R_D^* \approx 500\) for diffusion models compared to \(R_D^* \approx 15\) for AR models, where \(R_D^*\) is the decay constant governing how quickly the value of repeated data diminishes; diffusion models can therefore benefit from repeated data over far more epochs without major degradation.
4. Critical compute point follows a power law with dataset size. We derive a closed-form expression \(C_{\text{crit}}(U) = 2.12 \times 10^{15} \cdot U^{2.174}\) that predicts when diffusion becomes the favorable modeling choice for any given dataset size.
5. Diffusion models yield better downstream performance. The validation loss improvements translate to consistent gains across diverse downstream language tasks.
We analyze the trade-off between model parameters and training epochs. The contour plots below show validation loss as a function of both axes for 100M unique tokens. Diffusion models achieve their best loss at 500 epochs, while AR models reach their best at just 50 epochs. AR models begin to overfit at high epoch counts, while diffusion models show no signs of overfitting.
We evaluate how the utility of unique data decays with increased repetition across different compute budgets. Diffusion models consistently exhibit a substantially slower decay rate than AR models, suggesting they are better able to extract value from repeated data.
Training curves for different epoch counts using the same total compute show that AR models overfit with increased repetition (diverging loss curves), while diffusion models exhibit overlapping curves, indicating much greater robustness to data repetition.
Predicted validation loss for AR models (left) and Diffusion models (right) under compute-optimal settings, extrapolated to larger compute budgets. Dotted lines indicate the hypothetical case where repeated data is as valuable as new data. For AR, this holds up to about 4 epochs; for diffusion, up to 100 epochs, showing that diffusion models are much more robust to data repetition.
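The \(R_D^*\) contrast can be sketched numerically. The decay form below is an assumption borrowed from the standard data-constrained scaling-law parameterization (Muennighoff et al.), \(D' = U\,(1 + R_D^*\,(1 - e^{-R_D/R_D^*}))\) with \(R_D\) repeat passes; the paper's exact fit may differ, but the \(R_D^*\) values are those reported above.

```python
import math

def effective_data(unique_tokens, epochs, r_d_star):
    """Effective unique-data equivalent of repeated training, using the
    Muennighoff-style decay form (assumed here, not the paper's exact fit):
        D' = U * (1 + R_D* * (1 - exp(-R_D / R_D*))),
    where R_D = epochs - 1 counts repeat passes beyond the first.
    When R_D << R_D*, repeated data is nearly as valuable as fresh data."""
    r_d = epochs - 1
    return unique_tokens * (1 + r_d_star * (1 - math.exp(-r_d / r_d_star)))

U = 100e6  # 100M unique tokens
# Diffusion (R_D* ~ 500): 100 epochs retain most of their nominal value.
diff_eff = effective_data(U, 100, 500)
# AR (R_D* ~ 15): value saturates after a handful of epochs.
ar_eff = effective_data(U, 100, 15)
```

With these constants, 100 epochs yield roughly 9.1B effective tokens for diffusion (about 91% of the 10B a fresh-data run would see) versus roughly 1.6B for AR, which matches the qualitative gap in the dotted-line comparison above.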
We derive a critical compute frontier that follows a power law: \(C_{\text{crit}}(U) \propto U^{2.174}\). This gives practitioners a clear guideline for when diffusion models should be preferred over autoregressive models.
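Plugging in the fitted constants from the closed-form expression above gives a one-line decision rule. Note the unit convention for \(U\) (raw tokens vs. millions of tokens) is an assumption here and should be checked against the fit:

```python
def critical_compute(unique_tokens):
    """Predicted crossover compute (FLOPs) beyond which masked diffusion
    is expected to outperform AR for a dataset of U unique tokens, using
    the fitted power law C_crit(U) = 2.12e15 * U**2.174.
    NOTE: the unit of U (tokens vs. millions of tokens) is an assumption;
    verify against the paper's fit before relying on absolute values."""
    return 2.12e15 * unique_tokens ** 2.174

def prefer_diffusion(compute_budget, unique_tokens):
    """True if the available compute exceeds the critical threshold,
    i.e. the data-constrained regime where diffusion wins."""
    return compute_budget > critical_compute(unique_tokens)
```

Because the exponent exceeds 2, the crossover compute grows faster than quadratically in dataset size: doubling the unique data raises the threshold by about \(2^{2.174} \approx 4.5\times\).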
We evaluate the best-performing models on diverse downstream benchmarks. Across tasks and data scales, diffusion models achieve higher accuracy than their AR counterparts on nearly every benchmark, validating that the data-efficiency gains in validation loss translate into stronger downstream performance.
Downstream accuracy (%) for models trained on 100M and 500M unique tokens:

| Benchmark | Random baseline | AR (100M) | Diffusion (100M) | AR (500M) | Diffusion (500M) |
|---|---|---|---|---|---|
| ARC-Easy | 25.00 | 35.63 | 37.84 | 43.79 | 45.95 |
| BoolQ | 50.00 | 46.00 | 49.38 | 51.87 | 55.26 |
| COPA | 50.00 | 56.33 | 59.00 | 67.00 | 64.83 |
| HellaSwag | 25.00 | 27.37 | 30.24 | 32.28 | 35.33 |
| PiQA | 50.00 | 60.94 | 60.72 | 65.71 | 65.61 |
| RACE | 25.00 | 25.28 | 28.96 | 28.28 | 31.44 |
| WinoGrande XL | 50.00 | 48.87 | 50.97 | 50.61 | 51.51 |
| SciQ | 25.00 | 58.05 | 68.67 | 67.82 | 79.13 |
| LAMBADA | 0.00 | 10.91 | 15.19 | 15.07 | 22.30 |
For practitioners, our takeaway is simple:
If you are compute-constrained, use autoregressive models;
if you are data-constrained, use diffusion models.