Week 07 · April 2026

Generative Modeling via Drifting - One Step Is All You Need

April 12, 2026 · by Satish K C · 7 min read
Deep Learning · Generative Models · Computer Vision

The Paper

"Generative Modeling via Drifting" was published in February 2026 by Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He - researchers at MIT and Harvard University. The central claim is that generative modeling does not need iterative inference at all. The authors introduce Drifting Models, a new paradigm where the generator's output distribution evolves during training through a learned drifting field, and at inference time, a single forward pass produces the final sample. On ImageNet 256x256, the model achieves an FID of 1.54 in latent space and 1.61 in pixel space - both state-of-the-art for one-step generators.

The Problem Before This Paper

Diffusion and flow-based models produce high-quality images but require hundreds of iterative denoising steps at inference time. DiT-XL/2 needs 500 function evaluations (NFE) to reach an FID of 2.27 on ImageNet 256x256. SiT-XL/2 with REPA needs the same 500 steps for 1.42 FID. Consistency models and distillation methods attempt to reduce step count but typically sacrifice quality - and many still require a pre-trained multi-step teacher model. GANs offer single-step generation but have historically struggled with training instability and mode collapse, with StyleGAN-XL reaching only 2.30 FID and BigGAN reaching 6.95 FID on the same benchmark. No existing paradigm cleanly delivered both one-step inference and state-of-the-art quality without auxiliary teacher models.

What They Built

Drifting Models define a drifting field V that governs how generated samples should move to better match the data distribution. The field is composed of two opposing forces: an attraction term that pulls generated samples toward nearby data points, and a repulsion term that pushes them away from other generated samples. The field is anti-symmetric, V_{p,q} = -V_{q,p}, which makes the matched state an equilibrium: setting p = q gives V_{p,p} = -V_{p,p}, so the field vanishes when the generated distribution equals the data distribution. During training, the network f_theta maps noise to samples, and the optimizer updates weights by regressing toward "drifted targets" - the current output shifted by the estimated field V. The stop-gradient on the target means gradients flow only through the prediction, so each update nudges the output along V.

V_{p,q}(x) = V+_p(x) - V-_q(x)
Loss = ||f_theta(epsilon) - stopgrad(f_theta(epsilon) + V(f_theta(epsilon)))||^2
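
The stop-gradient loss above can be illustrated with a deliberately tiny example. This is a sketch under stated assumptions, not the paper's implementation: the generator is a hypothetical one-parameter shift f_theta(eps) = theta + eps, and toy_field is a stand-in for V (a difference of means, which is anti-symmetric but only matches the first moment). The gradient is written out by hand so that the role of stopgrad is explicit: the target is treated as a constant.

```python
def mean(values):
    return sum(values) / len(values)

def toy_field(batch, data_mean):
    # Hypothetical stand-in for V: difference of means, identical for every sample.
    # It is anti-symmetric in (data, generated) but only matches the first moment.
    return data_mean - mean(batch)

def train_step(theta, eps_batch, data_mean, lr):
    """One gradient step for the toy generator f_theta(eps) = theta + eps."""
    xs = [theta + eps for eps in eps_batch]      # generator outputs
    v = toy_field(xs, data_mean)                 # estimated drift
    targets = [x + v for x in xs]                # drifted targets, held constant (stopgrad)
    residuals = [x - t for x, t in zip(xs, targets)]
    loss = mean([r * r for r in residuals])
    # d x / d theta = 1, and no gradient flows through the targets.
    grad = mean([2.0 * r for r in residuals])
    return theta - lr * grad, loss

def train(steps=100, lr=0.1, data_mean=3.0):
    eps_batch = [-1.0, -0.5, 0.0, 0.5, 1.0]      # fixed zero-mean noise batch
    theta, losses = 0.0, []
    for _ in range(steps):
        theta, loss = train_step(theta, eps_batch, data_mean, lr)
        losses.append(loss)
    return theta, losses

theta, losses = train()
print(f"theta = {theta:.6f}, first loss = {losses[0]:.3f}, last loss = {losses[-1]:.2e}")
```

Because the target is held fixed, the loss value equals the squared norm of the drift, and the gradient step moves the output in the direction of V. In an autodiff framework the same effect would come from a stop-gradient primitive such as jax.lax.stop_gradient or a detached tensor in PyTorch.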

The attraction and repulsion forces use a kernel function k(x,y) = exp(-||x-y||/tau) with softmax normalization over mini-batch samples. The architecture is a DiT-style transformer with patch size 2 operating in the latent space of a pre-trained SD-VAE encoder (32x32x4 latent resolution). A key design choice is the feature encoder - a ResNet-style MAE pre-trained on the latent space - that extracts multi-scale features for computing the drifting field. Apart from these pre-trained components, the generator trains without any teacher model, distillation, or adversarial loss.

k(x, y) = exp(-||x - y|| / tau)
Equilibrium: V_{p,q} = 0 when p = q (anti-symmetric property)
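
To make the kernel and the equilibrium condition concrete, here is a minimal pure-Python sketch of a kernel-weighted attraction/repulsion field. It assumes a mean-shift-style form, in which each term is a softmax-weighted average displacement toward a set of points; the function names, the 2-D toy points, and tau = 1.0 are illustrative choices, and the summary above says the field is computed on features from the pre-trained encoder, whereas this sketch works on raw coordinates.

```python
import math

def kernel_weights(x, points, tau):
    """Softmax over the logits -||x - y|| / tau, taken across the given points."""
    logits = [-math.dist(x, y) / tau for y in points]
    top = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_shift(x, points, tau):
    """Kernel-weighted average displacement from x toward the points."""
    weights = kernel_weights(x, points, tau)
    return [sum(w * (y[d] - x[d]) for w, y in zip(weights, points))
            for d in range(len(x))]

def drift(x, data, generated, tau=1.0):
    """V_{p,q}(x) = V+_p(x) - V-_q(x): pull toward data, push away from generated."""
    plus = mean_shift(x, data, tau)
    minus = mean_shift(x, generated, tau)
    return [a - b for a, b in zip(plus, minus)]

# Toy 2-D point sets (made up for illustration).
data = [(2.0, 0.0), (2.0, 1.0), (3.0, 0.5)]
generated = [(0.0, 0.0), (0.0, 1.0), (-1.0, 0.5)]
x = (0.5, 0.5)

v_pq = drift(x, data, generated)   # field with data as p, generated as q
v_qp = drift(x, generated, data)   # roles swapped
v_pp = drift(x, data, data)        # identical sets: attraction and repulsion cancel

print("V_pq:", v_pq)
print("V_qp:", v_qp)
print("V_pp:", v_pp)
```

Swapping the two point sets flips the sign of the field, and passing the same set for both roles gives a zero vector, which is the equilibrium condition in miniature. In this form every query point compares against every sample in both sets, so the cost grows with the product of the number of query points and the batch size.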

Key Findings

Results

On ImageNet 256x256, Drifting Model L/2 in latent space achieves FID 1.54 with Inception Score 258.9 using a single forward pass. In pixel space, the L/16 variant reaches FID 1.61 with IS 307.5 - matching PixelDiT/16 (1.61 FID at 400 steps) exactly, but in one step. For comparison, StyleGAN-XL achieves 2.30 FID and BigGAN reaches 6.95 FID, both also single-step. Among multi-step methods, DiT-XL/2 achieves 2.27 FID at 500 NFE, while SiT-XL/2 with REPA pushes to 1.42 FID at the same step count. Training scales predictably: the B/2 model improves from 3.36 FID at 100 epochs to 1.75 FID at 1280 epochs, and upgrading to L/2 at 1280 epochs reaches the final 1.54.

Why This Matters for AI and Automation

My Take

The elegance of this work is in the formulation. Rather than trying to compress a multi-step process into fewer steps (distillation) or stabilize adversarial training (GANs), Drifting Models reframe the problem entirely: let the optimizer itself be the iterative process, and let inference be a single deterministic mapping. The anti-symmetry property providing a natural equilibrium condition is a clean theoretical contribution - the ablation results showing how quickly quality degrades without it (8.46 to 177.14 FID) confirm this is not a cosmetic design choice but a structural requirement. The fact that the same framework transfers directly to robotics policy generation strengthens the claim that this is a genuine paradigm shift, not an image-specific trick.

The open question is scaling behavior. The current results use ImageNet 256x256 - a well-studied benchmark but far from the resolution and diversity demands of production text-to-image systems. Whether the drifting field formulation remains stable and effective at 1024x1024 resolution with text conditioning, and whether it can match the diversity and controllability of classifier-free guided diffusion at scale, will determine whether this paradigm moves from research milestone to production deployment. The kernel-based field computation also raises questions about mini-batch sensitivity and compute cost at very large batch sizes.

Discussion question: If one-step generators now come within roughly 0.1 FID of the best 500-step diffusion models on standard benchmarks (1.54 versus 1.42 on ImageNet 256x256), what remaining advantages - if any - do iterative methods retain that could keep them relevant in production systems?
