Week 07 · Generative Modeling via Drifting · AI & Automation Chronicle

The Paper

"Generative Modeling via Drifting" was published in February 2026 by Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He - researchers at MIT and Harvard University. The central claim is that generative modeling does not need iterative inference at all. The authors introduce Drifting Models, a new paradigm where the generator's output distribution evolves during training through a learned drifting field, and at inference time, a single forward pass produces the final sample. On ImageNet 256x256, the model achieves an FID of 1.54 in latent space and 1.61 in pixel space - both state-of-the-art for one-step generators.

The Problem Before This Paper

Diffusion and flow-based models produce high-quality images but require hundreds of iterative denoising steps at inference time. DiT-XL/2 needs 500 function evaluations (NFE) to reach an FID of 2.27 on ImageNet 256x256. SiT-XL/2 with REPA needs the same 500 steps for 1.42 FID. Consistency models and distillation methods attempt to reduce step count but typically sacrifice quality - and many still require a pre-trained multi-step teacher model. GANs offer single-step generation but have historically struggled with training instability and mode collapse, with StyleGAN-XL reaching only 2.30 FID and BigGAN reaching 6.95 FID on the same benchmark. No existing paradigm cleanly delivered both one-step inference and state-of-the-art quality without auxiliary teacher models.

What They Built

Drifting Models define a drifting field V that governs how generated samples should move to better match the data distribution. The field is composed of two opposing forces: an attraction term that pulls generated samples toward nearby data points, and a repulsion term that pushes them away from other generated samples. Equilibrium follows from the field's anti-symmetric property, V_{p,q} = -V_{q,p}: setting q = p gives V_{p,p} = -V_{p,p}, so the field is zero whenever the generated distribution q equals the data distribution p. During training, the network f_theta maps noise epsilon to samples, and the optimizer updates the weights by regressing toward "drifted targets" - the current output shifted by the estimated field V and held fixed by a stop-gradient.

V_{p,q}(x) = V+_p(x) - V-_q(x)
Loss = ||f_theta(epsilon) - stopgrad(f_theta(epsilon) + V(f_theta(epsilon)))||^2
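
To make the regression target concrete, here is a minimal Python sketch of the loss above for a single sample. It is an illustration under stated assumptions, not the authors' code: the function names (drifted_target, drifting_loss, loss_grad_wrt_sample) are hypothetical, the numbers are made up, and the drift vector is supplied by hand rather than estimated from a mini-batch. The point it shows is the role of the stop-gradient: because the target is frozen, the gradient with respect to the generator output is -2V, so a descent step moves the output along V.

```python
def drifted_target(sample, drift):
    """Return sample + drift. The caller treats this as a constant (stop-gradient)."""
    return [s + v for s, v in zip(sample, drift)]


def drifting_loss(sample, target):
    """Squared L2 distance between the generator output and its frozen target."""
    return sum((s - t) ** 2 for s, t in zip(sample, target))


def loss_grad_wrt_sample(sample, target):
    """Gradient of the loss w.r.t. the sample, with the target held fixed."""
    return [2.0 * (s - t) for s, t in zip(sample, target)]


# Hypothetical values: a generator output f_theta(epsilon) and the field V there.
sample = [0.5, -1.0]
drift = [0.25, 0.5]

target = drifted_target(sample, drift)
loss = drifting_loss(sample, target)          # equals ||V||^2
grad = loss_grad_wrt_sample(sample, target)   # equals -2 * V

# One gradient-descent step on the sample itself, step size 0.5.
stepped = [s - 0.5 * g for s, g in zip(sample, grad)]
```

With a step size of 0.5 the sample lands on its drifted target. In an actual model the same gradient would be backpropagated into theta rather than applied to the sample directly.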

The attraction and repulsion forces use a kernel function k(x,y) = exp(-||x-y||/tau) with softmax normalization over mini-batch samples. The architecture is a DiT-style transformer with patch size 2 operating in the latent space of a pre-trained SD-VAE encoder (32x32x4 latent resolution). A key design choice is the feature encoder - a ResNet-style MAE pre-trained on the latent space - that extracts multi-scale features for computing the drifting field. The entire system trains end-to-end without any teacher model, distillation, or adversarial loss.

k(x, y) = exp(-||x - y|| / tau)
Equilibrium: V_{p,q} = 0 when p = q (anti-symmetric property)
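
The kernel and its softmax normalization can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's implementation: it assumes each force is a kernel-weighted average displacement toward a batch of samples (a mean-shift-style form), it works on raw 2-D coordinates instead of the multi-scale encoder features, and the function names are hypothetical.

```python
import math


def softmax_kernel_weights(x, samples, tau):
    """Softmax over the batch of logits -||x - y|| / tau, i.e. k(x, y) normalized."""
    logits = [-math.dist(x, y) / tau for y in samples]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_shift(x, samples, tau):
    """Kernel-weighted average displacement from x toward the samples."""
    weights = softmax_kernel_weights(x, samples, tau)
    return [
        sum(w * (y[d] - x[d]) for w, y in zip(weights, samples))
        for d in range(len(x))
    ]


def drifting_field(x, data_batch, generated_batch, tau=1.0):
    """Attraction toward data minus attraction toward generated samples."""
    attraction = mean_shift(x, data_batch, tau)
    repulsion = mean_shift(x, generated_batch, tau)
    return [a - r for a, r in zip(attraction, repulsion)]


# Made-up toy batches: data sits to the right of the generated samples.
data_batch = [[2.0, 0.0], [2.0, 1.0]]
generated_batch = [[0.0, 0.0], [-1.0, 0.0]]
x = [0.0, 0.0]

field = drifting_field(x, data_batch, generated_batch)
field_at_equilibrium = drifting_field(x, data_batch, data_batch)
```

In this toy setup the field at x points toward the data (positive first component), and it is zero when the same batch is passed as both data and generated samples, which is the equilibrium condition stated above.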

Key Findings

State-of-the-art one-step generation. Drifting Model L/2 achieves FID 1.54 (latent) and 1.61 (pixel) on ImageNet 256x256 with a single forward pass - beating all prior one-step methods including iMeanFlow (1.72 FID) and AdvFlow (2.38 FID).
Competitive with 500-step diffusion models. The one-step FID of 1.54 is within striking distance of SiT-XL/2+REPA (1.42 FID at 500 NFE) and LightningDiT (1.35 FID at 500 NFE), while requiring 500x fewer function evaluations.
Anti-symmetry is critical. Ablation studies show that breaking the balance between attraction and repulsion sharply degrades quality: a 1.5x attraction bias worsens FID from 8.46 to 41.05, and attraction-only training reaches 177.14 FID.
Scales beyond images to robotics. A one-step Drifting Policy matched or exceeded 100-step Diffusion Policy performance across single-stage and multi-stage robot manipulation tasks.
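
Why an attraction bias hurts can be seen in a 1-D toy, sketched below in Python. This only illustrates the mechanism and does not reproduce the paper's ablation or its FID numbers: the mean-shift-style field, the attraction_scale parameter, and the sample values are all assumptions made for the example. When the data and generated batches are identical, the balanced field is zero, but scaling the attraction term by 1.5 leaves a residual drift that keeps pushing samples even though the distributions already match.

```python
import math


def weighted_shift(x, samples, tau):
    """Average displacement from x toward samples, weighted by exp(-|x - y| / tau)."""
    weights = [math.exp(-abs(x - y) / tau) for y in samples]
    total = sum(weights)
    return sum(w * (y - x) for w, y in zip(weights, samples)) / total


def biased_field(x, data, generated, tau=1.0, attraction_scale=1.0):
    """Scaled attraction minus repulsion; attraction_scale=1.0 is the balanced case."""
    attraction = weighted_shift(x, data, tau)
    repulsion = weighted_shift(x, generated, tau)
    return attraction_scale * attraction - repulsion


# The same made-up points serve as data and generated samples, so p = q.
points = [-1.0, 0.0, 2.0]
x = 0.5

balanced = biased_field(x, points, points)
biased = biased_field(x, points, points, attraction_scale=1.5)
attraction_only = weighted_shift(x, points, 1.0)
```

Here balanced is zero, while biased equals half of the attraction-only shift: a field with no equilibrium at p = q, so training has no stable point at the correct distribution.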

Results

On ImageNet 256x256, Drifting Model L/2 in latent space achieves FID 1.54 with Inception Score 258.9 using a single forward pass. In pixel space, the L/16 variant reaches FID 1.61 with IS 307.5 - matching PixelDiT/16 (1.61 FID at 400 steps) exactly, but in one step. For comparison, StyleGAN-XL achieves 2.30 FID and BigGAN reaches 6.95 FID, both also single-step. Among multi-step methods, DiT-XL/2 achieves 2.27 FID at 500 NFE, while SiT-XL/2 with REPA pushes to 1.42 FID at the same step count. Training scales predictably: the B/2 model improves from 3.36 FID at 100 epochs to 1.75 FID at 1280 epochs, and upgrading to L/2 at 1280 epochs reaches the final 1.54.

Why This Matters for AI and Automation

Latency reduction. One-step inference puts real-time image generation within practical reach. Applications that currently run many diffusion steps per image - product image generation, design automation, synthetic data pipelines - would need 100-500x fewer network evaluations per sample at comparable benchmark FID, though the wall-clock speedup also depends on model size and hardware.
No teacher dependency. Unlike distillation approaches (consistency distillation, progressive distillation), Drifting Models train from scratch. This eliminates the need to first train an expensive multi-step teacher, simplifying the training pipeline and reducing total compute.
Robotics implications. The demonstrated transfer to robot manipulation policies suggests the paradigm generalizes beyond image synthesis. Any domain currently using diffusion-based planners or policy generators - warehouse automation, robotic assembly, autonomous navigation - could benefit from the same one-step speedup.
Kaiming He's involvement. This is a paper to watch. Kaiming He's track record (ResNet, MAE, Mask R-CNN) makes it likely that this paradigm will receive significant follow-up attention from the research community.

My Take

The elegance of this work is in the formulation. Rather than trying to compress a multi-step process into fewer steps (distillation) or stabilize adversarial training (GANs), Drifting Models reframe the problem entirely: let the optimizer itself be the iterative process, and let inference be a single deterministic mapping. The anti-symmetry property providing a natural equilibrium condition is a clean theoretical contribution - the ablation results showing how quickly quality degrades without it (8.46 to 177.14 FID) confirm this is not a cosmetic design choice but a structural requirement. The fact that the same framework transfers directly to robotics policy generation strengthens the claim that this is a genuine paradigm shift, not an image-specific trick.

The open question is scaling behavior. The current results use ImageNet 256x256 - a well-studied benchmark but far from the resolution and diversity demands of production text-to-image systems. Whether the drifting field formulation remains stable and effective at 1024x1024 resolution with text conditioning, and whether it can match the diversity and controllability of classifier-free guided diffusion at scale, will determine whether this paradigm moves from research milestone to production deployment. The kernel-based field computation also raises questions about mini-batch sensitivity and compute cost at very large batch sizes.

Discussion question: If one-step generators now match the quality of 500-step diffusion models on standard benchmarks, what remaining advantages - if any - do iterative methods retain that could keep them relevant in production systems?
