Diffusion Models for Action Generation

Diffusion models have recently achieved impressive results in action generation, spanning both robotic manipulation and autonomous driving.
On the left, Diffusion Policy represents one of the earliest successes: it reformulates control as a denoising process, generating smooth and consistent action sequences directly from image observations. By iteratively refining noisy action samples, the policy predicts future motion over a time horizon in a stable and expressive way.
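To make the denoising idea concrete, here is a minimal sketch of sampling an action sequence with a DDPM-style reverse process. The network `EpsModel`, the linear beta schedule, and all dimensions are illustrative placeholders, not Diffusion Policy's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the conditional noise-prediction network; the real
# Diffusion Policy uses a conditional U-Net or transformer over the action sequence.
class EpsModel(nn.Module):
    def __init__(self, horizon, act_dim, obs_dim):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * act_dim + obs_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, horizon * act_dim),
        )

    def forward(self, noisy_actions, t, obs):
        # noisy_actions: [B, horizon, act_dim], t: [B] diffusion step, obs: [B, obs_dim]
        x = torch.cat([noisy_actions.flatten(1), obs, t.float().unsqueeze(1)], dim=1)
        return self.net(x).view(-1, self.horizon, self.act_dim)


@torch.no_grad()
def generate_actions(eps_model, obs, horizon=16, act_dim=2, n_steps=50):
    """Iteratively refine pure noise into an action sequence (DDPM reverse process)."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    actions = torch.randn(obs.shape[0], horizon, act_dim)  # start from Gaussian noise
    for k in reversed(range(n_steps)):
        t = torch.full((obs.shape[0],), k, dtype=torch.long)
        eps = eps_model(actions, t, obs)                    # predicted noise
        # Standard DDPM posterior-mean update.
        actions = (actions - betas[k] / torch.sqrt(1.0 - alpha_bars[k]) * eps) / torch.sqrt(alphas[k])
        if k > 0:                                           # add noise except at the final step
            actions = actions + torch.sqrt(betas[k]) * torch.randn_like(actions)
    return actions  # [B, horizon, act_dim] predicted future actions


# Example: plan 16 future actions for a batch of 4 observation embeddings.
model = EpsModel(horizon=16, act_dim=2, obs_dim=32)
plan = generate_actions(model, torch.randn(4, 32))
```

In a receding-horizon setup, only the first few actions of the predicted sequence would be executed before replanning from the next observation.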
On the right, we see how diffusion models naturally capture multi-modal behaviors in driving. The learned latent distribution becomes multi-peaked, allowing the model to represent diverse possible futures—such as turning left, going straight, or turning right. This inherent multi-modality makes diffusion particularly well-suited for complex, decision-rich driving scenarios.
Drivers and road environments differ widely around the world.


Even within the same city, people exhibit diverse driving styles — from calm and conservative to aggressive and hurried — leading to very different trajectory preferences.
Across countries, the contrast becomes even larger:
cities like those in China feature dense, mixed-mode traffic with frequent interactions, while the U.S. often has wider roads, clearer lanes, and more structured flow.
These variations in driving style, traffic culture, and visual appearance create a significant distribution shift. As a result, a driving model trained in one style or region may struggle when deployed in another.
This motivates our work: enabling diffusion-based driving policies to adapt flexibly across different driver behaviors and road environments without retraining the entire model.
Comparison Between Different Finetuning Methods
| Method | Core Idea | Modified Part | Cons |
|---|---|---|---|
| Direct Optimization | Update diffusion weights to maximize reward | Model params | Expensive in diffusion (deep nested backprop), unstable, may break imitation |
| Rejection Sampling | Sample multiple actions → pick best via value | No params changed | High compute, slow for real-time |
| Residual Policy | Train small residual on top of frozen base | Add-on module | Limited improvement |
Reinforcement learning has been used in several ways to finetune diffusion-based policies, but each existing approach has notable limitations:
1. Direct Optimization
This method directly updates the diffusion model’s parameters to maximize reward (a sketch of this appears after the list). While expressive, it is extremely expensive: diffusion sampling requires deep, nested backpropagation and often suffers from unstable gradients, which can break the learned imitation behavior.
2. Rejection Sampling
This approach keeps the diffusion model frozen and instead samples many candidate actions, selecting the one with the highest value (see the sketch after this list). The downside is high computational cost, which makes it too slow for real-time autonomous driving.
3. Residual Policy
A small residual network is trained on top of the frozen diffusion model to adjust its actions slightly (sketched after this list). It is lightweight and easy to apply, but typically provides only limited performance gains.
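For the first approach, the sketch below shows why direct optimization is costly: the reward gradient has to flow back through every denoising step, so the computation graph is as deep as the whole sampling chain. It reuses the hypothetical `eps_model` and schedule from the earlier sketch, and `reward_fn` is a placeholder for whatever driving reward is being maximized.

```python
import torch

def direct_optimization_step(eps_model, reward_fn, obs, optimizer,
                             horizon=16, act_dim=2, n_steps=50):
    """One RL update that backpropagates through the full denoising chain."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # No torch.no_grad(): every denoising step stays in the computation graph.
    actions = torch.randn(obs.shape[0], horizon, act_dim)
    for k in reversed(range(n_steps)):
        t = torch.full((obs.shape[0],), k, dtype=torch.long)
        eps = eps_model(actions, t, obs)
        # Simplified deterministic reverse update (noise injection omitted for brevity).
        actions = (actions - betas[k] / torch.sqrt(1.0 - alpha_bars[k]) * eps) / torch.sqrt(alphas[k])

    loss = -reward_fn(actions, obs).mean()  # maximize reward on the final actions
    optimizer.zero_grad()
    loss.backward()                         # gradients traverse all n_steps network calls
    optimizer.step()
    return loss.item()
```

With 50 denoising steps, each update is roughly 50 forward and backward passes through the network, which is where the cost and gradient instability come from.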
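The second approach fits in a few lines, assuming a frozen sampler like `generate_actions` above and a hypothetical learned value function `value_fn(actions, obs)`.

```python
import torch

@torch.no_grad()
def rejection_sampling(eps_model, value_fn, obs, n_candidates=32, **gen_kwargs):
    """Keep the diffusion model frozen; sample K candidates and keep the highest-value one."""
    batch = obs.shape[0]
    obs_rep = obs.repeat_interleave(n_candidates, dim=0)             # [B*K, obs_dim]
    candidates = generate_actions(eps_model, obs_rep, **gen_kwargs)  # [B*K, horizon, act_dim]
    scores = value_fn(candidates, obs_rep).view(batch, n_candidates)
    best = scores.argmax(dim=1)                                      # best candidate per scene
    candidates = candidates.view(batch, n_candidates, *candidates.shape[1:])
    return candidates[torch.arange(batch), best]                     # [B, horizon, act_dim]
```

The cost is K full denoising runs per decision, which is exactly the real-time bottleneck noted above.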
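And a minimal sketch of the third approach: a small learned correction applied on top of the frozen base policy's output. The bounded `scale * tanh(...)` form is one common choice (an assumption here, not a specific published method) for keeping the residual close to the imitation behavior.

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Small RL-trained head that nudges the frozen diffusion policy's actions."""
    def __init__(self, horizon, act_dim, obs_dim, scale=0.1):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(horizon * act_dim + obs_dim, 128), nn.ReLU(),
            nn.Linear(128, horizon * act_dim),
        )
        self.scale = scale  # bounds the correction so the base behavior is preserved

    def forward(self, base_actions, obs):
        # base_actions: [B, horizon, act_dim] sampled from the frozen diffusion model
        delta = self.head(torch.cat([base_actions.flatten(1), obs], dim=1))
        return base_actions + self.scale * torch.tanh(delta).view_as(base_actions)

# Only the residual head's parameters receive RL gradients; the diffusion model stays frozen:
#   base = generate_actions(frozen_eps_model, obs)   # sampled without gradients
#   actions = residual(base, obs)                    # residual optimized against reward/value
```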
Overall, existing methods are either computationally expensive, too slow, or only mildly effective.
This gap motivates our approach: a simple, stable, and sample-efficient RL finetuning method for diffusion-based driving models.