Learning to Choose Noise = Controlling the Driving Style

Diffusion models are strong generative policies, but fine-tuning them directly is expensive and unstable. Instead of modifying the diffusion weights, we take advantage of a key observation:
the latent noise space is much larger and more expressive than the action space.
Each noise sample w corresponds to a different latent behavior mode: some lead to smoother, more conservative trajectories, while others produce sharper or more aggressive turns.
This means that choosing the noise is equivalent to choosing the driving style.
Our method learns a lightweight policy π_w that selects the noise before the diffusion model generates actions. By operating entirely in this latent-noise space, we obtain:
- No nested backprop through the diffusion process
- No modification to the diffusion model weights
- Stable and efficient RL training
- Full control over driving behavior through noise selection
In other words, we reformulate reinforcement learning from action-space optimization to latent-noise steering. This allows us to adapt driving styles flexibly—conservative, neutral, or aggressive—while keeping the diffusion model completely frozen.
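Below is a minimal sketch of this idea: a small noise-selection network wrapped around a frozen diffusion policy. Names such as NoisePolicy and diffusion_policy.denoise are illustrative placeholders, not the actual DiffusionDrive or DSRL interfaces.

```python
# Minimal sketch (PyTorch) of wrapping a frozen diffusion policy with a learned
# noise-selection network. NoisePolicy and diffusion_policy.denoise are illustrative
# placeholders, not the actual DiffusionDrive or DSRL interfaces.
import torch
import torch.nn as nn

class NoisePolicy(nn.Module):
    """pi_w: maps observation features to a latent noise vector w (a driving style)."""
    def __init__(self, obs_dim: int, noise_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, noise_dim),
        )

    def forward(self, obs_feat: torch.Tensor) -> torch.Tensor:
        return self.net(obs_feat)

@torch.no_grad()  # inference-time rollout; the diffusion model is frozen throughout
def act(pi_w: NoisePolicy, diffusion_policy, obs_feat: torch.Tensor) -> torch.Tensor:
    w = pi_w(obs_feat)                                        # choose the latent noise
    return diffusion_policy.denoise(obs_feat, init_noise=w)   # frozen denoising rollout
```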
Diffusion Policy Finetuning via Latent Noise RL Steering

Our method builds on DiffusionDrive [4] and applies DSRL [1], which steers the latent noise instead of modifying the diffusion model itself. This design enables fine-grained control over driving behavior while keeping training stable, efficient, and entirely offline.
1. Observation Encoding
The system first extracts Bird’s-Eye-View (BEV) features from multi-camera images, along with 3 seconds of history information.
These features summarize the scene geometry, surrounding agents, and past motion, and are used as input to both the diffusion policy and the latent-noise actor.
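As a rough sketch of the interface assumed by the later code snippets, the encoder can be thought of as producing a single feature vector consumed by both π_w and the diffusion policy; encode_bev and encode_history below stand in for DiffusionDrive's actual encoders.

```python
# Illustrative sketch of the observation interface assumed by the later code snippets.
# encode_bev and encode_history stand in for DiffusionDrive's actual encoders.
import torch

def build_obs_features(multi_cam_images, history_states, encode_bev, encode_history):
    bev_feat = encode_bev(multi_cam_images)          # scene geometry and surrounding agents
    hist_feat = encode_history(history_states)       # roughly 3 s of past motion
    return torch.cat([bev_feat, hist_feat], dim=-1)  # shared input to pi_w and the diffusion policy
```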
2. Latent Noise Steering via DSRL
Instead of optimizing actions directly, we introduce a policy π_w that learns to select latent noise samples:
- Each noise vector w corresponds to a different latent behavior mode (e.g., conservative, neutral, aggressive).
- The diffusion model interprets this noise and generates an action trajectory accordingly.
During training:
- The latent-noise actor π_w is updated to maximize the estimated value of the noise it selects.
- An action critic Q^A is trained to estimate the value of the diffusion-generated actions, and its estimates provide the learning signal for the noise actor.
By performing RL in the noise space, we avoid backpropagating through the diffusion model and do not modify its weights.
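A hedged sketch of one possible training step is shown below. It follows the structure described above (an action critic Q^A plus a latent-noise actor); the noise-space critic Q_W, the distillation step, and all function names are assumptions made for illustration, and the exact DSRL objectives may differ in detail.

```python
# Hedged sketch of one training step for noise-space RL (PyTorch). Q_A, Q_W, pi_w,
# diffusion.denoise, and diffusion.noise_dim are illustrative names, not the exact
# DSRL implementation; target networks and other stabilizers are omitted for brevity.
import torch
import torch.nn.functional as F

def train_step(batch, pi_w, Q_A, Q_W, diffusion, opt_QA, opt_QW, opt_pi, gamma=0.99):
    obs, act, rew, next_obs, done = batch  # offline transitions; actions came from the diffusion policy

    # 1) Action critic Q^A: TD target built from the frozen diffusion rollout at the next state.
    with torch.no_grad():
        next_w = pi_w(next_obs)
        next_act = diffusion.denoise(next_obs, init_noise=next_w)  # frozen, no gradients
        target = rew + gamma * (1.0 - done) * Q_A(next_obs, next_act)
    qa_loss = F.mse_loss(Q_A(obs, act), target)
    opt_QA.zero_grad(); qa_loss.backward(); opt_QA.step()

    # 2) Noise critic Q^W (assumed helper): distill the value of a noise sample w from
    #    Q^A evaluated on the action the frozen diffusion model produces from w.
    w = torch.randn(obs.shape[0], diffusion.noise_dim, device=obs.device)
    with torch.no_grad():
        qa_of_w = Q_A(obs, diffusion.denoise(obs, init_noise=w))
    qw_loss = F.mse_loss(Q_W(obs, w), qa_of_w)
    opt_QW.zero_grad(); qw_loss.backward(); opt_QW.step()

    # 3) Latent-noise actor: maximize the value of the noise it selects. Gradients flow
    #    only through pi_w and Q^W, never through the diffusion model's weights.
    actor_loss = -Q_W(obs, pi_w(obs)).mean()
    opt_pi.zero_grad(); actor_loss.backward(); opt_pi.step()
    return qa_loss.item(), qw_loss.item(), actor_loss.item()
```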
3. Diffusion Policy Model
A frozen DiffusionDrive model receives:
- Observation features
- Latent noise selected by π_w
It then generates a full trajectory through its denoising process. Since the diffusion model remains unchanged, our method is:
- Compatible with any pretrained diffusion policy
- Efficient (no nested gradient computation)
- Stable compared to action-space RL
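To make the seeding step concrete, here is a generic deterministic DDIM-style denoising loop in which the selected noise w replaces the usual Gaussian draw for x_T. DiffusionDrive's actual truncated, anchor-based sampler differs, so this only illustrates, under assumed names, how w enters the frozen rollout.

```python
# Generic deterministic DDIM-style loop seeded with the selected noise w instead of a
# random Gaussian draw. DiffusionDrive's actual truncated, anchor-based sampler differs;
# this only illustrates how w enters the frozen denoising rollout.
import torch

@torch.no_grad()  # the denoiser stays frozen throughout
def denoise_from_w(denoiser, obs_feat, w, alphas_cumprod):
    """denoiser(x_t, t, obs_feat) predicts the injected noise at step t."""
    x = w  # the selected latent noise replaces x_T ~ N(0, I)
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = denoiser(x, t, obs_feat)                       # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean trajectory
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps   # deterministic update (eta = 0)
    return x  # final action trajectory
```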
4. Optional VLM Guidance
We are exploring the integration of a Vision-Language Model (VLM) to provide high-level planning cues:
- The VLM generates natural-language instructions (via chain-of-thought reasoning).
- These instructions may guide the latent-noise policy toward human-preferred behaviors.
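One way this could plug into our framework, sketched below purely as an exploratory assumption, is to condition the latent-noise policy on an embedding of the VLM's instruction; InstructionConditionedNoisePolicy and instr_emb are hypothetical names, and the VLM and its text encoder are not yet fixed.

```python
# Exploratory sketch: conditioning the latent-noise policy on a VLM instruction embedding.
# InstructionConditionedNoisePolicy and instr_emb are hypothetical; the VLM and its text
# encoder are not specified by the current method.
import torch
import torch.nn as nn

class InstructionConditionedNoisePolicy(nn.Module):
    def __init__(self, obs_dim: int, instr_dim: int, noise_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + instr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, noise_dim),
        )

    def forward(self, obs_feat: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        # instr_emb might encode a cue such as "slow down and keep to the right lane"
        return self.net(torch.cat([obs_feat, instr_emb], dim=-1))
```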
5. Improved Trajectory Quality
Because the noise space contains diverse behavioral modes, steering the noise enables:
- Smoother turns
- Safer lane changes
- More consistent trajectories
- Better generalization under different driving styles and environments
Our experiments confirm that controlling the noise leads to better driving trajectories without retraining or altering the diffusion model.