Future Work

1. VLM-Guided Noise Steering

We plan to integrate guidance from a vision-language model (VLM) into the noise-steering process. A key challenge is mapping natural-language instructions onto the latent noise or action space so that they actually steer trajectory generation.
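
As a rough illustration, the sketch below shows one possible coupling between an instruction embedding and the denoising loop via a gradient-based guidance term. All names here (`GuidanceScorer`, `denoise`, `instr_emb`) are hypothetical placeholders under assumed interfaces, not our implementation.

```python
# Minimal sketch of language-conditioned noise steering (hypothetical interfaces).
# Assumes a pretrained diffusion policy step `denoise(traj_t, t)` and a VLM text
# embedding `instr_emb` for the instruction.
import torch
import torch.nn as nn

class GuidanceScorer(nn.Module):
    """Scores how well a partially denoised trajectory matches an instruction embedding."""
    def __init__(self, traj_dim: int, text_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, traj: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([traj.flatten(1), instr_emb], dim=-1)).squeeze(-1)

def guided_denoise_step(traj_t, t, denoise, scorer, instr_emb, scale: float = 1.0):
    """One denoising step, nudged toward trajectories the scorer aligns with the instruction."""
    traj_t = traj_t.detach().requires_grad_(True)
    score = scorer(traj_t, instr_emb).sum()
    grad = torch.autograd.grad(score, traj_t)[0]   # direction that raises alignment
    traj_prev = denoise(traj_t, t)                 # base diffusion update
    return traj_prev + scale * grad                # steer toward the instruction
```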

2. IRL-Based Observation→Reward Network

NAVSIM is a data-driven, non-reactive evaluation framework rather than a closed-loop simulator, which limits online RL. We aim to use inverse reinforcement learning (IRL) to learn an observation-to-reward model, enabling more realistic reward signals for RL training.
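
One plausible starting point is a discriminator-style reward network in the spirit of GAIL/AIRL: expert transitions from the driving logs are labeled positive, policy rollouts negative, and the logit is reused as a learned reward. The interfaces below (`obs_dim`, `act_dim`, `RewardNet`) are assumptions for illustration only.

```python
# Hedged sketch of an observation-to-reward network trained with a GAIL/AIRL-style objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardNet(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # Scalar logit per transition; doubles as the learned reward.
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def irl_discriminator_loss(reward_net, expert_obs, expert_act, policy_obs, policy_act):
    """Expert transitions are labeled 1, policy rollouts 0."""
    expert_logit = reward_net(expert_obs, expert_act)
    policy_logit = reward_net(policy_obs, policy_act)
    return (F.binary_cross_entropy_with_logits(expert_logit, torch.ones_like(expert_logit))
            + F.binary_cross_entropy_with_logits(policy_logit, torch.zeros_like(policy_logit)))
```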

3. Diffusion Rollout for Data Augmentation

Human driving logs are biased toward high-reward behaviors. We plan to use diffusion models to roll out diverse trajectories and generate richer, more varied training data that also covers lower-reward situations.
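
A possible recipe, sketched below with hypothetical interfaces (`policy.sample`, `score_fn`), is to draw several rollouts per scene from different noise seeds and store them together with their scores, so the augmented dataset is no longer limited to expert-like behavior.

```python
# Sketch of diffusion-based trajectory augmentation (hypothetical interfaces):
# `policy.sample(obs, noise)` denoises an initial noise tensor into a trajectory,
# and `score_fn` is any reward or metric (e.g., the learned IRL reward above).
import torch

def augment_with_rollouts(policy, obs, score_fn, num_samples: int = 16,
                          horizon: int = 8, action_dim: int = 3):
    """Sample diverse candidate trajectories from fresh noise seeds and keep them,
    labeled with their scores, for later training."""
    augmented = []
    for _ in range(num_samples):
        noise = torch.randn(1, horizon, action_dim)   # new seed -> diverse rollout
        traj = policy.sample(obs, noise)
        augmented.append({"obs": obs, "traj": traj, "score": float(score_fn(obs, traj))})
    return augmented
```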

4. Exploring Alternative RL Methods (e.g., GRPO [6])

Because diffusion policies generate entire trajectories at each timestep, a group of candidate trajectories can be sampled and scored for the same scene. This makes group-relative advantage methods like GRPO a promising direction for improving learning stability and efficiency.
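
The core of GRPO's advantage estimate is simple enough to sketch: sample a group of trajectories per scene, score them, and normalize each reward by its group statistics instead of relying on a learned critic. The snippet below is a minimal illustration with made-up reward values; the scores could come from NAVSIM metrics or the learned IRL reward.

```python
# Minimal sketch of GRPO-style group-relative advantages for trajectory-level rewards.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (batch, group) scores for a group of trajectories sampled per scene.
    Each reward is normalized by its group mean/std, avoiding a learned critic."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Usage: weight each trajectory's log-likelihood (or denoising loss) by its advantage.
rewards = torch.tensor([[0.9, 0.4, 0.7, 0.1]])   # one scene, group of 4 rollouts (made-up)
adv = group_relative_advantages(rewards)          # positive for above-average rollouts
```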

5. Flow-Matching as an Alternative Action Head

Flow-matching action heads are increasingly popular in robotic manipulation, since they replace iterative denoising with a learned velocity field that can be integrated in a few steps. We plan to explore a flow-matching-based action head for autonomous driving.
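
For reference, a flow-matching action head trains a velocity field along a straight-line path from noise to the expert action and integrates it with a few Euler steps at inference. The sketch below uses hypothetical dimensions (`act_dim`, `obs_dim`, `obs_emb`) and is not tied to any particular architecture.

```python
# Hedged sketch of a flow-matching action head with a linear interpolation path.
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    def __init__(self, act_dim: int, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, x_t, obs_emb, t):
        return self.net(torch.cat([x_t, obs_emb, t], dim=-1))

def flow_matching_loss(model, actions, obs_emb):
    """Regress the velocity of a straight-line path from noise to the expert action."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    x_t = (1 - t) * noise + t * actions        # linear interpolation path
    target_v = actions - noise                 # constant velocity along that path
    pred_v = model(x_t, obs_emb, t)
    return ((pred_v - target_v) ** 2).mean()

def sample_actions(model, obs_emb, act_dim, steps: int = 10):
    """Integrate the learned velocity field from noise to an action with Euler steps."""
    x = torch.randn(obs_emb.shape[0], act_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs_emb.shape[0], 1), i * dt)
        x = x + dt * model(x, obs_emb, t)
    return x
```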