1. VLM-Guided Noise Steering
We plan to integrate instructions from a vision-language model (VLM) into the noise-steering process. A key challenge is mapping natural-language guidance to the latent noise or action space.
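A minimal sketch of one possible mapping, assuming the VLM instruction is embedded by a frozen text encoder and projected into a steering vector that biases the initial diffusion noise; the module and parameter names here (e.g., `InstructionToNoise`, `guidance_scale`) are hypothetical, not the final design.

```python
import torch
import torch.nn as nn

class InstructionToNoise(nn.Module):
    """Hypothetical module mapping a VLM/text embedding to a bias on the
    initial diffusion noise that seeds trajectory denoising."""

    def __init__(self, text_dim: int = 768, horizon: int = 8, action_dim: int = 2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, 256),
            nn.GELU(),
            nn.Linear(256, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, text_emb: torch.Tensor, guidance_scale: float = 0.3) -> torch.Tensor:
        # Standard Gaussian noise that would normally seed the diffusion sampler.
        noise = torch.randn(text_emb.shape[0], self.horizon, self.action_dim)
        # Language-conditioned steering direction in the same (horizon, action) space.
        steer = self.proj(text_emb).view(-1, self.horizon, self.action_dim)
        # Bias the initial noise toward the instructed behavior, then renormalize
        # so the sampler still sees roughly unit-variance input.
        steered = noise + guidance_scale * steer
        return steered / steered.std(dim=(1, 2), keepdim=True)


# Usage: text_emb would come from a frozen VLM / text encoder (assumption).
mapper = InstructionToNoise()
text_emb = torch.randn(4, 768)  # placeholder for VLM instruction embeddings
init_noise = mapper(text_emb)   # feed to the diffusion policy's sampler
print(init_noise.shape)         # torch.Size([4, 8, 2])
```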
2. IRL-Based Observation→Reward Network
Because NAVSIM is not a full closed-loop simulator, online RL is limited. We aim to use inverse reinforcement learning (IRL) to learn an observation-to-reward model, enabling more realistic RL training.
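A minimal sketch of one possible formulation, following an adversarial-IRL style in which a reward network is trained to score expert (human-log) observation-action pairs above policy rollouts; the network and training step below are illustrative assumptions, not the final design.

```python
import torch
import torch.nn as nn

class ObsRewardNet(nn.Module):
    """Hypothetical reward model r(o, a): scores observation-action pairs."""

    def __init__(self, obs_dim: int = 512, action_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def irl_step(reward_net, optimizer, expert_obs, expert_act, policy_obs, policy_act):
    """One discriminator-style IRL update: push expert scores up and policy
    scores down (a simplified GAIL/AIRL-like objective, no regularizers)."""
    r_expert = reward_net(expert_obs, expert_act)
    r_policy = reward_net(policy_obs, policy_act)
    loss = (nn.functional.softplus(-r_expert).mean()
            + nn.functional.softplus(r_policy).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Usage with random placeholder batches; real data would come from NAVSIM logs
# (expert) and diffusion-policy rollouts (policy).
net = ObsRewardNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
e_obs, e_act = torch.randn(32, 512), torch.randn(32, 16)
p_obs, p_act = torch.randn(32, 512), torch.randn(32, 16)
print(irl_step(net, opt, e_obs, e_act, p_obs, p_act))
```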
3. Diffusion Rollout for Data Augmentation
Human driving logs are biased toward high-reward behaviors. We plan to use diffusion models to roll out diverse trajectories and generate richer, more varied training data.
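A minimal sketch of the intended augmentation loop, assuming a trained diffusion policy exposes a `sample` method mapping an observation and a noise seed to a trajectory; `policy.sample`, `num_variants`, and the stub policy are placeholders for illustration.

```python
import torch

@torch.no_grad()
def augment_with_rollouts(policy, observations, num_variants: int = 8):
    """Roll out several trajectories per logged observation by resampling the
    diffusion noise, yielding more diverse (obs, trajectory) pairs than the
    high-reward-biased human logs alone."""
    augmented = []
    for obs in observations:
        obs_batch = obs.unsqueeze(0).repeat(num_variants, 1)
        # Different noise seeds -> different plausible trajectories for the same scene.
        noise = torch.randn(num_variants, policy.horizon, policy.action_dim)
        trajectories = policy.sample(obs_batch, noise)  # assumed API
        augmented.extend((obs, traj) for traj in trajectories)
    return augmented


class DummyDiffusionPolicy:
    """Stand-in for a trained diffusion policy (illustrative only)."""
    horizon, action_dim = 8, 2

    def sample(self, obs, noise):
        # A real policy would run iterative denoising conditioned on obs.
        return noise  # placeholder output with the correct shape


policy = DummyDiffusionPolicy()
obs_set = [torch.randn(512) for _ in range(4)]
data = augment_with_rollouts(policy, obs_set)
print(len(data))  # 4 observations x 8 variants = 32 pairs
```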
4. Exploring Alternative RL Methods (e.g., GRPO [6])
Diffusion policies generate an entire trajectory at each decision step, so a group of candidate trajectories can be sampled and scored per scene; group-advantage methods like GRPO are therefore a promising direction for improving learning stability and efficiency.
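A minimal sketch of the group-relative advantage computation at the core of GRPO, applied to a group of trajectories sampled for the same scene; the group size and the source of the rewards (e.g., the learned reward model above) are placeholders.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO-style advantages: normalize each trajectory's reward by the mean and
    std of its own group, avoiding a learned value function / critic."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)


# Usage: for each scene, sample a group of G trajectories from the diffusion
# policy, score them, then weight each trajectory's policy-gradient / denoising
# loss by its group-relative advantage.
G = 8
rewards = torch.randn(4, G)      # 4 scenes, G sampled trajectories each
advantages = group_relative_advantages(rewards)
print(advantages.shape)          # torch.Size([4, 8])
```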
5. Flow-Matching as an Alternative Action Head
Flow-matching action heads are increasingly popular in robotic manipulation. We plan to explore a flow-matching action head for autonomous driving.
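A minimal sketch of a conditional flow-matching training loss for such an action head, using the standard linear-interpolation (rectified-flow style) formulation; the velocity-network architecture, conditioning, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VelocityHead(nn.Module):
    """Hypothetical flow-matching action head: predicts the velocity field
    v(x_t, t | obs) over flattened action trajectories."""

    def __init__(self, obs_dim: int = 512, traj_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + traj_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, traj_dim),
        )

    def forward(self, x_t, t, obs):
        return self.net(torch.cat([x_t, t, obs], dim=-1))


def flow_matching_loss(head, actions, obs):
    """Conditional flow matching with x_t = (1 - t) * noise + t * actions,
    whose target velocity is simply (actions - noise)."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    x_t = (1 - t) * noise + t * actions
    target_v = actions - noise
    pred_v = head(x_t, t, obs)
    return nn.functional.mse_loss(pred_v, target_v)


# Usage with placeholder batches; at inference, trajectories are generated by
# integrating the predicted velocity field from noise (e.g., a few Euler steps).
head = VelocityHead()
actions = torch.randn(32, 16)  # flattened (horizon x action_dim) trajectories
obs = torch.randn(32, 512)
print(flow_matching_loss(head, actions, obs).item())
```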