Overview
This project explores how multimodal reasoning can improve the safety of motion prediction in autonomous driving. We integrate vision, audio, and traffic rule understanding using vision-language models (VLMs) to enhance decision-making. Unlike traditional models that rely purely on visual input, our approach enables reasoning under occlusion and in rare-event conditions. By incorporating sound cues and legal knowledge, the system produces safer and more informed motion predictions. The goal is to reduce collision risk, especially in unseen or ambiguous situations, as illustrated by the following two scenarios:
Situation 1: Traffic Rule Awareness

In this scenario, the vehicle at the bottom sees a green light and no visible obstacles, so it proceeds through the intersection. Although a stop sign is clearly present, traditional models, which are trained on frequently occurring visual patterns, fail to react properly because such stop-sign-plus-occlusion combinations are rare in the training set. The car also cannot see the vehicle approaching from the left due to occlusion, which leads to a collision. Our VLM-enhanced model reasons that a stop sign imposes a legal and safety obligation to stop, even if the path ahead looks clear. By doing so, it avoids violating traffic rules and prevents collisions with unseen cross-traffic.
Situation 2: Audio Awareness

In this scenario, a vision-only model fails to detect an approaching ambulance occluded by a tree. Relying solely on visual input and a green light, the vehicle moves forward, unaware of the emergency vehicle, resulting in a collision. This highlights a key limitation in traditional systems when dealing with rare or occluded events. Our sound-aware model, enhanced by VLM reasoning, hears the siren and infers the presence of an emergency vehicle. It proactively stops before the intersection, ensuring safety even without direct visual confirmation.
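To make the audio-awareness idea concrete, the sketch below shows one way an audio cue could veto a vision-only "go" decision: a small classifier estimates the probability of a siren from a log-mel spectrogram, and the planner stops when that probability is high. This is a minimal illustration under assumed names, shapes, and thresholds (SirenClassifier, decide, the 0.5 cutoff), not the project's actual implementation.

```python
# Illustrative sketch only: an audio cue vetoing a vision-only "go" decision.
# SirenClassifier, decide(), the input shapes, and the 0.5 threshold are all
# assumptions for exposition, not the project's actual pipeline.
import torch
import torch.nn as nn

class SirenClassifier(nn.Module):
    """Tiny CNN over a log-mel spectrogram patch; outputs P(siren present)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, log_mel: torch.Tensor) -> torch.Tensor:
        # log_mel: (batch, 1, n_mels, frames)
        return self.net(log_mel)

def decide(log_mel: torch.Tensor, vision_says_go: bool, detector: SirenClassifier) -> str:
    """Stop for a likely (possibly unseen) emergency vehicle; otherwise defer to vision."""
    p_siren = detector(log_mel).mean().item()
    if p_siren > 0.5:
        return "stop_before_intersection"
    return "proceed" if vision_says_go else "stop"

# Example with a random placeholder spectrogram (one 64-mel x 200-frame clip):
print(decide(torch.randn(1, 1, 64, 200), vision_says_go=True, detector=SirenClassifier()))
```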
Problem Statement
Traditional motion prediction models in autonomous driving rely heavily on visible cues and statistically frequent patterns in training data. However, they often fail in rare but critical scenarios—such as when road users or signs are occluded, or when auditory signals like sirens are present but not seen. These limitations lead to unsafe decisions and legal violations, particularly at intersections.
We aim to enhance motion prediction by integrating vision-language models (VLMs) capable of reasoning about traffic rules and audio events. By incorporating multimodal inputs—including vision, sound, and language-grounded traffic knowledge—our model learns to infer appropriate behaviors even under occlusion or ambiguity. The objective is to generate socially compliant, legally sound, and safe motion trajectories in complex driving scenes.
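As a rough sketch of what such an integration could look like, the snippet below concatenates per-scene visual, audio, and language-grounded rule embeddings and decodes them into future waypoints. All module names and feature dimensions are illustrative assumptions rather than the project's architecture, which would likely rely on richer, attention-based fusion and multimodal trajectory heads.

```python
# Minimal sketch of the intended fusion, not the actual model: visual, audio,
# and language-grounded rule features are concatenated and decoded into future
# (x, y) waypoints. All dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalTrajectoryPredictor(nn.Module):
    def __init__(self, vis_dim=256, aud_dim=128, rule_dim=512, horizon=30):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + aud_dim + rule_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # Decode the fused scene embedding into `horizon` future (x, y) waypoints.
        self.decoder = nn.Linear(256, horizon * 2)
        self.horizon = horizon

    def forward(self, vis_feat, aud_feat, rule_feat):
        z = self.fuse(torch.cat([vis_feat, aud_feat, rule_feat], dim=-1))
        return self.decoder(z).view(-1, self.horizon, 2)

# Example with random placeholder features for a batch of 4 scenes:
model = MultimodalTrajectoryPredictor()
traj = model(torch.randn(4, 256), torch.randn(4, 128), torch.randn(4, 512))
print(traj.shape)  # torch.Size([4, 30, 2])
```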
Related Work
LLM-Augmented MTR [1] – Leverages GPT-4V with TC-Map BEV renderings and prompts to inject traffic knowledge into motion forecasting (see the prompting sketch after this list).
MotionLM [2] – Formulates multi-agent motion forecasting as autoregressive token generation using a Transformer decoder with scene-conditioned and agent-conditioned embeddings.
HiVT: Hierarchical Vector Transformer [3] – Introduces a hierarchical Transformer for multi-agent motion prediction using vectorized scene features.
HDGT: Heterogeneous Driving Graph Transformer [4] – Encodes driving scenes as heterogeneous graphs to capture agent-lane-sign interactions via type-specific Transformers.
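For intuition about the prompting strategy described for [1], here is a hypothetical sketch of how a BEV rendering and a traffic-rule question might be packaged for a VLM such as GPT-4V. The prompt wording, the payload layout, and the query_vlm callable are assumptions for illustration and do not reproduce the interface used in [1].

```python
# Hypothetical sketch of VLM prompting for traffic-rule context. The prompt
# text, payload layout, and the user-supplied query_vlm callable are assumed
# for illustration; they are not the interface used in [1].
import base64

PROMPT = (
    "You are given a bird's-eye-view rendering of an intersection. "
    "List the traffic controls that apply to the ego vehicle (stop signs, "
    "signals, right-of-way rules) and state whether the ego vehicle may proceed."
)

def build_vlm_request(bev_png_bytes: bytes) -> dict:
    """Package the BEV image and the rule question into a generic request payload."""
    return {
        "prompt": PROMPT,
        "image_base64": base64.b64encode(bev_png_bytes).decode("ascii"),
    }

def traffic_context(bev_png_bytes: bytes, query_vlm) -> str:
    """query_vlm: a callable that sends the payload to a VLM (e.g. GPT-4V) and
    returns its text answer for use as extra conditioning downstream."""
    return query_vlm(build_vlm_request(bev_png_bytes))
```

The returned text answer could then be embedded and concatenated with scene features, in the spirit of the context injection described in [1].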
References
[1] Xiaoji Zheng, Liwu Xu, Zhijie Yan, Yuanrong Tang, Hao Zhao, Chen Zhong, Bokui Chen, and Jiangtao Gong. Large language models powered context-aware motion prediction, 2024.
[2] Ari Seff, Brian Cera, Dian Chen, Mason Ng, Aurick Zhou, Nigamaa Nayakanti, Khaled S. Refaat, Rami Al-Rfou, and Benjamin Sapp. MotionLM: Multi-Agent Motion Forecasting as Language Modeling. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8545–8556, Los Alamitos, CA, USA, October 2023. IEEE Computer Society.
[3] Zikang Zhou, Luyao Ye, Jianping Wang, Kui Wu, and Kejie Lu. Hivt: Hierarchical vector transformer for multi-agent motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[4] Xiaosong Jia, Penghao Wu, Li Chen, Yu Liu, Hongyang Li, and Junchi Yan. HDGT: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023.
[5] Scott M. Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Benjamin Sapp, Charles R. Qi, Yin Zhou, Zoey Yang, Aurélien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander McCauley, Jonathon Shlens, and Dragomir Anguelov. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9690–9699, 2021.
[6] Ming-Fang Chang, John W. Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. Argoverse: 3D tracking and forecasting with rich maps. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[7] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[8] Y. Zhai, H. Zhou, X. Xu, Z. Ma, Z. Liu, and Y. Qiao. Self-supervised audio-visual learning using cross-modal contrastive learning and sound source localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1559–1568, 2021.