Overview
This project investigates vision–language–augmented multi-modal, multi-agent motion prediction for autonomous driving. We integrate agent history, inter-agent relationships, camera and LiDAR perception, and explicit traffic-rule reasoning distilled from vision–language models (VLMs) to improve safety and robustness in complex traffic scenes. Unlike traditional motion prediction methods that rely primarily on historical trajectories or visual appearance, our approach incorporates high-level semantic and legal constraints, enabling the model to reason beyond what is directly observable. By jointly forecasting the future trajectories of all agents while respecting traffic rules and scene context, the proposed system aims to reduce collision risk, especially in ambiguous, occluded, or rare scenarios.
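
As a rough end-to-end sketch of how these signals could be combined, the following code is a minimal illustration rather than the exact architecture: the module names, layer sizes, and tensor shapes are assumptions. It fuses per-agent features from history, relations, camera, and LiDAR with a scene-level rule embedding distilled from a VLM, applies interaction-aware attention across agents, and decodes multi-modal future trajectories for all agents jointly.

    import torch
    import torch.nn as nn

    class RuleAwareJointPredictor(nn.Module):
        # Hypothetical skeleton: module names, sizes, and shapes are illustrative
        # assumptions, not the exact architecture described above.
        def __init__(self, d_model=128, num_modes=6, horizon=12):
            super().__init__()
            self.fuse = nn.Linear(4 * d_model, d_model)      # history + relations + camera + LiDAR
            self.rule_proj = nn.Linear(d_model, d_model)     # scene-level VLM-distilled rule embedding
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.interact = nn.TransformerEncoder(layer, num_layers=2)  # attention across agents
            self.head = nn.Linear(d_model, num_modes * horizon * 2)     # K modes x T steps x (x, y)
            self.num_modes, self.horizon = num_modes, horizon

        def forward(self, hist_feat, rel_feat, cam_feat, lidar_feat, rule_emb):
            # All *_feat tensors: [batch, num_agents, d_model]; rule_emb: [batch, d_model].
            agents = self.fuse(torch.cat([hist_feat, rel_feat, cam_feat, lidar_feat], dim=-1))
            agents = agents + self.rule_proj(rule_emb).unsqueeze(1)   # broadcast rules to every agent
            agents = self.interact(agents)                            # joint, interaction-aware encoding
            B, N, _ = agents.shape
            return self.head(agents).view(B, N, self.num_modes, self.horizon, 2)
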

The figure illustrates the motivation and benefits of multi-agent motion prediction with rule-aware reasoning. In challenging intersections, predicting each agent independently can lead to unsafe behaviors when critical context—such as cross-traffic priority or stop-line rules—is visually ambiguous or partially occluded. By jointly forecasting all agents’ futures, the model can account for interactions such as yielding, merging, and crossing, resulting in safer and more reliable planning. This joint reasoning produces smoother, more human-like driving behavior by reducing abrupt braking or hesitation, and enables better utilization of road infrastructure by anticipating how multiple agents will move through lanes, intersections, and crosswalks. Incorporating traffic-rule semantics further ensures that predicted trajectories remain legally compliant, even when visual cues alone are insufficient.
Why Multi-Modality?

Future motion depends on intent, interactions, and scene context that no single signal can fully capture; fusing heterogeneous modalities therefore makes prediction more accurate, more robust to occlusion and lighting changes, and more socially consistent across agents. In our model, agent history encodes maneuver intent (turning, braking, lane changes); agent relationships capture pairwise dynamics (relative position, velocity, and yaw for following, merging, and cut-ins); multi-view cameras provide rich semantics (lanes, traffic lights, signs, pedestrians, road markings); LiDAR contributes precise 3D geometry and range measurements that remain reliable at night, under glare, and in bad weather; and external knowledge (HD maps plus traffic rules) supplies road layout and right-of-way constraints that shape the set of feasible future trajectories.
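
A minimal sketch of such per-modality encoders is shown below; the input shapes, feature dimensions, and layer choices are illustrative assumptions, not the project's exact design.

    import torch
    import torch.nn as nn

    class ModalityEncoders(nn.Module):
        # Hypothetical encoders; shapes and layer choices are assumptions for illustration.
        def __init__(self, d_model=128, hist_dim=5, rel_dim=6, bev_channels=64, map_dim=32):
            super().__init__()
            # Agent history: per-step kinematics (e.g., x, y, speed, yaw, accel) -> maneuver intent.
            self.history = nn.GRU(hist_dim, d_model, batch_first=True)
            # Pairwise relations: relative position / velocity / yaw aggregated over neighbors.
            self.relation = nn.Sequential(nn.Linear(rel_dim, d_model), nn.ReLU(),
                                          nn.Linear(d_model, d_model))
            # Camera and LiDAR bird's-eye-view features pooled around each agent.
            self.camera = nn.Linear(bev_channels, d_model)
            self.lidar = nn.Linear(bev_channels, d_model)
            # HD-map / traffic-rule context (lane geometry, right-of-way attributes).
            self.map_rules = nn.Linear(map_dim, d_model)

        def forward(self, hist, rel, cam_bev, lidar_bev, map_feat):
            # hist:      [B, N, T, hist_dim]    past states per agent
            # rel:       [B, N, rel_dim]        neighbor-relative statistics per agent
            # cam_bev:   [B, N, bev_channels]   camera BEV features pooled per agent
            # lidar_bev: [B, N, bev_channels]   LiDAR BEV features pooled per agent
            # map_feat:  [B, N, map_dim]        map / rule attributes per agent
            B, N, T, D = hist.shape
            _, h = self.history(hist.reshape(B * N, T, D))   # final hidden state per agent
            hist_feat = h[-1].view(B, N, -1)
            return (hist_feat, self.relation(rel), self.camera(cam_bev),
                    self.lidar(lidar_bev), self.map_rules(map_feat))

The resulting per-agent features would then be fused and passed to an interaction-aware joint decoder such as the one sketched in the Overview.
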
Problem Statement
Autonomous driving requires accurately forecasting how multiple road agents (vehicles, pedestrians, cyclists) will move over the next few seconds to enable safe and efficient planning. However, multi-agent motion prediction is challenging because future trajectories are driven not only by each agent’s past motion, but also by interactive behaviors (yielding, merging, cut-ins), scene structure (lanes, intersections, crosswalks), and traffic-control constraints (signals, stop lines, right-of-way). These factors are often partially observed due to occlusion, long-range uncertainty, and rare corner cases, making purely pattern-based predictors brittle and prone to socially inconsistent or unsafe joint forecasts.
The problem addressed in this project is to design a multi-modal, rule-aware multi-agent forecasting model that jointly predicts all agents’ future trajectories while leveraging complementary signals from agent history and relationships, camera and LiDAR perception, HD-map context, and distilled traffic-rule semantics. The objective is to produce accurate, interaction-consistent, and legally plausible trajectory distributions under real-world uncertainty—improving downstream decision-making by reducing collision risk and avoiding rule-violating predictions in complex or ambiguous traffic scenarios.
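
One common way to supervise such joint, multi-modal forecasts (stated here as an illustrative assumption, not necessarily this project's exact training objective) is scene-level winner-take-all: the joint mode with the lowest total displacement error across all agents is regressed, and a mode classifier is trained to select it.

    import torch
    import torch.nn.functional as F

    def scene_level_wta_loss(pred, mode_logits, gt):
        # pred:        [B, N, K, T, 2]  K joint scene modes of (x, y) for N agents
        # mode_logits: [B, K]           scores for the K joint modes
        # gt:          [B, N, T, 2]     ground-truth future trajectories
        err = (pred - gt.unsqueeze(2)).norm(dim=-1)       # [B, N, K, T] per-step displacement error
        scene_err = err.mean(dim=(1, 3))                  # [B, K] averaged over agents and time
        best = scene_err.argmin(dim=1)                    # [B] best joint mode per scene
        reg = scene_err.gather(1, best.unsqueeze(1)).mean()   # regress only the winning mode
        cls = F.cross_entropy(mode_logits, best)              # train the classifier to pick it
        return reg + cls

Selecting the best mode at the scene level, rather than independently per agent, is what pushes the sampled futures toward interaction-consistent joint predictions.
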
Related Work
LLM-Augmented MTR [1] – Leverages GPT-4V with TC-MAP BEV renderings and prompts to inject traffic knowledge into motion forecasting.
DGCN_ST_LANE [2] – Lane-based multi-agent trajectory prediction that uses a dynamic graph convolutional network over lane graphs to model agent history and inter-agent interactions.
THOMAS [3] – ICLR 2022 multi-agent predictor that outputs future trajectories as heatmaps and learns a recombination module to sample scene-consistent, collision-free joint trajectories.
CASPFormer [4] – BEV-image transformer with deformable attention that performs multi-modal motion prediction from rasterized BEV context, achieving state-of-the-art results on nuScenes.
References
[1] Xiaoji Zheng, Liwu Xu, Zhijie Yan, Yuanrong Tang, Hao Zhao, Chen Zhong, Bokui Chen, and Jiangtao Gong. Large language models powered context-aware motion prediction, 2024.
[2] Kailu Wu, Xing Liu, Feiyu Bian, Yizhai Zhang, and Panfeng Huang. An integrating comprehensive trajectory prediction with risk potential field method for autonomous driving, 2024.
[3] Thomas Gilles, Stefano Sabatini, Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien Moutarde. THOMAS: Trajectory heatmap output with learned multi-agent sampling. In International Conference on Learning Representations (ICLR), 2022.
[4] Harsh Yadav, Maximilian Schaefer, Kun Zhao, and Tobias Meisen. CASPFormer: Trajectory prediction from BEV images with deformable attention. In Pattern Recognition: 27th International Conference, ICPR 2024, Kolkata, India, Proceedings, Part XVII, volume 15317 of Lecture Notes in Computer Science. Springer, 2025.
[5] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.
[6] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, and Song Han. BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation, 2024.