| Paper | Advantages | Disadvantages |
| --- | --- | --- |
| LLM-Augmented MTR[1] | 1. Introduces high-level semantic reasoning via GPT-4V. 2. Improves sample efficiency by injecting external knowledge into MTR. 3. Provides interpretable text-based scene understanding. | 1. Relies on online LLM prompting, leading to high inference cost and latency. 2. Limited robustness to rare or long-horizon interaction patterns not well captured by prompts. 3. Not optimized for dense multi-agent joint prediction at scale. |
| DGCN_ST_LANE[2] | 1. Explicitly models agent interactions with spatio-temporal graph convolutions. 2. Strong lane-aware motion constraints using HD-map information. 3. Efficient inference without large models. | 1. Lacks high-level semantic or rule reasoning. 2. Performance degrades in complex intersections with ambiguous right-of-way. 3. Limited long-horizon consistency due to local interaction modeling. |
| THOMAS[3] | 1. Unified transformer architecture for multi-agent trajectory prediction. 2. Strong global attention over agents and time. 3. Simple end-to-end design. | 1. Relies primarily on trajectory history and map cues. 2. Lacks explicit traffic-rule or semantic supervision. 3. Less interpretable and sensitive to rare corner cases. |
| CASPFormer[4] | 1. Rich scene modeling with lane-level and agent-level context fusion. 2. Strong performance on complex multi-agent scenarios. 3. Transformer-based interaction reasoning. | 1. Computationally heavy with slow inference. 2. No explicit rule or legality constraints. 3. Hard to deploy in real-time settings. |
| Ours | 1. Integrates BEV-based multi-modal perception with JEPA self-supervised learning. 2. Injects traffic-rule semantics via offline VLM distillation, enabling rule-aware prediction without test-time LLM cost. 3. Achieves strong long-horizon accuracy (low minFDE; see the sketch after Table 2) with interpretable embeddings. | 1. Current performance depends on the capacity of the VLM teacher. 2. Additional preprocessing is required for BEV rendering and rule distillation. 3. Evaluated on nuScenes, which may under-represent highly interactive scenarios. |
Table 2. Comparison Among SOTA Methods.
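For reference, minFDE in Table 2 is the final displacement error of the best of the K predicted modes. Below is a minimal numpy sketch; the array shapes, 6 modes, and 12-step horizon are illustrative choices, not our exact evaluation code.

```python
import numpy as np

def min_fde(pred_trajs: np.ndarray, gt_traj: np.ndarray) -> float:
    """Minimum final displacement error over K predicted modes.

    pred_trajs: (K, T, 2) candidate x/y trajectories for one agent.
    gt_traj:    (T, 2) ground-truth future trajectory.
    """
    # L2 distance between each mode's endpoint and the true endpoint.
    endpoint_err = np.linalg.norm(pred_trajs[:, -1] - gt_traj[-1], axis=-1)
    return float(endpoint_err.min())

# Example: 6 modes over a 12-step horizon (nuScenes-style settings).
rng = np.random.default_rng(0)
print(min_fde(rng.normal(size=(6, 12, 2)), rng.normal(size=(12, 2))))
```

Because the minimum is taken over modes, minFDE rewards covering the correct endpoint with at least one hypothesis, which is why it is the standard measure of long-horizon accuracy in multi-modal prediction.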
Summary of Methods
Prior work explores different trade-offs in multi-agent motion prediction. LLM-Augmented MTR enhances trajectory forecasting with high-level semantic reasoning from large vision–language models, improving contextual awareness but incurring high inference cost. DGCN_ST_LANE focuses on structured interaction modeling and lane constraints, offering efficiency but limited semantic reasoning. THOMAS provides a unified transformer framework for joint prediction, yet lacks interpretability and explicit rule awareness. CASPFormer achieves strong performance through rich scene modeling but suffers from heavy computation and slow inference.
In contrast, our method combines BEV-based multi-modal perception, JEPA self-supervised representation learning, and traffic-rule reasoning distilled from a vision–language model, enabling interaction-consistent, rule-aware predictions while avoiding test-time LLM usage.
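To make the JEPA component concrete, the following is a minimal sketch of a JEPA-style latent prediction step on BEV rasters. The encoder widths, masking scheme, loss, and EMA rate are illustrative assumptions, not our exact architecture: a context encoder sees a masked BEV, and a small predictor regresses the latent that an EMA teacher computes from the full BEV.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoders; real BEV backbones would be deeper (shapes are assumptions).
enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.GELU(),
                    nn.Conv2d(32, 64, 3, padding=1))
target_enc = copy.deepcopy(enc)          # stop-gradient EMA teacher
for p in target_enc.parameters():
    p.requires_grad_(False)
predictor = nn.Conv2d(64, 64, 1)

def jepa_step(bev, mask, ema=0.996):
    # bev: (B, 3, H, W) rendered BEV raster; mask: (B, 1, H, W), 1 = hidden.
    ctx = enc(bev * (1.0 - mask))        # context encoder sees masked BEV
    with torch.no_grad():
        tgt = target_enc(bev)            # teacher sees the full BEV
    # Regress the teacher's latent only where the input was masked.
    loss = (F.smooth_l1_loss(predictor(ctx), tgt, reduction="none") * mask).mean()
    # EMA update of the teacher (in practice, after the optimizer step).
    for p_t, p_c in zip(target_enc.parameters(), enc.parameters()):
        p_t.data.mul_(ema).add_(p_c.data, alpha=1.0 - ema)
    return loss

bev = torch.randn(2, 3, 64, 64)
mask = (torch.rand(2, 1, 64, 64) < 0.5).float()
print(jepa_step(bev, mask).item())
```

Because the prediction target is a latent rather than raw pixels, the encoder is pushed toward scene semantics instead of reconstructing every BEV detail, which is the property our downstream prediction head relies on.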
Conclusion and Future Directions
Our current results rely on distillation from Qwen2.5-VL-7B-Instruct, which may limit the accuracy of extracted traffic-rule semantics; leveraging a stronger teacher such as Qwen2.5-VL-32B-Instruct is a promising direction for further improvement. In addition, nuScenes presents relatively simple traffic patterns, with many scenes dominated by straight driving, which may not fully stress-test the benefits of rule-aware reasoning. Given that JEPA pretraining and traffic-rule distillation already enable strong scene understanding, it is worth investigating whether agent history and explicit relationship modalities can be reduced or removed without degrading performance. Finally, since most existing methods still depend heavily on agent history, an exciting future direction is to predict all agents’ trajectories using only ego-centric perception, pushing toward more scalable and deployable motion prediction systems.
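As a rough illustration of where the teacher enters the pipeline, the sketch below shows an offline distillation objective; the feature dimensions, cosine alignment loss, and pooling into a single scene feature are assumptions for illustration. The VLM is queried once per scene offline, its rule-description embeddings are cached as fixed targets, and only a small alignment head is trained, so no LLM runs at test time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical alignment head mapping pooled scene features into the
# teacher's text-embedding space (dimensions are illustrative).
rule_head = nn.Linear(64, 512)

def distill_loss(scene_feat, teacher_emb):
    # scene_feat:  (B, 64)  pooled BEV/JEPA scene features.
    # teacher_emb: (B, 512) precomputed, frozen VLM rule embeddings.
    student = F.normalize(rule_head(scene_feat), dim=-1)
    teacher = F.normalize(teacher_emb, dim=-1)
    return 1.0 - (student * teacher).sum(-1).mean()  # cosine alignment

# Swapping in a stronger teacher (e.g. Qwen2.5-VL-32B-Instruct) only
# changes the precomputed targets; the online model is untouched.
feat = torch.randn(4, 64)
emb = torch.randn(4, 512)
print(distill_loss(feat, emb).item())
```

Under this design, upgrading the teacher is a data-regeneration step rather than an architectural change, which is what makes the stronger-teacher direction discussed above inexpensive to explore.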