Datasets
- WOMD [5]:
- The WOMD (Waymo Open Motion Dataset) provides comprehensive data on traffic behaviors and agent movements in urban environments. It includes sensor data, annotations, and video recordings, which are used to study and predict autonomous vehicle behaviors and pedestrian interactions in complex environments.
- This dataset is particularly useful for modeling interaction dynamics and motion forecasting under varying environmental conditions. Each scenario spans 9 seconds (1 second of history, 8 seconds of future).
 
- Argoverse [6]:
- Argoverse is a large-scale dataset used for motion prediction and traffic understanding. It provides high-definition maps, 3D LiDAR data, and visual inputs. This dataset is widely used in the research of autonomous driving systems.
- It focuses on real-world urban driving scenarios, emphasizing vehicle trajectory prediction, sensor fusion, and agent interaction modeling in dynamic urban settings. Each scenario spans 16 seconds (11 seconds of history, 5 seconds of future).
 
- AudioSet [7]:
- AudioSet is a large-scale dataset for audio event recognition, containing 2 million 10-second sound clips from YouTube videos across 527 audio event classes.
- It supports general-purpose audio classification tasks, including speech recognition, environmental sound classification, and music analysis.
 



Metrics
We evaluate model performance using three standard trajectory forecasting metrics: minADE, minFDE, and Miss Rate (MR). These metrics assess the spatial accuracy of predicted agent trajectories compared to the ground truth, based on the best of multiple rollout predictions.
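- minADE (Minimum Average Displacement Error): Measures the lowest average L2 distance between predicted and ground-truth positions over all timesteps, taken over the K rollouts. With ground-truth position y_t and the k-th rollout's prediction \hat{y}_t^{(k)} at timestep t = 1, ..., T, it is defined as
$$\mathrm{minADE} = \min_{k \in \{1,\dots,K\}} \frac{1}{T} \sum_{t=1}^{T} \left\lVert \hat{y}_t^{(k)} - y_t \right\rVert_2$$

- minFDE (Minimum Final Displacement Error): Measures the L2 distance at the final timestep T between the ground-truth position and the closest predicted endpoint among all K rollouts:
$$\mathrm{minFDE} = \min_{k \in \{1,\dots,K\}} \left\lVert \hat{y}_T^{(k)} - y_T \right\rVert_2$$

- Miss Rate (MR): Indicates whether none of the predicted endpoints falls within a 2-meter radius of the ground truth at the final timestep; the reported value is the fraction of evaluated samples that are misses. For a single sample it is defined as
$$\mathrm{MR} = \mathbb{1}\left[ \min_{k \in \{1,\dots,K\}} \left\lVert \hat{y}_T^{(k)} - y_T \right\rVert_2 > 2\,\mathrm{m} \right]$$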
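The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the official WOMD or Argoverse evaluation code: it assumes 2D positions in meters, a single agent per sample, and the fixed 2-meter miss threshold stated above. All function and variable names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]      # (x, y) position in meters (assumed 2D)
Trajectory = Sequence[Point]     # one position per timestep


def average_displacement(pred: Trajectory, gt: Trajectory) -> float:
    """Mean L2 distance between one rollout and the ground truth."""
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def min_ade(rollouts: Sequence[Trajectory], gt: Trajectory) -> float:
    """Lowest average displacement error over all rollouts."""
    return min(average_displacement(r, gt) for r in rollouts)


def min_fde(rollouts: Sequence[Trajectory], gt: Trajectory) -> float:
    """Lowest final-timestep displacement error over all rollouts."""
    return min(math.dist(r[-1], gt[-1]) for r in rollouts)


def is_miss(rollouts: Sequence[Trajectory], gt: Trajectory,
            threshold_m: float = 2.0) -> bool:
    """True if no predicted endpoint lies within threshold_m of the truth."""
    return min_fde(rollouts, gt) > threshold_m


def miss_rate(samples: Sequence[Tuple[Sequence[Trajectory], Trajectory]],
              threshold_m: float = 2.0) -> float:
    """Fraction of (rollouts, ground truth) samples that are misses."""
    misses = [is_miss(r, g, threshold_m) for r, g in samples]
    return sum(misses) / len(misses)


# Toy example: one ground-truth track and two candidate rollouts.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
rollout_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 3.0)]
rollout_b = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]

print(min_ade([rollout_a, rollout_b], gt))   # 0.5 (rollout_b is best)
print(min_fde([rollout_a, rollout_b], gt))   # 1.0
print(miss_rate([([rollout_a, rollout_b], gt), ([rollout_a], gt)]))  # 0.5
```

In the toy example the second sample contains only rollout_a, whose endpoint is 3 m from the ground truth, so it counts as a miss and the miss rate over the two samples is 0.5.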

Baseline Results
We choose MotionLM [2] as our main baseline because our model builds upon it. MotionLM is a strong autoregressive joint motion predictor that formulates multi-agent forecasting as a language modeling task over discrete motion tokens. It achieves state-of-the-art performance on the WOMD interactive benchmark by generating joint, multimodal trajectories in a temporally causal, tokenized fashion, without relying on latent variables or predefined anchors. Some qualitative results for this baseline, taken from the paper, are shown below.

Figure 1 illustrates a lane-changing scenario in which an agent in the adjacent lane yields to the lane-changing vehicle; the primary predicted trajectory (green to blue) shows a smooth merge, and the secondary one (orange to purple) represents a delayed lane change. Figure 2 shows a pedestrian interaction in which the pedestrian crosses behind a vehicle. The model's predictions vary with the vehicle’s progression speed, and the primary mode allows safe passage. Figure 3 depicts a cyclist-vehicle interaction at an intersection: the most probable rollout predicts the vehicle yielding to the crossing cyclist, while the secondary rollout assumes the vehicle turns ahead of the cyclist’s approach.
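To make the idea of forecasting over discrete motion tokens more concrete, the sketch below quantizes per-step position deltas into a small vocabulary and decodes them back into a trajectory. It is a simplified illustration and not MotionLM's actual tokenizer: the uniform binning, the bin count (NUM_BINS), and the delta range (MAX_DELTA) are assumptions chosen for readability, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

NUM_BINS = 13      # assumed number of bins per axis (illustrative)
MAX_DELTA = 3.0    # assumed largest per-step displacement in meters
BIN_WIDTH = 2 * MAX_DELTA / (NUM_BINS - 1)


def quantize(delta: float) -> int:
    """Map a per-step displacement to the index of the nearest bin."""
    clipped = max(-MAX_DELTA, min(MAX_DELTA, delta))
    return round((clipped + MAX_DELTA) / BIN_WIDTH)


def dequantize(index: int) -> float:
    """Map a bin index back to its representative displacement."""
    return index * BIN_WIDTH - MAX_DELTA


def tokenize(trajectory: Sequence[Point]) -> List[int]:
    """Encode consecutive (dx, dy) steps as one token each."""
    tokens = []
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        tokens.append(quantize(x1 - x0) * NUM_BINS + quantize(y1 - y0))
    return tokens


def detokenize(tokens: Sequence[int], start: Point) -> List[Point]:
    """Rebuild a trajectory by accumulating the decoded steps from start."""
    x, y = start
    points = [start]
    for token in tokens:
        ix, iy = divmod(token, NUM_BINS)
        x += dequantize(ix)
        y += dequantize(iy)
        points.append((x, y))
    return points


trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)]
tokens = tokenize(trajectory)
print(tokens)                               # [110, 111]
print(detokenize(tokens, trajectory[0]))    # recovers the input trajectory
```

With a vocabulary of NUM_BINS * NUM_BINS tokens per step, an autoregressive model can predict the next token for each agent in the same way a language model predicts the next word; sampling several token sequences yields the multiple rollouts that the metrics above are computed over.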

Expected Results
Our model builds upon the MotionLM framework by integrating vision-language model (VLM)-based traffic-rule-aware reasoning, inspired by the methodology proposed in LLM-Augmented MTR [1]. Specifically, we incorporate traffic-rule cues, such as stop signs, and auditory cues, such as sirens, into the prediction process. Unlike MotionLM, which relies solely on scene embeddings, our method infuses semantic reasoning into the forecasting pipeline, enabling informed predictions in rule-governed or occluded scenarios. We expect this enhancement to outperform MotionLM in complex decision-making situations.

Figure 4 and Figure 5 show the expected results in two challenging real-world scenarios that our model aims to improve. In Figure 4, although the vehicle in the yellow square cannot visually observe the crossing pedestrian, it detects a stop sign ahead and halts before the intersection. Our model then infers that surrounding agents (red trajectories) are also likely to stop, or to cross only when the way is clear, which prevents unsafe maneuvers. In Figure 5, sirens are detected from an unseen emergency vehicle, prompting the agents in the yellow squares to yield despite no visible obstacle. This again illustrates the value of multimodal reasoning.

In Table 1, we expect our model (Ours (Exp.)) to improve significantly over MotionLM on all three metrics (minADE, minFDE, MR), showing the benefit of context-aware cues. However, we expect it to slightly underperform LLM-Augmented MTR, which uses an LLM directly to generate traffic-aware embeddings. Unlike their model, ours fine-tunes a VLM encoder on traffic-rule prompts but lacks full-scale LLM reasoning. We expect this trade-off to improve efficiency while retaining a substantial gain in reasoning capability.