Proposed Model Architecture
To address the challenges of motion forecasting under occlusion and in rare traffic scenarios, we propose a VLM-augmented multimodal architecture. Our model integrates audio signals, visual signals, and traffic-rule understanding obtained through a vision-language model (VLM) to produce safer, more interpretable predictions. It consists of three main components: a Scene Encoder for visual and auditory fusion, a Traffic-Rule-Aware Encoder that prompts the VLM for legal reasoning, and a Decoder that autoregressively predicts future trajectories from both streams of embeddings. This modular design enables rule-aware prediction even in visually ambiguous scenes.
Figure 1. Model Architecture
Starting from the left, the Scene Encoder processes multimodal input (e.g., visual and audio data) using attention and self-attention modules to generate scene embeddings. In parallel, the Traffic-Rule-Aware Encoder uses rasterized transportation maps and prompt engineering to extract legal cues (e.g., stop signs, ambulance rules) from a VLM. These rule-aware embeddings are supervised using cross-entropy and integrated via dual attention with scene features. Finally, the Decoder autoregressively predicts future agent positions using stacked transformer layers, enabling trajectory generation that accounts for occlusion, rare rules, and auditory cues.
Input Data and Scene Encoder
The Scene Encoder takes multimodal input data—such as agent state history, lane centerlines, traffic light signals, agent interactions, audio signals, and other potential modalities like camera images and LiDAR—and encodes each modality independently. These embeddings are fused via cross-modal self-attention to capture spatial and semantic context. The final scene embedding has shape (R, N, L, H), where R is the number of rollouts, N is the number of agents, L is the sequence length, and H is the hidden dimension.
Figure 2. Input Data and Scene Encoder
Agent State History
Tracks past positions, velocities, and headings of agents to model their motion trends and intentions.
Lane Centerlines
Provides HD map lane geometries to help align agent trajectories with driving lanes.
Traffic Light Signals
Encodes the current status and location of nearby traffic lights, affecting vehicle motion legality.
Agent Interactions
Captures relative distances and motion cues between nearby agents to infer cooperative or conflicting behaviors.
Audio Signals
Uses spectrogram features to detect environmental audio cues like sirens or honks, important under occlusion.
Other Potential Modalities
We also plan to fuse other potential modalities, such as each agent's front-camera image and the LiDAR point cloud.
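As a concrete illustration of the fusion step in the Scene Encoder, the sketch below projects each modality's embedding to a shared hidden size and fuses the resulting tokens with self-attention. Module names, layer counts, and dimensions are illustrative assumptions, not the final design.

```python
import torch
import torch.nn as nn

class SceneEncoderSketch(nn.Module):
    """Illustrative cross-modal fusion; sizes and layer counts are assumptions."""

    def __init__(self, num_modalities: int, hidden_dim: int = 128, num_heads: int = 4):
        super().__init__()
        # One projection per modality into a shared hidden size H.
        self.proj = nn.ModuleList(
            [nn.LazyLinear(hidden_dim) for _ in range(num_modalities)]
        )
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads, batch_first=True),
            num_layers=2,
        )

    def forward(self, modalities):
        # modalities: list of tensors shaped (R*N, L_m, feat_m), one per modality.
        tokens = torch.cat([p(x) for p, x in zip(self.proj, modalities)], dim=1)
        fused = self.fusion(tokens)          # cross-modal self-attention over all tokens
        L = modalities[0].shape[1]
        return fused[:, :L, :]               # (R*N, L, H), later reshaped to (R, N, L, H)
```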
Traffic-Rule-Aware Encoder
The Traffic-Rule-Aware Encoder extracts specific rule-related scenarios from input data and rasterizes them into structured BEV maps that reflect agents’ positions, traffic signs, and scene layout. These maps are paired with prompts containing auto-generated scene captions, predefined label categories, and a strict response format—augmented by few-shot examples—to guide the vision-language model (VLM) in inferring contextual rules. The VLM, equipped with a QLoRA adapter, outputs a 17-dimensional embedding known as IAS, which encodes:
- Intentions (I): Agent’s intended motion (e.g., STRAIGHT, LEFT-TURN, STOP), represented as a weighted one-hot vector in ℝ⁵.
- Affordances (A): Legality or feasibility of actions (e.g., LEFT-ALLOW, STOP-FORCE), represented as a binary vector in ℝ⁸.
- Scenario Types (S): High-level traffic context (e.g., INTERSECTION, MERGING), represented as a binary vector in ℝ⁴.
The output IAS embedding is of shape (R,N,L,D), where R is the number of rollouts, N is the number of agents, L is the sequence length, and D=17 is the total embedding dimension. The model is fine-tuned with QLoRA using Binary Cross-Entropy (BCE) loss between pre-defined ground-truth IAS and the predicted IAS embeddings:
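A sketch of this objective, assuming a standard element-wise BCE averaged over the 17 IAS dimensions (and over rollouts, agents, and timesteps):

$$
\mathcal{L}_{\text{IAS}} = -\frac{1}{D}\sum_{d=1}^{D}\Big[\, y_d \log \hat{y}_d + (1 - y_d)\,\log\big(1 - \hat{y}_d\big) \Big], \qquad D = 17,
$$

where $y$ denotes the pre-defined ground-truth IAS vector and $\hat{y}$ the predicted IAS embedding.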
This enables the model to align predicted traffic-rule embeddings with structured, interpretable ground truth labels.
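As a minimal sketch of how a GT-IAS target could be assembled and supervised, the snippet below concatenates the three blocks (5 + 8 + 4 = 17) and applies BCE. The label vocabularies beyond the examples named above (e.g., RIGHT-TURN, CROSSWALK) are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

# Hypothetical label layouts; the real category lists come from the prompt schema.
INTENTIONS = ["STRAIGHT", "LEFT-TURN", "RIGHT-TURN", "STOP", "YIELD"]              # |I| = 5
AFFORDANCES = ["LEFT-ALLOW", "RIGHT-ALLOW", "STRAIGHT-ALLOW", "STOP-FORCE",
               "YIELD-FORCE", "U-TURN-ALLOW", "LANE-CHANGE-ALLOW", "PARK-ALLOW"]   # |A| = 8
SCENARIOS = ["INTERSECTION", "MERGING", "CROSSWALK", "HIGHWAY"]                    # |S| = 4

def make_gt_ias(intent_weights, affordances, scenarios):
    """Concatenate the three blocks into a single 17-d GT-IAS target."""
    i = torch.tensor(intent_weights, dtype=torch.float32)        # weighted one-hot, R^5
    a = torch.zeros(len(AFFORDANCES))
    a[[AFFORDANCES.index(k) for k in affordances]] = 1.0         # binary, R^8
    s = torch.zeros(len(SCENARIOS))
    s[[SCENARIOS.index(k) for k in scenarios]] = 1.0             # binary, R^4
    return torch.cat([i, a, s])                                  # R^17

gt = make_gt_ias([0.8, 0.2, 0, 0, 0], ["STRAIGHT-ALLOW", "STOP-FORCE"], ["INTERSECTION"])
pred = torch.sigmoid(torch.randn(17))                            # stand-in for the VLM head output
loss = F.binary_cross_entropy(pred, gt)
```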
Vision-Language Model (VLM)
- Receives prompt and BEV-derived raster maps.
- Trained using Binary Cross-Entropy Loss to match predefined GT-IAS labels.
- Outputs legal-aware traffic embeddings.
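As a rough sketch of how the VLM could be loaded in 4-bit precision and wrapped with a QLoRA adapter using Hugging Face transformers, peft, and bitsandbytes; the base model name and LoRA hyperparameters below are placeholders, not a committed choice.

```python
import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Placeholder base model and hyperparameters for illustration only.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit quantized base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # attention projections of the language model
)
vlm = get_peft_model(base, lora_cfg)         # only the low-rank adapter weights are trained
```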
Figure 3. Traffic-Rule-Aware Encoder Architecture
Figure 4. Prompt Engineering Schema
Prompt Engineering
- Transportation Context Map Rasterization converts dynamic and static traffic elements into BEV maps.
- Captures stop signs, lights, lanes, and agent context to structure the prompt input.
- Text Prompt is auto-generated with captions describing the scene and actions.
- Provides rule definitions and few-shot examples to guide the VLM output.
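A minimal sketch of how the text half of the prompt could be assembled; the caption wording, label lists, and JSON response format are illustrative assumptions about the schema in Figure 4.

```python
def build_rule_prompt(scene_caption: str, few_shot_examples: list) -> str:
    """Assemble the text prompt paired with the rasterized BEV map.

    The caption, label lists, and response format below are illustrative;
    the actual schema is defined by the prompt-engineering module.
    """
    labels = (
        "Intentions: STRAIGHT, LEFT-TURN, STOP, ...\n"
        "Affordances: LEFT-ALLOW, STOP-FORCE, ...\n"
        "Scenario types: INTERSECTION, MERGING, ..."
    )
    response_format = (
        "Answer strictly as JSON: "
        '{"intention": [...], "affordance": [...], "scenario": [...]}'
    )
    examples = "\n\n".join(few_shot_examples)
    return (
        f"Scene description: {scene_caption}\n\n"
        f"Label categories:\n{labels}\n\n"
        f"Examples:\n{examples}\n\n"
        f"{response_format}"
    )
```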
Audio Data Generation and Integration
To incorporate spatial audio cues into motion forecasting, we simulate audio signals and convert them into informative embeddings. These audio features help the model infer occluded events, such as an approaching emergency vehicle, by mimicking how sounds are perceived in realistic 3D environments.
Figure 5. Audio Data Generation Workflow
Simulated Audio Processing
We define sound source and listener positions within a scene, and apply HRTF filtering (e.g., using SoundSpaces) to generate spatial audio signals. This simulates how a listener would realistically perceive sounds coming from different directions.
Spectrogram Feature Extraction
The simulated audio is converted into log Mel spectrograms through a series of transformations, including a frequency B-spline (fbsp) wavelet transform, Mel filtering, and logarithmic scaling. These spectrograms are then processed by an ESResNeXt [8] model to extract high-level audio embeddings for integration into the Scene Encoder.
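As an approximation of this pipeline, the sketch below computes log-Mel spectrograms with torchaudio; note that it uses a standard STFT front end rather than the trainable fbsp transform, and the parameter values are illustrative rather than the exact ESResNeXt preprocessing settings.

```python
import torch
import torchaudio

SAMPLE_RATE = 44100  # illustrative sample rate

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=2048, hop_length=512, n_mels=128
)
to_db = torchaudio.transforms.AmplitudeToDB()

def audio_to_logmel(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (channels, samples) spatialized audio -> log-Mel spectrogram."""
    return to_db(mel(waveform))     # (channels, n_mels, frames), fed to the audio backbone

# Example with a synthetic 1 s stereo clip standing in for HRTF-filtered audio.
dummy = torch.randn(2, SAMPLE_RATE)
logmel = audio_to_logmel(dummy)     # e.g. (2, 128, 87)
```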
Decoder
The decoder generates future trajectories for each agent by integrating scene and traffic-rule-aware embeddings into an autoregressive generation process. It leverages dual cross-attention to fuse the two modalities—scene context and legal reasoning—and produces sequential outputs step-by-step. The decoder is trained using teacher forcing and evaluated autoregressively at test time.
- Dual Attention Fusion: Combines scene embeddings (R,N,L,H) and traffic-rule-aware embeddings (R,N,L,D) via bi-directional cross-attention for comprehensive context modeling.
- Autoregressive Generation: Predicts future coordinates step-by-step, using prior outputs, self-attention, and cross-attention to the input embeddings.
- Training & Inference: During training, teacher forcing is used (ground truth fed into next step). At inference, previous predictions are recursively used as inputs.
- Output Format: The decoder outputs a motion sequence of shape (R,N,T,2), where 2 corresponds to predicted (x,y) positions at each timestep.
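A sketch of the motion objective, assuming a standard negative log-likelihood formulation over the predicted positions:

$$
\mathcal{L}_{\text{motion}} = -\frac{1}{N\,T}\sum_{n=1}^{N}\sum_{t=1}^{T} \log p_\theta\!\left(\hat{\mathbf{y}}^{\,n}_{t} \mid \mathbf{A}^{\,n}_{<t},\, \mathbf{S}\right)
$$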
Minimizing this motion loss maximizes the likelihood of the predicted future positions given the agent history A before time t and the scene context S, averaged over all agents and time steps.
The following shows the decoder details. The decoder takes two inputs—scene embeddings (R,N,L,H) from the Scene Encoder and traffic-rule-aware embeddings (R,N,L,D) from the traffic-rule-aware encoder—each passed through an MLP to align dimensions. These are further fused using a Dual Attention module, which applies bi-directional cross-attention between the two modalities. The resulting fused embeddings are fed into a multi-layer autoregressive decoder composed of self-attention and cross-attention layers, allowing the model to generate future coordinates step-by-step while attending to both past predictions and contextual inputs. During training, teacher forcing is used by feeding in ground-truth positions, while inference is conducted autoregressively, relying on previously predicted outputs.
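A condensed sketch of this decoder, assuming standard PyTorch building blocks; the MLP depths, layer sizes, and rollout horizon are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualAttentionDecoder(nn.Module):
    """Sketch of dual cross-attention fusion followed by autoregressive rollout."""

    def __init__(self, h_dim: int = 128, ias_dim: int = 17, num_heads: int = 4, num_layers: int = 4):
        super().__init__()
        self.scene_mlp = nn.Sequential(nn.Linear(h_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, h_dim))
        self.rule_mlp = nn.Sequential(nn.Linear(ias_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, h_dim))
        # Bi-directional cross-attention between scene and rule embeddings.
        self.scene_to_rule = nn.MultiheadAttention(h_dim, num_heads, batch_first=True)
        self.rule_to_scene = nn.MultiheadAttention(h_dim, num_heads, batch_first=True)
        self.coord_embed = nn.Linear(2, h_dim)
        layer = nn.TransformerDecoderLayer(d_model=h_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(h_dim, 2)          # predict (x, y) per step

    def fuse(self, scene, rules):
        # scene: (B, L, H) scene embeddings; rules: (B, L, 17) IAS embeddings, B = R*N.
        s, r = self.scene_mlp(scene), self.rule_mlp(rules)
        s2, _ = self.scene_to_rule(s, r, r)      # scene queries attend to rules
        r2, _ = self.rule_to_scene(r, s, s)      # rule queries attend to scene
        return torch.cat([s2, r2], dim=1)        # fused memory for decoder cross-attention

    def forward(self, scene, rules, gt_future=None, horizon: int = 30):
        memory = self.fuse(scene, rules)
        if gt_future is not None:
            # Teacher forcing: feed ground-truth positions for steps 1..T-1 to predict 2..T.
            tgt = self.coord_embed(gt_future[:, :-1, :])
            T = tgt.shape[1]
            mask = torch.triu(torch.full((T, T), float("-inf"), device=tgt.device), diagonal=1)
            return self.head(self.decoder(tgt, memory, tgt_mask=mask))
        # Autoregressive inference: start from a zero token standing in for the last observed position.
        coords = [scene.new_zeros(scene.shape[0], 1, 2)]
        for _ in range(horizon):
            tgt = self.coord_embed(torch.cat(coords, dim=1))
            out = self.decoder(tgt, memory)
            coords.append(self.head(out[:, -1:, :]))     # feed the newest (x, y) back in
        return torch.cat(coords[1:], dim=1)              # (B, T, 2), reshaped to (R, N, T, 2)
```

During training the `gt_future` branch implements teacher forcing with a causal mask; at inference, the loop feeds each predicted (x, y) back in as the next query token.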