Experiments

Visualization of Temporally Hierarchical Scene Graph

VQA Quantitative Results

The figure above reports the accuracy of each method on the VQA task as an Overall score plus eight sub-tasks: 3D Perception, Fixture Location, Fine-Grained Action Localization, Fine-Grained Action Recognition, Fine-Grained How Recognition, Fine-Grained Why Recognition, Gaze Estimation, and Gaze Interaction Anticipation.
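As a rough illustration of how such a per-sub-task breakdown can be computed, the sketch below tallies accuracy from a list of answered questions. The record layout `(subtask, predicted, correct)` and the function name `accuracy_by_subtask` are hypothetical, and treating Overall as the accuracy over all questions pooled together is an assumption; an average over sub-task scores would be another reasonable definition.

```python
from collections import defaultdict


def accuracy_by_subtask(records):
    """Return accuracy per sub-task plus a pooled "Overall" score.

    records: iterable of (subtask, predicted, correct) tuples
             (hypothetical layout, chosen for illustration).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subtask, predicted, correct in records:
        totals[subtask] += 1
        hits[subtask] += int(predicted == correct)

    scores = {name: hits[name] / totals[name] for name in totals}
    # Assumption: "Overall" pools every question, so larger sub-tasks
    # weigh more than smaller ones.
    scores["Overall"] = sum(hits.values()) / sum(totals.values())
    return scores
```

With pooled scoring, a sub-task that contributes many questions dominates the Overall number, which is worth keeping in mind when comparing it against the individual sub-task bars.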

We compare four different approaches:

Blind Guess, which relies solely on the question without any scene information, achieves the lowest performance overall, with accuracy typically ranging from 0.1 to 0.3. This highlights the difficulty of answering questions correctly without access to visual context.

Per-Frame Graph (ours) leverages a full scene graph for each frame. It improves markedly over Blind Guess, particularly in Gaze Estimation and Gaze Interaction Anticipation, indicating that frame-level graph information effectively captures actions and gaze relationships.

THSG (ours) incorporates a hierarchical pruning strategy. Compared to Per-Frame Graph, it improves further on some sub-tasks, such as Fine-Grained Action Localization, and in the Overall score, demonstrating the effectiveness of reducing redundant information.

SceneNet [1], a state-of-the-art baseline, performs well on 3D Perception and Fine-Grained How Recognition but slightly underperforms our Per-Frame Graph in Gaze Estimation and Gaze Interaction Anticipation.
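To give a feel for the kind of redundancy that per-frame graphs carry, the sketch below collapses runs of consecutive frames whose graphs are identical into a single time segment. This is only an illustration of temporal redundancy reduction under simplifying assumptions, not the THSG pruning strategy itself: the triple representation `(subject, relation, object)` and the function name `collapse_frames` are hypothetical.

```python
def collapse_frames(frame_graphs):
    """Merge consecutive frames with identical graphs into segments.

    frame_graphs: list with one set of (subject, relation, object)
                  triples per frame (hypothetical representation).
    Returns a list of (first_frame, last_frame, triples) tuples.
    """
    segments = []
    for index, triples in enumerate(frame_graphs):
        triples = frozenset(triples)
        if segments and segments[-1][2] == triples:
            # Same graph as the previous frame: extend the open segment.
            start, _, same = segments[-1]
            segments[-1] = (start, index, same)
        else:
            # Graph changed: start a new segment at this frame.
            segments.append((index, index, triples))
    return segments
```

In this toy version, a long stretch in which nothing changes costs one segment instead of one graph per frame, which shortens the context passed to the question-answering model.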

Taken together, our Per-Frame Graph and THSG methods outperform Blind Guess and SceneNet on most sub-tasks, with notable gains in action recognition, gaze estimation, and gaze interaction anticipation. These results support the effectiveness of incorporating scene graphs and hierarchical strategies in VQA tasks.

References

[1] A. Taluzzi et al., “From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge,” arXiv:2506.08553, 2025. https://arxiv.org/abs/2506.08553