Understanding long-form videos poses unique challenges for Video Question Answering (VideoQA), as it requires reasoning over actions and intentions that unfold at drastically different timescales — from fleeting, atomic gestures like picking up an apple to prolonged, goal-driven activities such as shopping in a supermarket. Existing vision-language models often fall short in capturing this temporal diversity. To bridge this gap, we introduce temporally hierarchical scene graphs as a structured representation of videos.
Scene graphs abstract visual information into <Subject – Relation – Object> triplets, yielding a compact, interpretable representation that supports relational reasoning over visual entities. They have traditionally been applied to static images and are inherently lossy; nevertheless, we hypothesize that building dynamic scene graphs over video can drastically reduce inference cost for simple, object-level VideoQA tasks.
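To make the triplet representation concrete, the following is a minimal Python sketch; the Triplet class, its field names, and the example graph are illustrative assumptions rather than our implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    """One <Subject - Relation - Object> edge of a scene graph (illustrative)."""
    subject: str   # e.g. "person"
    relation: str  # e.g. "picks_up"
    obj: str       # e.g. "apple"

# A frame-level scene graph is simply a set of such triplets.
frame_graph = {
    Triplet("person", "holds", "basket"),
    Triplet("person", "picks_up", "apple"),
    Triplet("apple", "on", "shelf"),
}

# Object-level questions ("What is the person holding?") reduce to filtering edges
# rather than re-running a heavy vision-language model over the frames.
answer = {t.obj for t in frame_graph if t.subject == "person" and t.relation == "holds"}
print(answer)  # {'basket'}
```

In this form, answering object-level questions becomes a lightweight lookup over edges, which is the kind of saving the inference-cost hypothesis above relies on.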

By constructing scene graphs at multiple temporal resolutions — frame-level interactions, action segments, and high-level goal sequences — we retain both short, atomic actions and long-horizon intentions in a unified representation.
This hierarchical approach supports more structured reasoning, makes long-range temporal dependencies explicit, and offers a scalable, explainable pathway for tackling diverse VideoQA tasks.
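One way to picture the hierarchy is as parallel layers of time-stamped graphs over the same video. The sketch below is one possible in-memory layout, assuming three fixed levels (frame, segment, goal) and reusing the illustrative Triplet class from the previous sketch; it is not the actual data structure used in our system.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass(frozen=True)
class Triplet:
    """<Subject - Relation - Object> edge (as in the previous sketch)."""
    subject: str
    relation: str
    obj: str

@dataclass
class TimedGraph:
    """A scene graph that holds over the time interval [start, end], in seconds."""
    start: float
    end: float
    triplets: Set[Triplet] = field(default_factory=set)

@dataclass
class HierarchicalSceneGraph:
    """The same video represented at three temporal resolutions."""
    frame_level: List[TimedGraph] = field(default_factory=list)    # atomic interactions
    segment_level: List[TimedGraph] = field(default_factory=list)  # action segments
    goal_level: List[TimedGraph] = field(default_factory=list)     # long-horizon goals

    def active_triplets(self, level: str, t: float) -> Set[Triplet]:
        """All triplets on the given level whose interval contains time t."""
        return {tr for g in getattr(self, level)
                if g.start <= t <= g.end
                for tr in g.triplets}

# A short action lives at the frame level; the shopping goal spans the whole clip.
video = HierarchicalSceneGraph(
    frame_level=[TimedGraph(12.0, 12.5, {Triplet("person", "picks_up", "apple")})],
    goal_level=[TimedGraph(0.0, 600.0, {Triplet("person", "shops_in", "supermarket")})],
)
print(video.active_triplets("goal_level", 12.2))  # triplet with relation 'shops_in'
```

Under such a layout, a question about a fleeting gesture only touches a few frame-level graphs, while a question about overall intent reads a handful of long-span goal-level graphs, so each question is served at the temporal resolution it actually needs.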

