Introduction

Understanding long-form videos poses unique challenges for Video Question Answering (VideoQA), as it requires reasoning over actions and intentions that unfold at drastically different timescales — from fleeting, atomic gestures like picking up an apple to prolonged, goal-driven activities such as shopping in a supermarket. Existing vision-language models often fall short in capturing this temporal diversity. To bridge this gap, we introduce temporally hierarchical scene graphs as a structured representation of videos.

Scene graphs abstract visual information into <Subject – Relation – Object> triplets, enabling compact, interpretable representations and relational reasoning over visual entities. Traditionally applied to static images, they provide a structured way to represent semantic relationships. Although scene graphs are lossy abstractions, we hypothesize that building dynamic scene graphs can drastically reduce inference cost for simple, object-level VideoQA tasks.

Figure: An example of an image scene graph that captures the semantics of a scene.
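To make the triplet representation concrete, here is a minimal sketch of an image scene graph stored as a set of <Subject – Relation – Object> triplets. The entity and relation names (person, apple, counter) are illustrative examples, not drawn from any particular dataset, and the `SceneGraph` class is a hypothetical helper rather than a fixed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """A single <Subject - Relation - Object> fact about a scene."""
    subject: str
    relation: str
    object: str


class SceneGraph:
    """A scene graph stored as a set of triplets over detected entities."""

    def __init__(self, triplets):
        self.triplets = set(triplets)

    def neighbors(self, entity):
        """Return every triplet in which `entity` participates."""
        return {t for t in self.triplets
                if t.subject == entity or t.object == entity}


# Example: a kitchen scene described by three triplets.
graph = SceneGraph([
    Triplet("person", "holds", "apple"),
    Triplet("apple", "on", "counter"),
    Triplet("person", "next to", "counter"),
])

print(graph.neighbors("apple"))
```

Queries like `neighbors("apple")` illustrate why the representation is compact and relational: answering an object-level question only requires touching the triplets that mention the queried entity.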

By constructing scene graphs at multiple temporal resolutions, such as frame-level interactions, action segments, and goal-level sequences, we can retain both short, atomic actions and long-horizon intentions in a unified representation.
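The sketch below shows one possible way to store such a hierarchy: per-frame graphs nested under action-segment graphs, which are in turn nested under a goal-level graph spanning the whole video. The `TemporalSceneGraph` container, the level boundaries, and the example triplets are assumptions made for illustration, not a finalized design.

```python
from dataclasses import dataclass, field


@dataclass
class TemporalSceneGraph:
    """A scene graph covering a time span, with finer-grained graphs nested inside."""
    start: float                                    # span start time (seconds)
    end: float                                      # span end time (seconds)
    triplets: list = field(default_factory=list)    # (subject, relation, object) tuples
    children: list = field(default_factory=list)    # finer temporal resolutions


# Frame-level graphs capture fleeting, atomic interactions.
frame1 = TemporalSceneGraph(12.0, 12.1, [("hand", "reaches for", "apple")])
frame2 = TemporalSceneGraph(12.1, 12.2, [("hand", "grasps", "apple")])

# An action segment summarizes its frames as one atomic action.
pick_up = TemporalSceneGraph(12.0, 12.2,
                             [("person", "picks up", "apple")],
                             children=[frame1, frame2])

# The goal level summarizes the prolonged, goal-driven activity.
shopping = TemporalSceneGraph(0.0, 900.0,
                              [("person", "shops in", "supermarket")],
                              children=[pick_up])


def triplets_between(node, t0, t1):
    """Collect triplets from every level whose span overlaps [t0, t1]."""
    found = []
    if node.start < t1 and node.end > t0:
        found.extend(node.triplets)
        for child in node.children:
            found.extend(triplets_between(child, t0, t1))
    return found


print(triplets_between(shopping, 11.5, 12.5))
```

A query over a short window returns both the atomic interaction ("hand grasps apple") and the enclosing intention ("person shops in supermarket"), which is the kind of cross-timescale reasoning the hierarchy is meant to support.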

This hierarchical approach enables more structured reasoning, makes long-range temporal dependencies easier to capture, and provides a scalable, explainable pathway for tackling diverse VideoQA tasks.

Figure: Hypothesized temporally hierarchical scene graph.