Video Question Answering (VideoQA) requires models to understand long, complex videos and answer questions ranging from fine-grained object details to high-level temporal reasoning. Existing systems, however, are often constrained by short context windows and limited long-term memory, causing earlier events to be forgotten and weakening their temporal understanding. These limitations make it challenging to handle real-world videos that span minutes or even hours.
To address these challenges, we introduce Temporally Hierarchical Scene Graphs (THSGs), a structured representation that encodes entities, attributes, and interactions across multiple temporal scales. THSGs provide a compact yet informative summary of video content, preserving key visual information while remaining memory-efficient.
Scene graphs abstract visual information into <subject, relation, object> triplets, enabling interpretable, relational reasoning over entities. While scene graphs have traditionally been applied to static images, dynamic scene graphs extend this reasoning to videos, capturing both instantaneous interactions and longer-term activities.
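As a concrete illustration, the sketch below shows one minimal way frame-level triplets and per-entity attributes could be stored. It is not the system's actual implementation; all class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Triplet:
    # One <subject, relation, object> edge, e.g. <person, reaches_for, cup>.
    subject: str
    relation: str
    object: str


@dataclass
class FrameSceneGraph:
    # Dynamic scene graph for a single frame (or short clip): a set of triplets
    # plus per-entity attributes such as color or state.
    timestamp: float
    triplets: list[Triplet] = field(default_factory=list)
    attributes: dict[str, dict[str, str]] = field(default_factory=dict)


# Example: a frame in which a person reaches for a red cup on a table.
frame = FrameSceneGraph(
    timestamp=12.4,
    triplets=[Triplet("person", "reaches_for", "cup"),
              Triplet("cup", "on", "table")],
    attributes={"cup": {"color": "red"}},
)
```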
By constructing scene graphs at multiple temporal resolutions — from frame-level interactions to action segments and overarching goal sequences — THSGs retain both short-term atomic actions and long-range intentions in a unified representation. This hierarchical structure not only supports efficient indexing and retrieval of relevant moments but also facilitates graph-based reasoning for VideoQA. When answering a question, the model can navigate the temporal hierarchy to identify pertinent entities and interactions, aggregate evidence across frames, and reason over both short-term and long-term dependencies to produce accurate and explainable answers.
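Building on the frame-level sketch above, one possible organization of the temporal hierarchy and its coarse-to-fine lookup is sketched below. The segment granularity, class names, and the `retrieve` interface are assumptions for illustration, not the system's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentNode:
    # Mid-level node: an action segment summarizing a span of frame-level graphs.
    start: float
    end: float
    label: str                                   # e.g. "making coffee"
    frames: list[FrameSceneGraph] = field(default_factory=list)


@dataclass
class THSG:
    # Top level: the whole video as an ordered sequence of action segments,
    # which together trace the overarching goal sequence.
    segments: list[SegmentNode] = field(default_factory=list)

    def retrieve(self, entity: str) -> list[FrameSceneGraph]:
        # Navigate the hierarchy segment by segment and collect the frame-level
        # graphs in which the queried entity appears as subject or object.
        hits = []
        for seg in self.segments:
            hits.extend(
                fg for fg in seg.frames
                if any(entity in (t.subject, t.object) for t in fg.triplets)
            )
        return hits
```

Under this sketch, a question about, say, the cup could start from `retrieve("cup")` to index the relevant moments, aggregate evidence across the returned frame graphs, and consult the enclosing segment labels for longer-range context before producing an answer.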

