Temporally Hierarchical Scene Graph Generation for Video Question Answering

2025 MSCV Capstone project by Daniel Yang and Tianzhi Li
Supervised by Katia Sycara, Yaqi Xie

Motivation

Video Question Answering (VideoQA) requires reasoning over actions that occur at drastically different timescales — from short, atomic gestures like picking up an apple to long-term goals like shopping in the supermarket. Traditional vision-language models struggle to capture such temporal diversity, particularly in long videos where long, overarching intentions and fine-grained actions coexist.
We propose a temporally hierarchical scene graph representation to address this. Scene graphs encode video content as compact <Subject – Relation – Object> triplets, enabling structured and explainable reasoning. By building scene graphs at multiple temporal resolutions — from frame-level interactions to high-level goals — we can preserve both fine-grained actions and overarching intentions.
This approach offers a scalable, interpretable, and resource-efficient method for long-form video understanding, making VideoQA more tractable across diverse tasks and durations.
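To make the representation concrete, below is a minimal sketch of how such triplets might be stored. The `Triplet` class and the example graph are illustrative assumptions, not our final data format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    """One <Subject - Relation - Object> edge of a scene graph."""
    subject: str
    relation: str
    object: str

# A toy frame-level scene graph: "a person picks up an apple from the table".
frame_graph = frozenset({
    Triplet("person", "picks up", "apple"),
    Triplet("apple", "on", "table"),
})
```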

An example of an image scene graph that captures the semantics of a scene.

Datasets

EgoSchema

The EgoSchema benchmark contains over 5,000 very long-form video language understanding questions spanning over 250 hours of real, diverse, and high-quality egocentric video. Many videos feature complex, cluttered scenes where scene graphs can be exploited to their full potential.

EgoSchema dataset QA example

NExT-QA

NExT-QA is a VideoQA benchmark targeting the explanation of video content. It challenges QA models to reason about causal and temporal actions and to understand the rich object interactions in daily activities. NExT-QA contains 5,440 videos and about 52K manually annotated question-answer pairs, grouped into causal, temporal, and descriptive questions. The videos are about 35 seconds long and focus on interactions between people and, occasionally, pets.

NExT-QA examples and question category statistics

Proposed Method

We plan to jointly generate a hierarchical structure of scene graphs and use it to tackle VideoQA tasks.

Since no annotations are available beyond frame-by-frame image scene graphs, we must construct the higher-level scene graphs in a self-supervised manner.

The key source of supervision for scene graph encoding: Variational Autoencoder (VAE)

Below we outline the steps to generate high-level scene graphs that summarize short clips, starting from image-level scene graphs.

Architecture overview of our proposed temporally hierarchical scene graph generator.
  • Dense low-level scene graph prediction from video frames
    • Densely predict a scene graph for each frame with an off-the-shelf scene graph generator; collapse identical, temporally contiguous scene graphs (a collapsing sketch follows this list).
    • Each node carries an embedding, initialized from an image embedding and a word embedding.
  • Self-supervised scene graph encoding
    • Given the lack of supervision for scene graph encoding, we adopt a Variational Autoencoder (VAE) to learn embeddings in a self-supervised manner.
    • The encoder maps the scene graph into a continuous latent vector, enforcing semantic smoothness across similar graphs.
    • We also adopt a contrastive loss to align the latent vectors of temporally adjacent scene graphs (a training sketch follows this list).
  • Local action segmentation & scene graph reconstruction
    • Given the sequence of latent vectors from the scene graph encoder, we segment the video into coherent action segments (a segmentation sketch follows this list).
    • Aggregate latent vectors within segments via self-attention to obtain high-level segment embeddings.
    • Utilize the VAE decoder to reconstruct segment-level scene graphs, forming a higher-level abstraction.
  • Video QA end-to-end finetuning
    • Tokenize higher-level scene graphs and input them to an LLM for question answering.
    • Finetune the entire system end-to-end, with LoRA-based updates on the LLM and full updates on upstream modules (a finetuning sketch follows this list).
    • Once the hierarchy is built, we should be able to discard the original video and answer subsequent questions from the hierarchical scene graph structure alone.
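A minimal sketch of the collapsing step in the first stage, assuming each frame's scene graph is represented as a frozenset of triplets (as in the earlier example); the function name and span representation are our own illustrative choices.

```python
from typing import Sequence

def collapse_consecutive(graphs: Sequence[frozenset]) -> list[tuple[frozenset, tuple[int, int]]]:
    """Merge runs of identical frame-level scene graphs into (graph, (start, end)) spans."""
    spans: list[tuple[frozenset, tuple[int, int]]] = []
    for t, g in enumerate(graphs):
        if spans and spans[-1][0] == g:
            graph, (start, _) = spans[-1]
            spans[-1] = (graph, (start, t))   # extend the current run
        else:
            spans.append((g, (t, t)))         # start a new run
    return spans
```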
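A sketch of the self-supervised encoding stage in PyTorch. For brevity it treats each scene graph as one pooled feature vector and uses linear layers; our actual encoder would operate on graph structure, and the dimensions, KL weight `beta`, and temperature `tau` are assumed values.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GraphVAE(nn.Module):
    """Toy VAE over pooled scene-graph features (a real encoder would use graph structure)."""
    def __init__(self, dim_in: int = 512, dim_z: int = 128):
        super().__init__()
        self.enc = nn.Linear(dim_in, 2 * dim_z)  # predicts mean and log-variance
        self.dec = nn.Linear(dim_z, dim_in)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, beta: float = 1e-3) -> torch.Tensor:
    recon = F.mse_loss(x_hat, x)                                   # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL regularizer
    return recon + beta * kl

def temporal_contrastive_loss(z: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss: each latent's positive is its temporal successor."""
    z = F.normalize(z, dim=-1)          # (T, dim_z), one latent per collapsed graph
    logits = z[:-1] @ z[1:].T / tau     # similarity of each step to every successor
    targets = torch.arange(len(z) - 1)  # true successors lie on the diagonal
    return F.cross_entropy(logits, targets)
```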
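A sketch of the segmentation and aggregation stage. The cosine-similarity threshold is a stand-in for whatever segmentation criterion we finally adopt, and the learned-query attention pooling is one possible instantiation of the self-attention aggregation; both are assumptions.

```python
import torch
import torch.nn.functional as F

def segment_by_similarity(z: torch.Tensor, threshold: float = 0.8) -> list[tuple[int, int]]:
    """Cut the latent sequence where cosine similarity between neighbors drops."""
    sim = F.cosine_similarity(z[:-1], z[1:], dim=-1)
    cuts = (sim < threshold).nonzero().flatten().tolist()
    bounds, start = [], 0
    for c in cuts:
        bounds.append((start, c))
        start = c + 1
    bounds.append((start, len(z) - 1))
    return bounds

class SegmentPooler(torch.nn.Module):
    """Aggregate per-frame latents in a segment into one embedding via self-attention."""
    def __init__(self, dim_z: int = 128, heads: int = 4):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim_z, heads, batch_first=True)
        self.query = torch.nn.Parameter(torch.randn(1, 1, dim_z))  # learned pooling query

    def forward(self, z_segment: torch.Tensor) -> torch.Tensor:    # (T_seg, dim_z)
        out, _ = self.attn(self.query, z_segment[None], z_segment[None])
        return out[0, 0]                                           # (dim_z,)
```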
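Finally, a sketch of the QA finetuning setup using Hugging Face `transformers` and `peft`. The backbone name, LoRA hyperparameters, target modules, and the plain-text graph serialization are all assumptions for illustration, not committed design choices.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

def graph_to_text(triplets) -> str:
    """Serialize a segment-level scene graph into a prompt-friendly string."""
    return " ; ".join(f"{s} {r} {o}" for s, r, o in triplets)

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical backbone choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA adapters on the LLM; upstream scene-graph modules stay fully trainable.
lora = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32,
                  lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)

prompt = ("Scene graphs: "
          + graph_to_text([("person", "picks up", "apple")])
          + "\nQuestion: What does the person do first?\nAnswer:")
inputs = tokenizer(prompt, return_tensors="pt")
```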

References

[1] Mangalam, Karttikeya, Raiymbek Akshulakov, and Jitendra Malik. "EgoSchema: A diagnostic benchmark for very long-form video language understanding." arXiv preprint arXiv:2308.09126 (2023). https://arxiv.org/abs/2308.09126

[2] Xiao, Junbin, et al. "NExT-QA: Next phase of question-answering to explaining temporal actions." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

[3] Wang, Guan, et al. “OED: towards one-stage end-to-end dynamic scene graph generation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[4] Islam, Md Mohaiminul, et al. “Video recap: Recursive captioning of hour-long videos.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[5] Liang, Weixin, Yanhao Jiang, and Zixuan Liu. "GraghVQA: Language-guided graph neural networks for graph-based visual question answering." arXiv preprint arXiv:2104.10283 (2021).

[6] Nag, Sayak, et al. "Unbiased scene graph generation in videos." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.