Proposed Method

We plan to jointly generate a temporal hierarchy of scene graphs and use it to tackle VideoQA tasks.

Since we only have access to frame-level image scene graphs, we must construct the higher-level scene graphs in a self-supervised manner.

The key source of supervision for scene graph encoding is a Variational Autoencoder (VAE).
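
For reference, the standard VAE training objective (the negative evidence lower bound) can be written as below. Here $G$ denotes an input scene graph and $z$ its latent vector; the symbols $q_\phi$ (encoder), $p_\theta$ (decoder), and the prior $p(z)$ are notational assumptions for illustration, not choices stated in this proposal.

```latex
\mathcal{L}_{\mathrm{VAE}}(G)
  = -\,\mathbb{E}_{q_\phi(z \mid G)}\big[\log p_\theta(G \mid z)\big]
  + D_{\mathrm{KL}}\big(q_\phi(z \mid G)\,\|\,p(z)\big)
```

The first term is the reconstruction loss, which supplies the self-supervised signal; the KL term regularizes the latent space toward the prior.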

The steps below outline how we generate high-level scene graphs, each describing a short clip, from image-level scene graphs.

Architecture overview of our proposed temporally hierarchical scene graph generator.
  • Dense low-level scene graph prediction from video frames
    • Predict a scene graph for every frame with an off-the-shelf scene graph generator; collapse runs of identical consecutive scene graphs.
    • Each node has a node embedding, initialized from its image embedding and word embedding.
  • Self-supervised scene graph encoding
    • Given the lack of supervision for scene graph encoding, we adopt a Variational Autoencoder (VAE) to learn embeddings in a self-supervised manner.
    • The encoder maps each scene graph to a continuous latent vector, encouraging semantic smoothness across similar graphs.
    • We also adopt a contrastive loss to align the latent vectors of temporally adjacent scene graphs.
  • Local action segmentation & scene graph reconstruction
    • Given the sequence of latent vectors from the scene graph encoder, we segment the video into coherent action segments.
    • Aggregate the latent vectors within each segment via self-attention to obtain high-level segment embeddings.
    • Use the VAE decoder to reconstruct segment-level scene graphs, forming a higher-level abstraction.
  • Video QA end-to-end finetuning
    • Tokenize higher-level scene graphs and input them to an LLM for question answering.
    • Finetune the entire system end-to-end, with LoRA-based updates on the LLM and full updates on upstream modules.
    • For subsequent questions, we should be able to discard the original video and answer from our hierarchical scene graph structure alone.
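
The data flow of the first three stages can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the proposed implementation: the function names (`collapse_runs`, `segment`, `attention_pool`) are hypothetical, the toy 2-D latents stand in for VAE encoder outputs, the cosine-similarity threshold is only one possible segmentation rule (the proposal does not fix one), and the parameter-free attention replaces what would be a learned self-attention layer.

```python
import math
from itertools import groupby


def collapse_runs(graphs):
    """Collapse runs of identical consecutive scene graphs.

    Each graph is a frozenset of (subject, predicate, object) triples.
    Returns a list of (graph, first_frame, last_frame) tuples.
    """
    collapsed, start = [], 0
    for graph, run in groupby(graphs):
        length = len(list(run))
        collapsed.append((graph, start, start + length - 1))
        start += length
    return collapsed


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def segment(latents, threshold=0.5):
    """Assumed rule: open a new segment whenever the similarity between
    adjacent latent vectors drops below `threshold`."""
    if not latents:
        return []
    segments = [[0]]
    for i in range(1, len(latents)):
        if cosine(latents[i - 1], latents[i]) < threshold:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments


def attention_pool(vectors):
    """Parameter-free scaled dot-product self-attention, then mean pooling.

    A learned model would add query/key/value projections; they are
    omitted here to keep the sketch self-contained.
    """
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    attended = []
    for query in vectors:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in vectors]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, vectors)) / total for j in range(dim)]
        )
    return [sum(row[j] for row in attended) / len(attended) for j in range(dim)]


# Toy input: four frames, the first two with identical scene graphs.
frames = [
    frozenset({("person", "holds", "cup")}),
    frozenset({("person", "holds", "cup")}),
    frozenset({("person", "drinks_from", "cup")}),
    frozenset({("person", "opens", "door")}),
]
collapsed = collapse_runs(frames)

# Hypothetical latent vectors, one per collapsed scene graph.
latents = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
segments = segment(latents, threshold=0.5)
segment_embeddings = [attention_pool([latents[i] for i in seg]) for seg in segments]
```

In this toy run the four frames collapse to three scene graphs, and the first two latents fall into one segment while the third starts a new one. Each segment embedding would then be passed to the VAE decoder to reconstruct a segment-level scene graph.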