We plan to simultaneously generate a hierarchical structure of scene graphs and use it to tackle VideoQA tasks. Since no ground-truth annotations exist for scene graphs above the frame level, i.e., beyond per-frame image scene graphs, the higher-level scene graphs must be constructed in a self-supervised manner.

Below we outline the steps for generating high-level scene graphs that represent short clips, starting from image-level scene graphs; an illustrative code sketch follows each step.

- Dense low-level scene graph prediction from video frames
    - Densely predict a scene graph for each frame with an off-the-shelf scene graph generator; collapse runs of identical, temporally contiguous scene graphs into one.
    - Each node carries a node embedding, initialized from an image embedding and a word embedding.
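
A minimal sketch of this step. The generator `sgg_model` (returning (subject, predicate, object) triplets per frame), the `image_encoder`, and the `word_emb` lookup are hypothetical placeholders, and the concatenation-based embedding initialization is one possible choice rather than a fixed design:

```python
from dataclasses import dataclass, field

import torch


@dataclass
class SceneGraph:
    triplets: frozenset                           # {(subject, predicate, object), ...}
    node_emb: dict = field(default_factory=dict)  # node label -> embedding


def predict_dense_graphs(frames, sgg_model, image_encoder, word_emb):
    """Run the (hypothetical) scene graph generator on every frame and
    initialize node embeddings from image + word embeddings."""
    graphs = []
    for frame in frames:
        triplets = frozenset(sgg_model(frame))    # assumed API
        g = SceneGraph(triplets)
        img_vec = image_encoder(frame)            # assumed API
        for s, _, o in triplets:
            for node in (s, o):
                g.node_emb[node] = torch.cat([img_vec, word_emb[node]])
        graphs.append(g)
    return collapse_contiguous(graphs)


def collapse_contiguous(graphs):
    """Keep one representative per run of consecutive frames whose
    scene graphs contain the same triplet set."""
    collapsed = []
    for g in graphs:
        if not collapsed or collapsed[-1].triplets != g.triplets:
            collapsed.append(g)
    return collapsed
```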
- Self-supervised scene graph encoding
    - Given the lack of supervision for scene graph encoding, we adopt a Variational Autoencoder (VAE) to learn embeddings in a self-supervised manner.
    - The encoder maps each scene graph into a continuous latent vector, enforcing semantic smoothness across similar graphs.
    - We also adopt a contrastive loss to align the latent vectors of temporally adjacent scene graphs.
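
A sketch of the encoder-side losses, under the assumption that a scene graph is summarized by pooling its node embeddings (a real implementation would likely use a GNN over the graph structure). The InfoNCE formulation of the temporal contrastive term is our assumption, since the exact loss is not pinned down above:

```python
import torch
import torch.nn.functional as F
from torch import nn


class SceneGraphVAEEncoder(nn.Module):
    """Maps a scene graph (here: its node embeddings) to a Gaussian posterior.
    Mean-pooling stands in for a proper graph encoder such as a GNN."""

    def __init__(self, node_dim, latent_dim):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(node_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)

    def forward(self, node_embs):                 # (num_nodes, node_dim)
        h = self.backbone(node_embs).mean(dim=0)
        return self.mu(h), self.logvar(h)


def reparameterize(mu, logvar):
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)


def kl_term(mu, logvar):
    """Standard VAE KL divergence against a unit Gaussian prior."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())


def temporal_contrastive_loss(z, tau=0.1):
    """InfoNCE over latents z: (T, latent_dim). Each latent's positive is its
    temporal successor; all other latents in the sequence act as negatives."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                                     # (T, T)
    mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                # drop self-pairs
    targets = torch.arange(1, z.size(0), device=z.device)     # t -> t + 1
    return F.cross_entropy(sim[:-1], targets)
```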
- Local action segmentation & scene graph reconstruction
    - Given the sequence of latent vectors from the scene graph encoder, we segment the video into coherent action segments.
    - Aggregate latent vectors within segments via self-attention to obtain high-level segment embeddings.
    - Utilize the VAE decoder to reconstruct segment-level scene graphs, forming a higher-level abstraction.
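
A sketch of segmentation and aggregation. The greedy change-point heuristic is purely illustrative, since the segmentation method is not specified above; the attention-pooling module implements the self-attention aggregation:

```python
import torch
import torch.nn.functional as F
from torch import nn


def segment_by_change_points(z, threshold=0.5):
    """Greedy segmentation of a latent sequence z: (T, latent_dim): open a new
    segment whenever cosine similarity to the previous latent drops below the
    threshold. Illustrative only; a learned boundary detector could replace it."""
    starts = [0]
    for t in range(1, z.size(0)):
        if F.cosine_similarity(z[t], z[t - 1], dim=0) < threshold:
            starts.append(t)
    return list(zip(starts, starts[1:] + [z.size(0)]))  # [(start, end), ...]


class SegmentAggregator(nn.Module):
    """Self-attention pooling: a learned query attends over one segment's
    latents to produce a single segment-level embedding."""

    def __init__(self, latent_dim, num_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, latent_dim))
        self.attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)

    def forward(self, z_segment):                  # (seg_len, latent_dim)
        kv = z_segment.unsqueeze(0)                # (1, seg_len, latent_dim)
        pooled, _ = self.attn(self.query, kv, kv)
        return pooled.squeeze(0).squeeze(0)        # (latent_dim,)
```

Each pooled segment embedding would then be decoded with the step-2 VAE decoder to yield one reconstructed scene graph per segment.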
- VideoQA end-to-end finetuning
    - Tokenize the higher-level scene graphs and feed them to an LLM for question answering.
    - Finetune the entire system end-to-end, with LoRA-based updates on the LLM and full updates on the upstream modules.
    - Once the hierarchy is built, we should be able to discard the original video and answer subsequent questions from the hierarchical scene graph structure alone.
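
A sketch of the QA head, assuming a HuggingFace causal LM with PEFT/LoRA; the backbone name, the triplet linearization format, and the `<seg>` separator are placeholders. Note that linearizing graphs into text, shown here for simplicity, would cut the gradient path to the upstream modules; genuinely end-to-end training would instead project segment embeddings into the LLM's input space as soft tokens:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model


def linearize(graphs):
    """Serialize segment-level scene graphs into a flat string, e.g.
    '<seg> (person, holds, cup) (cup, on, table) <seg> ...'."""
    return " ".join(
        "<seg> " + " ".join(f"({s}, {p}, {o})" for s, p, o in g.triplets)
        for g in graphs
    )


model_name = "meta-llama/Llama-2-7b-hf"        # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA adapters on the attention projections; the upstream modules (scene
# graph generator, VAE, segmenter) would join the optimizer with full updates.
lora_cfg = LoraConfig(r=16, lora_alpha=32,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)


def qa_loss(graphs, question, answer):
    prompt = f"Scene graphs: {linearize(graphs)}\nQuestion: {question}\nAnswer: "
    inputs = tokenizer(prompt + answer, return_tensors="pt")
    labels = inputs.input_ids.clone()
    # Mask the prompt so only answer tokens contribute to the LM loss
    # (prompt length is approximate: boundary token merges may shift it).
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.size(1)
    labels[:, :prompt_len] = -100
    return model(**inputs, labels=labels).loss
```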
