Method

Our method comprises two tightly coupled components: a video indexing stage that incrementally constructs temporally evolving scene graphs from video frames, and a video question answering (VQA) stage that performs graph-based retrieval and reasoning over the resulting indexed scene graphs.

1. Video Indexing: Frame-Level Scene Graphs with Temporal Deltas

Scene Graph Definition

We adopt a scene graph representation tailored for temporal indexing.

  • Entity format: (entity_name | entity_type | description)
  • Relationship format: (source | target | description)
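
The two record formats above map directly onto simple typed containers. A minimal sketch in Python, assuming illustrative names (Entity, Relationship, SceneGraph are not taken from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str          # entity_name
    type: str          # entity_type
    description: str   # free-text state description

@dataclass
class Relationship:
    source: str        # source entity name
    target: str        # target entity name
    description: str   # free-text interaction description

@dataclass
class SceneGraph:
    # entities keyed by name; relationships keyed by (source, target)
    entities: dict[str, Entity] = field(default_factory=dict)
    relationships: dict[tuple[str, str], Relationship] = field(default_factory=dict)
    # names of visible but unchanged objects (see "Background object handling" below)
    background: set[str] = field(default_factory=set)
```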

Video Indexing Pipeline

Overview of our video indexing pipeline. For each frame, the LLM takes the current image together with the previous frame’s scene graph as input and predicts the scene graph differences. These differences are then accumulated with the previous scene graph to produce the current frame’s scene graph.

Frame-wise graph generation

  • Sample video frames at 1 FPS.
  • Use GPT-4o-mini to extract scene graph information for each frame.
  • Condition extraction on both the current frame image and the previous frame’s scene graph.
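
A minimal sketch of this per-frame extraction loop, assuming the OpenAI Python SDK and OpenCV for frame sampling; the prompt text and output parsing are simplified placeholders, not the authors' implementation:

```python
import base64

import cv2
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sample_frames(video_path: str, fps: float = 1.0):
    """Yield frames from the video at roughly `fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / fps))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield frame
        index += 1
    cap.release()

def extract_delta(frame, prev_graph_text: str, prompt: str) -> str:
    """Ask GPT-4o-mini for the scene-graph differences w.r.t. the previous frame."""
    ok, buffer = cv2.imencode(".jpg", frame)
    image_b64 = base64.b64encode(buffer.tobytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{prompt}\n\nPrevious frame's scene graph:\n{prev_graph_text}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content  # raw delta text, parsed downstream
```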

Temporal delta extraction

  • Output only differences from the previous frame:
    • newly appeared or disappeared entities and relationships
    • updated descriptions of existing entities or relationships
  • Avoid redundant storage by capturing only meaningful changes.
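
The paper stores only differences but does not mandate a concrete serialization; one illustrative per-frame delta, reusing the tuple formats defined above, could look like:

```python
# One possible per-frame delta representation (illustrative schema, not the paper's).
delta = {
    "added_entities":        [("knife", "object", "held in the person's right hand")],
    "updated_entities":      [("person", "person", "now chopping vegetables")],
    "removed_entities":      ["plate"],
    "added_relationships":   [("person", "knife", "holds")],
    "updated_relationships": [],
    "removed_relationships": [("person", "plate")],
    "background":            ["refrigerator", "window", "clock"],  # names only
}
```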

Scene graph accumulation

  • Merge predicted deltas with the previous frame’s scene graph.
  • Produce the current frame’s complete scene graph while maintaining temporal continuity.
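
Given a delta in the form above, accumulation is a straightforward merge. A sketch, again using the hypothetical SceneGraph structures from earlier:

```python
def apply_delta(prev: SceneGraph, delta: dict) -> SceneGraph:
    """Accumulate a predicted delta onto the previous frame's scene graph."""
    graph = SceneGraph(
        entities=dict(prev.entities),
        relationships=dict(prev.relationships),
        background=set(delta.get("background", [])),
    )
    for name, etype, desc in delta.get("added_entities", []) + delta.get("updated_entities", []):
        graph.entities[name] = Entity(name, etype, desc)
    for name in delta.get("removed_entities", []):
        graph.entities.pop(name, None)
        # also drop relationships that reference the removed entity
        graph.relationships = {key: rel for key, rel in graph.relationships.items()
                               if name not in key}
    for src, tgt, desc in delta.get("added_relationships", []) + delta.get("updated_relationships", []):
        graph.relationships[(src, tgt)] = Relationship(src, tgt, desc)
    for src, tgt in delta.get("removed_relationships", []):
        graph.relationships.pop((src, tgt), None)
    return graph
```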

Background object handling

  • List visible objects whose states remain unchanged as background objects.
  • Include only object names without descriptions or relationships to reduce redundancy.

Prompt design for video indexing

  • List all other visible but inactive objects as background.
  • Instruct the model to output only entities and relationships whose presence or description differs from the previous frame.
  • Include concise evidence from the current frame for changed elements.

The prompt used in Video Indexing.
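
The exact wording of the prompt is given in the figure; purely as an illustration of how the instructions above might be assembled, a paraphrased skeleton could read:

```python
# Paraphrased skeleton of the indexing prompt; the exact wording is the authors'
# and is shown in the accompanying figure.
INDEXING_PROMPT = """You are given the current video frame and the previous frame's scene graph.
Output ONLY the elements whose presence or description differs from the previous frame:
- entities or relationships that newly appeared or disappeared
- entities or relationships whose description changed
For each changed element, include concise visual evidence from the current frame.
List all other visible but inactive objects as background, as names only.
Entity format:       (entity_name | entity_type | description)
Relationship format: (source | target | description)"""
```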

2. Video Question Answering: GraphRAG-Inspired Community Summaries and DRIFT Search

Temporal community construction

  • Cluster frame-level scene graphs into temporal communities to support long-range reasoning.
  • Apply change-point detection with the ruptures library to node-level temporal sequences.
  • Identify shifts in entity activity or graph structure.
  • Use detected change points to segment the video into coherent temporal intervals.
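
A minimal sketch of this segmentation step with the ruptures library, where the per-frame feature (entity-presence vectors), the cost model, and the penalty are illustrative choices rather than the paper's exact configuration:

```python
import numpy as np
import ruptures as rpt  # change-point detection library

def detect_temporal_communities(frame_graphs: list, penalty: float = 3.0) -> list:
    """Segment the frame sequence into temporal communities via change-point detection."""
    # Encode each frame as a binary presence vector over all entities seen in the video.
    vocab = sorted({name for g in frame_graphs for name in g.entities})
    signal = np.array([[1.0 if name in g.entities else 0.0 for name in vocab]
                       for g in frame_graphs])
    # PELT with an RBF cost; model and penalty are illustrative.
    breakpoints = rpt.Pelt(model="rbf", min_size=2).fit(signal).predict(pen=penalty)
    # ruptures returns exclusive end indices (including the final index);
    # convert them into (start, end) intervals.
    starts = [0] + breakpoints[:-1]
    return list(zip(starts, breakpoints))
```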

Multi-scale community summaries

  • Generate hierarchical summaries for each temporal community, following the GraphRAG approach.
  • Capture salient entities and key interactions within each community.
  • Preserve higher-level contextual information across longer temporal spans.
  • Enable efficient retrieval at multiple levels of abstraction.
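
A sketch of how such multi-scale summaries could be produced, where the summarize helper stands in for an LLM call and only two levels are shown for brevity (the hierarchy may be deeper):

```python
def graph_to_text(g: SceneGraph) -> str:
    """Serialize a scene graph into the (a | b | c) line format defined earlier."""
    lines = [f"({e.name} | {e.type} | {e.description})" for e in g.entities.values()]
    lines += [f"({r.source} | {r.target} | {r.description})" for r in g.relationships.values()]
    return "\n".join(lines)

def summarize_communities(frame_graphs, intervals, summarize):
    """Two levels of summaries: per-community and video-level (illustrative)."""
    community_summaries = []
    for start, end in intervals:
        segment_text = "\n".join(graph_to_text(g) for g in frame_graphs[start:end])
        community_summaries.append(summarize(
            "Summarize the salient entities and key interactions in this segment:\n"
            + segment_text))
    # A coarser summary over all community summaries preserves longer-range context.
    video_summary = summarize(
        "Summarize the overall storyline across these segment summaries:\n"
        + "\n\n".join(community_summaries))
    return community_summaries, video_summary
```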

Question answering with DRIFT search

  • Match the input question against community-level summaries.
  • Retrieve the top-K most relevant temporal communities.
  • Generate an initial answer and follow-up queries conditioned on retrieved summaries.
  • Perform localized search within corresponding frame-level scene graphs and temporal deltas.
  • Synthesize the final answer from evidence gathered across multiple temporal scales.
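
Put together, a simplified DRIFT-style loop might look like the following sketch, where embed and ask are placeholder wrappers around an embedding model and an LLM, graph_to_text is reused from above, and all names are illustrative:

```python
import numpy as np

def answer_question(question, community_summaries, summary_embeddings,
                    frame_graphs, intervals, embed, ask, top_k=3):
    """Simplified DRIFT-style question answering over temporal communities."""
    # 1) Global step: match the question against community-level summaries.
    q = embed(question)
    sims = summary_embeddings @ q / (
        np.linalg.norm(summary_embeddings, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:top_k]

    # 2) Draft an initial answer and follow-up queries from the retrieved summaries.
    context = "\n\n".join(community_summaries[i] for i in top)
    draft = ask(f"Question: {question}\n\nSegment summaries:\n{context}\n\n"
                "Give a preliminary answer and list follow-up questions to verify it.")

    # 3) Local step: search the frame-level scene graphs of the retrieved communities.
    local_evidence = []
    for i in top:
        start, end = intervals[i]
        local_evidence.append("\n".join(graph_to_text(g) for g in frame_graphs[start:end]))

    # 4) Synthesize the final answer from evidence at both temporal scales.
    return ask(f"Question: {question}\n\nDraft answer and follow-ups:\n{draft}\n\n"
               "Frame-level evidence:\n" + "\n\n".join(local_evidence))
```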

GraphRAG organizes graphs without explicit temporal structure, relying on top-down community segmentation and bottom-up summaries. In contrast, THSG explicitly models temporal structure: it performs bottom-up segmentation driven by temporal changes and generates bottom-up temporal summaries that reflect the evolution of scene graphs over time.