Our method comprises two tightly coupled components: a video indexing stage that incrementally constructs temporally evolving scene graphs from video frames, and a video question answering (VQA) stage that performs graph-based retrieval and reasoning over the resulting indexed scene graphs.
1. Video Indexing: Frame-Level Scene Graphs with Temporal Deltas
Scene Graph Definition
We adopt a scene graph representation tailored for temporal indexing.
- Entity format: (entity_name | entity_type | description)
- Relationship format: (source | target | description)
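For instance, a single frame might yield entries such as the following; the scene and every name in it are purely illustrative:

```python
# Illustrative scene-graph entries for one frame (hypothetical scene).
entities = [
    ("man", "person", "a man in a red jacket standing near the doorway"),
    ("door", "object", "a wooden door, currently closed"),
]
relationships = [
    ("man", "door", "the man reaches toward the door handle"),
]
```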
Video Indexing Pipeline

Frame-wise graph generation
- Sample video frames at 1 FPS.
- Use GPT-4o-mini to extract scene graph information for each frame.
- Condition extraction on both the current frame image and the previous frame’s scene graph (a minimal extraction sketch follows this step).
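A minimal sketch of this step, assuming the official openai Python client and a JSON-structured response; the prompt wording and function name are illustrative, not the exact prompt used:

```python
# Sketch of frame-wise delta extraction with GPT-4o-mini.
import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_frame_delta(frame_path: str, previous_graph: dict) -> dict:
    """Ask GPT-4o-mini for the scene-graph delta of one frame, conditioned on the previous graph."""
    with open(frame_path, "rb") as f:
        frame_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Previous frame scene graph:\n" + json.dumps(previous_graph) +
                          "\nReturn JSON with added/removed/updated entities and relationships, "
                          "plus a list of unchanged background object names.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```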
Temporal delta extraction
- Output only differences from the previous frame:
  - newly appeared or disappeared entities and relationships
  - updated descriptions of existing entities or relationships
- Avoid redundant storage by capturing only meaningful changes (an example delta payload is sketched below).
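A per-frame delta could then be represented roughly as the record below; the field names are an assumption for illustration, not a fixed schema:

```python
# Hypothetical delta payload for one frame (field names and contents are illustrative).
frame_delta = {
    "added_entities": [("dog", "animal", "a small dog entering from the left")],
    "removed_entities": ["bicycle"],
    "updated_entities": [("man", "person", "the man is now holding the door open")],
    "added_relationships": [("dog", "man", "the dog approaches the man")],
    "removed_relationships": [("man", "door", "the man reaches toward the door handle")],
    "background_objects": ["table", "window", "lamp"],
}
```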
Scene graph accumulation
- Merge predicted deltas with the previous frame’s scene graph.
- Produce the current frame’s complete scene graph while maintaining temporal continuity (see the merge sketch below).
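Accumulation can be sketched as a simple merge of the delta onto the previous graph; the dictionary layout matches the hypothetical delta format above and is itself an assumption:

```python
# Sketch of accumulating a delta onto the previous frame's graph.
# Entities are keyed by name, relationships by (source, target); this layout is illustrative.
def accumulate(prev_graph: dict, delta: dict) -> dict:
    graph = {
        "entities": dict(prev_graph.get("entities", {})),
        "relationships": dict(prev_graph.get("relationships", {})),
    }
    # Entity changes: removals first, then additions and description updates.
    for name in delta.get("removed_entities", []):
        graph["entities"].pop(name, None)
    for name, etype, desc in delta.get("added_entities", []) + delta.get("updated_entities", []):
        graph["entities"][name] = {"type": etype, "description": desc}
    # Relationship changes keyed by (source, target).
    for src, tgt, _desc in delta.get("removed_relationships", []):
        graph["relationships"].pop((src, tgt), None)
    for src, tgt, desc in delta.get("added_relationships", []):
        graph["relationships"][(src, tgt)] = desc
    return graph
```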
Background object handling
- List visible objects whose states remain unchanged as background objects.
- Include only object names without descriptions or relationships to reduce redundancy.
Prompt design for video indexing
- Instruct the model to output only entities and relationships whose presence or description differs from the previous frame.
- Include concise evidence from the current frame for each changed element.
- List all other visible but inactive objects as background (an example prompt skeleton follows this list).
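Put together, the indexing prompt might be organized along these lines; the wording below is a hedged reconstruction, not the verbatim prompt:

```python
# Hypothetical prompt skeleton for the indexing stage (wording is illustrative).
INDEXING_PROMPT = """\
You are given the previous frame's scene graph and the current frame image.
1. Output ONLY entities and relationships whose presence or description differs
   from the previous frame, using the formats
   (entity_name | entity_type | description) and (source | target | description).
2. For every changed element, add one short piece of visual evidence from the
   current frame.
3. List all other visible but inactive objects under "background", as names only.
Previous scene graph:
{previous_graph}
"""
```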

2. Video Question Answering: GraphRAG-Inspired Community Summaries and DRIFT Search
Temporal community construction
- Cluster frame-level scene graphs into temporal communities to support long-range reasoning.
- Apply change-point detection with the ruptures library to node-level temporal sequences.
- Identify shifts in entity activity or graph structure.
- Use detected change points to segment the video into coherent temporal intervals (see the segmentation sketch below).
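A possible realization of the segmentation step with the ruptures library, assuming per-frame activity features have already been collected into a NumPy array; the feature choice and penalty value are assumptions:

```python
# Sketch of temporal segmentation with ruptures change-point detection.
import numpy as np
import ruptures as rpt

# activity: one row per frame, e.g. [num_entities, num_relationships, num_changed_nodes]
activity = np.array([[5, 4, 0], [5, 4, 1], [9, 7, 6], [9, 8, 1], [3, 2, 7], [3, 2, 0]])

algo = rpt.Pelt(model="rbf").fit(activity)   # kernel-based cost handles multivariate signals
change_points = algo.predict(pen=3)          # larger penalty -> fewer, coarser segments
# change_points lists segment end indices, e.g. [2, 4, 6] -> intervals [0,2), [2,4), [4,6)
```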
Multi-scale community summaries
- Generate hierarchical summaries for each temporal community inspired by GraphRAG.
- Capture salient entities and key interactions within each community.
- Preserve higher-level contextual information across longer temporal spans.
- Enable efficient retrieval at multiple levels of abstraction (a summarization sketch follows this list).
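One way to realize the two summary levels, assuming a hypothetical summarize_with_llm helper that wraps a GPT-4o-mini call and returns a short text summary:

```python
# Sketch of multi-scale community summarization (summarize_with_llm is a hypothetical helper).
def build_community_summaries(communities, summarize_with_llm):
    """communities: list of temporal communities, each a list of per-frame scene graphs."""
    interval_summaries = []
    for community in communities:
        text = "\n".join(str(graph) for graph in community)
        interval_summaries.append(summarize_with_llm(
            "Summarize the salient entities and key interactions:\n" + text))
    # A single higher-level pass preserves context across longer temporal spans.
    video_summary = summarize_with_llm(
        "Combine these interval summaries into an overview of the whole video:\n"
        + "\n".join(interval_summaries))
    return interval_summaries, video_summary
```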
Question answering with DRIFT search
- Match the input question against community-level summaries.
- Retrieve the top-K most relevant temporal communities.
- Generate an initial answer and follow-up queries conditioned on retrieved summaries.
- Perform localized search within corresponding frame-level scene graphs and temporal deltas.
- Synthesize the final answer from evidence gathered across multiple temporal scales (a retrieval-and-answer sketch follows).
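A condensed sketch of this answering loop, assuming precomputed embeddings for the community summaries and hypothetical embed / ask_llm helpers; top_k and the data layout are assumptions:

```python
# Condensed sketch of the DRIFT-style answering loop (embed and ask_llm are hypothetical helpers).
import numpy as np

def answer_question(question, summaries, summary_vecs, frame_graphs, embed, ask_llm, top_k=3):
    # 1. Match the question against community-level summaries by cosine similarity.
    q = embed(question)
    scores = summary_vecs @ q / (np.linalg.norm(summary_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(scores)[::-1][:top_k]

    # 2. Initial answer and follow-up queries conditioned on the retrieved summaries.
    context = "\n".join(summaries[i] for i in top)
    draft = ask_llm(f"Question: {question}\nCommunity summaries:\n{context}\n"
                    "Give a preliminary answer and list follow-up queries.")

    # 3. Localized search within the frame-level scene graphs of the retrieved communities.
    local_evidence = "\n".join(str(g) for i in top for g in frame_graphs[i])

    # 4. Synthesize the final answer from evidence at both temporal scales.
    return ask_llm(f"Question: {question}\nDraft and follow-ups:\n{draft}\n"
                   f"Frame-level evidence:\n{local_evidence}\nGive the final answer.")
```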


