Related work

OED: Towards One-stage End-to-End Dynamic Scene Graph Generation

Key Idea: Proposes an efficient, single-stage model for dynamic scene graph generation directly from video.

Relevance: Demonstrates the feasibility of learning spatio-temporal relationships end-to-end, but lacks a hierarchical structure or integration with language.

Limitation: Focuses on localized, frame-to-frame dynamics without modeling long-term temporal abstractions.
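To make the object of study concrete, a dynamic scene graph can be viewed as a per-frame set of (subject, predicate, object) triplets. The sketch below is a minimal illustrative data structure, not OED's actual model; all names (`Relation`, `graph_at_frame`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """One edge of a scene graph: a (subject, predicate, object) triplet."""
    subject: str
    predicate: str
    obj: str

def graph_at_frame(dynamic_graph: dict, t: int) -> set:
    """Return the triplet set for frame t (empty if the frame is unseen)."""
    return dynamic_graph.get(t, set())

# Example dynamics: a person picks up a cup between frames 0 and 1.
dsg = {
    0: {Relation("person", "next_to", "cup")},
    1: {Relation("person", "holding", "cup")},
}
```

The frame-indexed dictionary makes the paper's limitation visible: relations are stored per frame, so any longer-horizon event ("picking up") exists only implicitly across consecutive entries.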

Video ReCap: Recursive Captioning of Hour-Long Videos

Key Idea: Introduces a hierarchical approach to video captioning via recursive temporal summarization.

Relevance: Shows the power of hierarchical modeling for long-form video understanding.

Limitation: Designed for caption generation, not structured reasoning or graph-based representation.
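The recursive-summarization idea can be sketched as a simple fan-in reduction: caption short segments, then repeatedly summarize groups of neighboring captions until one caption covers the whole video. This is a schematic of the control flow only, assuming a pluggable `summarize` callable (a real system would invoke a captioning model at each level).

```python
from typing import Callable, List

def recursive_caption(
    segments: List[str],
    summarize: Callable[[List[str]], str],
    fan_in: int = 3,
) -> str:
    """Collapse segment captions bottom-up: group `fan_in` neighbors,
    summarize each group, and repeat until a single caption remains."""
    level = segments
    while len(level) > 1:
        level = [
            summarize(level[i : i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]

# Toy stand-in summarizer: concatenates captions in temporal order.
toy_summarizer = lambda caps: " then ".join(caps)
```

The fan-in parameter controls how many levels of temporal abstraction emerge: hour-long videos need only logarithmically many recursion levels over their clip captions.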

GraphVQA: Language-Guided Graph Neural Networks for Graph-Based Visual Question Answering

Key Idea: Uses question-guided graph neural networks to perform reasoning over scene graphs for visual question answering.

Relevance: Highlights the benefits of structured, interpretable reasoning over scene graphs aligned with language.

Limitation: Focuses on static graphs without modeling temporal changes or multi-scale event structures.
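The core mechanism, conditioning graph reasoning on the question, can be illustrated with a one-step question-guided readout: each node is scored by its similarity to a question embedding and the scores are normalized with a softmax. This is a deliberately minimal stand-in for GraphVQA's language-guided message passing, with all names hypothetical.

```python
import math

def question_guided_readout(node_feats: dict, question_vec: list) -> dict:
    """Weight scene-graph nodes by softmax(dot(node, question)):
    a single-layer sketch of question-conditioned graph attention."""
    scores = {
        name: sum(a * b for a, b in zip(feat, question_vec))
        for name, feat in node_feats.items()
    }
    peak = max(scores.values())  # subtract max for numerical stability
    exps = {name: math.exp(s - peak) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

# Toy graph: the question vector points at the "cup" feature direction.
nodes = {"person": [1.0, 0.0], "cup": [0.0, 1.0]}
attention = question_guided_readout(nodes, [0.0, 1.0])
```

Because the node features here carry no temporal index, the sketch also makes the stated limitation concrete: the graph being attended over is a static snapshot.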

References

[1] Wang, Guan, et al. "OED: Towards One-Stage End-to-End Dynamic Scene Graph Generation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2] Islam, Md Mohaiminul, et al. "Video ReCap: Recursive Captioning of Hour-Long Videos." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[3] Liang, Weixin, Yanhao Jiang, and Zixuan Liu. "GraphVQA: Language-Guided Graph Neural Networks for Graph-Based Visual Question Answering." arXiv preprint arXiv:2104.10283 (2021).

[4] Nag, Sayak, et al. "Unbiased Scene Graph Generation in Videos." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.