OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
Key Idea: Proposes an efficient, single-stage model for dynamic scene graph generation directly from video.
Relevance: Demonstrates the feasibility of learning spatio-temporal relationships end-to-end, but lacks a hierarchical structure or integration with language.
Limitation: Focuses on localized, frame-to-frame dynamics without modeling long-term temporal abstractions.
Video ReCap: Recursive Captioning of Hour-Long Videos
Key Idea: Introduces a hierarchical approach to video captioning via recursive temporal summarization.
Relevance: Shows the power of hierarchical modeling for long-form video understanding.
Limitation: Designed for caption generation, not structured reasoning or graph-based representation.
GraphVQA: Language-Guided Graph Neural Networks for Scene Graph Question Answering
Key Idea: Uses question-guided graph neural networks to perform reasoning over scene graphs for visual question answering.
Relevance: Highlights the benefits of structured, interpretable reasoning using scene graphs aligned with language.
Limitation: Focuses on static graphs without modeling temporal changes or multi-scale event structures.
References
[1] Wang, Guan, et al. “OED: towards one-stage end-to-end dynamic scene graph generation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[2] Islam, Md Mohaiminul, et al. “Video recap: Recursive captioning of hour-long videos.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[3] Liang, Weixin, Yanhao Jiang, and Zixuan Liu. “GraphVQA: Language-guided graph neural networks for graph-based visual question answering.” arXiv preprint arXiv:2104.10283 (2021).
[4] Nag, Sayak, et al. “Unbiased scene graph generation in videos.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
