Related Work


Novel view synthesis (NVS) is a central problem in 3D vision, with applications spanning virtual reality (VR), augmented reality (AR), and film production. It generates images of a scene from arbitrary viewpoints or timestamps, enabling immersive experiences and richer visual storytelling. NVS for dynamic scenes is particularly challenging: object motion is complex and unpredictable, so algorithms must handle non-rigid deformation while preserving temporal coherence.

DynamicFusion [1] is the first dense SLAM system capable of reconstructing non-rigidly deforming scenes in real time. It reconstructs a rigid canonical space and estimates a per-frame warp field that transforms the canonical frame into the current observation frame. The method adapts the warp-field structure to capture newly observed regions and continuously updates the canonical model as new depth data arrives, enabling robust tracking of non-rigid deformation.
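The warp-field idea can be sketched as blending per-node rigid transforms to deform a canonical point, weighted by each node's distance to the point. This is a simplified illustration: DynamicFusion actually uses dual-quaternion blending rather than the naive linear blend below, and all names and parameter values here are invented for the sketch.

```python
import numpy as np

def warp_point(x, node_pos, node_rot, node_trans, sigma=0.1):
    """Warp a canonical-space point into the observation frame by
    blending per-node rigid transforms.
    Simplification: linear blending instead of dual-quaternion blending."""
    # Gaussian RBF weights from squared distance to each deformation node.
    d2 = np.sum((node_pos - x) ** 2, axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum()
    # Apply each node's rigid transform to x, then blend the results.
    warped = np.einsum('nij,j->ni', node_rot, x) + node_trans
    return (w[:, None] * warped).sum(axis=0)

# Toy example: two nodes with identity rotations and opposite translations.
nodes = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
rots = np.stack([np.eye(3)] * 2)
trans = np.array([[0.0, 0.1, 0.0], [0.0, -0.1, 0.0]])
y = warp_point(np.array([0.5, 0.0, 0.0]), nodes, rots, trans)
# Midway between the nodes, the opposite translations cancel out.
```

In the full system these node transforms are optimized per frame against the incoming depth map, and new nodes are inserted as previously unseen surface regions appear.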

The field of novel-view synthesis gained widespread attention following the introduction of NeRF [2]. Numerous "dynamic NeRF" papers have since extended NeRF to handle moving objects. Nerfies [3] handles non-rigidly deforming objects by optimizing an additional continuous volumetric deformation field that warps each observed point into a canonical 5D NeRF. The authors also propose a coarse-to-fine optimization strategy and an elastic regularization of the deformation field to improve robustness.
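The core idea of such a deformation field can be sketched as a small network that maps an observed point, together with a per-frame latent deformation code, to an additive offset that carries the point into the canonical frame. The tiny random-weight MLP below is purely illustrative (Nerfies uses a much larger positionally-encoded MLP); all sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny deformation MLP: (3D point + 8D frame code) -> 3D offset.
W1 = rng.normal(scale=0.1, size=(3 + 8, 32))
W2 = rng.normal(scale=0.1, size=(32, 3))

def deform_to_canonical(x, frame_code):
    """Warp an observed point into the canonical frame by predicting an
    additive offset from the point and a per-frame latent code (sketch)."""
    h = np.maximum(np.concatenate([x, frame_code]) @ W1, 0.0)  # ReLU layer
    return x + h @ W2

x_obs = np.array([0.2, -0.1, 0.5])   # point sampled along a camera ray
code = rng.normal(size=8)            # latent deformation code for one frame
x_can = deform_to_canonical(x_obs, code)
```

The canonical point `x_can` would then be fed to the shared NeRF; because the offset is a smooth function of position, the elastic regularization mentioned above can penalize non-rigid distortion of its Jacobian.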

3D Gaussian Splatting [4] has recently emerged as a compelling method for representing 3D scenes, offering real-time rendering and substantially shorter training times than NeRF. 4D Gaussian Splatting [5] extends it to dynamic scenes. Inspired by the HexPlane approach, it encodes Gaussian features with a decomposed spatio-temporal neural voxel representation, and a lightweight multi-layer perceptron (MLP) then predicts Gaussian deformations at novel timestamps.
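The decomposed spatio-temporal encoding can be sketched as feature lookups on six 2D planes spanning the (x, y, z, t) axes, fused into one feature vector that a small head maps to a deformation. This is a deliberately simplified illustration: it uses nearest-neighbor lookup and a single linear head where HexPlane-style methods use bilinear/multi-resolution sampling and an MLP, and the resolutions and names are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
R, F = 16, 4  # grid resolution and feature channels (illustrative values)
# One 2D feature grid per axis pair covering space (xy, xz, yz)
# and space-time (xt, yt, zt).
planes = {p: rng.normal(size=(R, R, F)) for p in
          ['xy', 'xz', 'yz', 'xt', 'yt', 'zt']}

def hexplane_features(x, y, z, t):
    """Look up features on all six planes for normalized (x, y, z, t) in
    [0, 1] and fuse them by elementwise product (nearest-neighbor sketch)."""
    coords = {'x': x, 'y': y, 'z': z, 't': t}
    feat = np.ones(F)
    for name, grid in planes.items():
        i = int(coords[name[0]] * (R - 1))
        j = int(coords[name[1]] * (R - 1))
        feat *= grid[i, j]
    return feat

# Hypothetical deformation head: fused features -> position offset
# for one Gaussian at the queried timestamp.
W = rng.normal(scale=0.1, size=(F, 3))
offset = hexplane_features(0.3, 0.7, 0.5, 0.25) @ W
```

Because the planes factorize the 4D volume into 2D grids, memory grows quadratically rather than quartically with resolution, which is what makes per-timestamp deformation queries cheap enough for real-time rendering.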

References

[1] Newcombe, Richard A., Dieter Fox, and Steven M. Seitz. "DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[2] Mildenhall, Ben, et al. "NeRF: Representing scenes as neural radiance fields for view synthesis." Communications of the ACM 65.1 (2021).
[3] Park, Keunhong, et al. "Nerfies: Deformable neural radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
[4] Kerbl, Bernhard, et al. "3D Gaussian splatting for real-time radiance field rendering." ACM Transactions on Graphics 42.4 (2023).
[5] Wu, Guanjun, et al. "4D Gaussian splatting for real-time dynamic scene rendering." arXiv preprint arXiv:2310.08528 (2023).