Abstract
This project explores reconstructing 4D scenes (3D space plus time) from a single monocular camera during casual capture. Traditional 4D reconstruction methods often rely on specialized equipment, controlled environments, or multiple cameras. This project aims to overcome these limitations with a system that reconstructs dynamic scenes from a single monocular camera during everyday activities, such as walking or hand-held recording.
Background
Reconstructing 4D scenes has traditionally involved specialized equipment or controlled environments. Conventional methods typically rely on:
– Multiple Cameras: Stereo rigs or camera arrays capture the scene from various viewpoints, enabling triangulation for accurate depth information.
– Depth Sensors: LiDAR or time-of-flight cameras directly measure depth at each image point, providing dense and accurate depth maps.
– Controlled Environments: Studios or motion capture stages utilize precisely calibrated camera setups and controlled lighting to simplify reconstruction.
While offering high-quality reconstructions, these techniques are impractical for capturing everyday scenes due to:
– Cost: Specialized equipment can be expensive and not readily available.
– Complexity: Setting up and calibrating multiple cameras or depth sensors can be time-consuming and require technical expertise.
– Limited Applicability: Controlled environments restrict the types of scenes that can be captured.
Method
1. Hybrid Scene Representation:
![](https://mscvprojects.ri.cmu.edu/f23team19/wp-content/uploads/sites/96/2024/04/image-1-1024x576.png)
– The pipeline begins by capturing a dynamic scene with the monocular camera. A hybrid representation is then constructed that combines spatial and temporal information: three 2D feature planes form a tri-plane approximation of the scene geometry (see the sketch after this list).
– A deformation field is additionally incorporated to capture how objects within the scene move and change shape over time.
– This combined representation (tri-plane plus deformation field) is fed into a Multi-Layer Perceptron (MLP).
– The MLP acts as a regressor, predicting the canonical 3D coordinates (x’, y’, z’) of points within the scene (see the deformation sketch after the figure below).
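To make the tri-plane step concrete, here is a minimal sketch in PyTorch (an assumption; the page does not publish this code, and the resolution and feature width are illustrative). It projects each 3D point onto the xy, xz, and yz planes, bilinearly samples a learnable feature grid on each plane, and concatenates the three results.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlane(nn.Module):
    """Tri-plane scene features: three learnable 2D grids (xy, xz, yz)
    that together approximate a full 3D feature volume."""

    def __init__(self, resolution: int = 128, feat_dim: int = 32):
        super().__init__()
        # One (1, C, R, R) feature grid per axis-aligned plane.
        self.planes = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(1, feat_dim, resolution, resolution))
            for _ in range(3)
        ])

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        """xyz: (N, 3) points in [-1, 1]^3 -> (N, 3 * feat_dim) features."""
        # Orthogonal projections of each point onto the three planes.
        projections = (xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]])
        feats = []
        for plane, uv in zip(self.planes, projections):
            grid = uv.view(1, -1, 1, 2)                         # (1, N, 1, 2)
            f = F.grid_sample(plane, grid, align_corners=True)  # (1, C, N, 1)
            feats.append(f.squeeze(0).squeeze(-1).t())          # (N, C)
        return torch.cat(feats, dim=-1)
```

Summing the three plane features instead of concatenating them is an equally common design choice; either way, the resulting per-point feature vector is what the downstream MLP consumes.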
![](https://mscvprojects.ri.cmu.edu/f23team19/wp-content/uploads/sites/96/2024/04/image-1024x576.png)
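The deformation field and MLP regressor can be sketched in the same hedged spirit: a small MLP takes an observed point and a timestamp and regresses the canonical coordinates (x’, y’, z’), in the style of D-NeRF-like deformation fields. The layer sizes, the residual (offset) formulation, and the normalized timestamp are assumptions, not the project's published implementation.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """MLP regressor mapping an observed point (x, y, z) at time t
    to canonical coordinates (x', y', z')."""

    def __init__(self, hidden: int = 128, depth: int = 4):
        super().__init__()
        layers, in_dim = [], 4  # input is (x, y, z, t)
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        self.net = nn.Sequential(*layers, nn.Linear(hidden, 3))

    def forward(self, xyz: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        """xyz: (N, 3), t: (N, 1) -> (N, 3) canonical coordinates."""
        # Predict an offset rather than absolute coordinates, so the
        # field starts close to the identity mapping.
        return xyz + self.net(torch.cat([xyz, t], dim=-1))

# Usage: warp observed points at time t = 0.5 into the canonical frame;
# the resulting (x', y', z') would then index the tri-plane features
# from the previous sketch.
points = torch.rand(1024, 3) * 2 - 1       # observed points in [-1, 1]^3
t = torch.full((1024, 1), 0.5)             # normalized timestamp
canonical = DeformationField()(points, t)  # (1024, 3)
```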
Results
![](https://mscvprojects.ri.cmu.edu/f23team19/wp-content/uploads/sites/96/2024/04/image-2-1024x747.png)
![](https://mscvprojects.ri.cmu.edu/f23team19/wp-content/uploads/sites/96/2024/05/Screenshot-2024-05-01-120553-1024x352.png)
![](https://mscvprojects.ri.cmu.edu/f23team19/wp-content/uploads/sites/96/2024/04/Screenshot-2024-04-26-162611-1024x652.png)
![](https://mscvprojects.ri.cmu.edu/f23team19/wp-content/uploads/sites/96/2024/04/Screenshot-2024-04-26-145930.png)
Presentation
Poster
Code
Previous Project
![](https://mscvprojects.ri.cmu.edu/f23team19/wp-content/uploads/sites/96/2023/12/Screen-Shot-2023-12-02-at-11.53.25-AM.png)
This work was done by Asrar Alruwayqi, an MSCV student, advised by Prof. Shubham Tulsiani.
Bio
I am currently pursuing a Master’s degree in Computer Vision at the Robotics Institute. Previously, I completed a Bachelor’s degree in Computer Science. Afterward, I worked as a Research Engineer at the National AI Center in Saudi Arabia, where I gained valuable experience and mentorship in the field. My passion for computer vision is deeply rooted, with a special emphasis on 3D vision and computational geometry.
References and Credits
Monocular Dynamic View Synthesis: A Reality Check.
D-NeRF: Neural Radiance Fields for Dynamic Scenes.
Tensor4D: Efficient Neural 4D Decomposition for High-Fidelity Dynamic Reconstruction and Rendering.
HexPlane: A Fast Representation for Dynamic Scenes.
Nerfies: Deformable Neural Radiance Fields.
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.
TensoRF: Tensorial Radiance Fields.