We use the approach proposed in DynamicNeRF [1] as our baseline. The overall architecture of DynamicNeRF consists of two NeRF models: one for the static scene and one for the dynamic objects. Both NeRF models take the following inputs:

  1. images of the scene, containing dynamic objects, captured from different viewpoints and at different timesteps
  2. ground-truth masks of the dynamic objects in the scene

The static model implements vanilla NeRF, querying color and density given the position and viewing direction. The dynamic model takes not only the position and viewing direction but also the timestep, and outputs the color and density as well as a blending factor at each point and timestep. The blending factor ranges from 0 to 1 and denotes the importance of the dynamic model's prediction at each point. The image is finally rendered using volumetric rendering, compositing the outputs of the two models along each ray.
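As a rough illustration of how the blending factor enters the volumetric rendering step, the sketch below composites static and dynamic outputs along a single ray. This is a simplified NumPy version under our own assumptions (function and variable names are ours, and the exact blending equation in DynamicNeRF may differ in detail); it blends the two models' densities and colors with the per-sample blending factor before the standard alpha-compositing sum.

```python
import numpy as np

def blended_render(rgb_s, sigma_s, rgb_d, sigma_d, blend, deltas):
    """Composite static and dynamic NeRF outputs along one ray (sketch).

    rgb_s, rgb_d   : (N, 3) colors from the static / dynamic model
    sigma_s, sigma_d : (N,) volume densities
    blend          : (N,) blending factor in [0, 1]; 1 means the
                     dynamic model fully dominates at that sample
    deltas         : (N,) distances between consecutive ray samples
    """
    # Blend densities: the dynamic model dominates where blend -> 1
    sigma = blend * sigma_d + (1.0 - blend) * sigma_s
    alpha = 1.0 - np.exp(-sigma * deltas)           # per-sample opacity
    # Transmittance: probability the ray reaches sample i unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha                         # compositing weights
    # Blend colors with the same per-sample factor
    rgb = blend[:, None] * rgb_d + (1.0 - blend)[:, None] * rgb_s
    return (weights[:, None] * rgb).sum(axis=0)     # (3,) pixel color
```

With blend fixed at 0 this reduces to rendering the static model alone, and at 1 to the dynamic model alone, which matches the intuition that the blending factor selects between the two branches per point.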

An overview of DynamicNeRF's rendering pipeline is shown in the figure below.

Approach used in DynamicNeRF [1]

Some baseline results are shown in the videos below.

DynamicNeRF on Balloon scene
DynamicNeRF on Truck scene


  1. Gao, Chen, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. “Dynamic view synthesis from dynamic monocular video.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5712-5721. 2021.