Neural Radiance Fields

Since their introduction in the seminal work of Mildenhall et al. [1], neural radiance fields (NeRF) and their many follow-up works have shown great success in modeling 3D scenes, providing an effective approach to 3D reconstruction and novel view synthesis. Compared to other scene representations (meshes, point clouds, volumes, etc.), NeRF is better at capturing fine geometric detail, which is crucial for creating realistic human models. With NeRF we are able not only to reconstruct the face and ears with more definition but also to include the hairstyle in the model.


NeRF's main limitation is its long training and rendering time. Since the original NeRF paper, much work has been done to reduce reconstruction time and improve rendering quality. For our project we base our experiments on a recent NeRF variant, TensoRF [2]. TensoRF achieves higher reconstruction quality in a shorter training time than the original NeRF, and its speed-up is obtained with a standard PyTorch implementation (no custom CUDA kernels), which provides a good foundation for our experiments.
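For reference, the core of any NeRF variant (TensoRF included) is the same volume-rendering composite along each ray. A minimal sketch in PyTorch, assuming per-sample densities and colors have already been predicted by the model (the function name and tensor layout are illustrative, not from TensoRF's codebase):

```python
import torch

def render_ray(sigma, rgb, t_vals):
    """Composite per-sample density/color along one ray.

    sigma:  (N,)   non-negative densities at N samples
    rgb:    (N, 3) colors at the samples
    t_vals: (N,)   sample depths along the ray (increasing)
    """
    # Distances between adjacent samples; pad the final interval.
    deltas = torch.cat([t_vals[1:] - t_vals[:-1], torch.tensor([1e10])])
    alpha = 1.0 - torch.exp(-sigma * deltas)            # opacity per sample
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                              # (N,)
    color = (weights[:, None] * rgb).sum(dim=0)          # rendered pixel
    depth = (weights * t_vals).sum()                     # expected depth
    return color, depth, weights
```

The expected depth returned here is what a depth-supervised variant can compare against sensor depth.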


We use an iPhone 13 Pro to capture two types of data from multiple participants. Since NeRF requires a known camera pose for each training frame, we use COLMAP [3, 4] to preprocess the RGB video, selecting key frames and estimating their camera poses.

  • The participant uses our app to capture a selfie-style video covering both sides and the front of their face. This produces a ~30s RGB and depth video.
  • The participant stands still while a second person takes a 360-degree video of the participant. This produces a ~30s RGB video.

Depth Supervision

We utilize the captured depth data to compute a depth loss that helps supervise NeRF training. These experiments are conducted on the selfie-style data. By running experiments with and without the depth loss, we find that adding it improves reconstruction quality. We use PSNR as the quantitative metric.
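A minimal sketch of such a combined RGB-D objective, assuming the renderer produces an expected ray-termination depth alongside the color (the function name and the weight `lam` are illustrative, not values from our experiments):

```python
import torch

def rgbd_loss(pred_rgb, gt_rgb, pred_depth, gt_depth, lam=0.1):
    """Photometric loss plus a masked depth term.

    pred_depth is the expected ray-termination depth from volume
    rendering; gt_depth comes from the iPhone depth stream (0 where
    the sensor gave no reading). lam is a tunable hyperparameter.
    """
    rgb_loss = ((pred_rgb - gt_rgb) ** 2).mean()
    valid = gt_depth > 0                     # ignore pixels with no depth
    depth_loss = ((pred_depth - gt_depth)[valid] ** 2).mean()
    return rgb_loss + lam * depth_loss
```

Masking invalid depth readings matters in practice, since the iPhone depth stream is sparse or unreliable around hair and silhouettes.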

| Test PSNR | data 1 | data 2 | data 3 | data 4 |
| --------- | ------ | ------ | ------ | ------ |
| w/o depth | 17.94  | 16.24  | 13.08  | 14.91  |
| w/ depth  | 18.14  | 16.67  | 13.56  | 15.14  |

Quantitative result on depth supervision
Visual result on ears

We noticed that after adding depth supervision, there is a significant improvement in the reconstruction of the subject's ears, as shown in the visual result above.

Adding the depth loss also produces sharper results on the subject's cheeks.

Front Face and Side Face Ratio

During our experiments with the selfie-style data, we find that the video sequences contain more front-face frames than side-face frames, and that the front-face frames also have higher quality. Deliberately adding more side-face frames to the NeRF training set results in better reconstruction quality.
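One simple way to implement this rebalancing is to oversample frames whose estimated head yaw exceeds a threshold. A sketch under assumed inputs (the yaw estimate, the 30-degree threshold, and the repeat count are all illustrative choices, not values from our experiments):

```python
def rebalance_frames(frame_ids, yaw_deg, side_thresh=30.0, repeat=2):
    """Oversample side-face frames in the training set.

    yaw_deg: estimated head yaw per frame (e.g. derived from the
    COLMAP camera pose relative to a frontal reference). Frames
    beyond side_thresh degrees are treated as side views and
    repeated `repeat` times in the output list.
    """
    out = []
    for fid, yaw in zip(frame_ids, yaw_deg):
        out.append(fid)
        if abs(yaw) > side_thresh:
            out.extend([fid] * (repeat - 1))  # duplicate side-face frames
    return out
```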

| Test PSNR               | data 1 | data 2 | data 3 |
| ----------------------- | ------ | ------ | ------ |
| Original video          | 18.14  | 16.67  | 13.56  |
| Adding side face frames | 18.26  | 17.75  | 15.45  |

Quantitative result on adding side face frames
Visual result on side face


Finally, we present visual results on 360-degree captured data. This type of data is captured with the iPhone's back camera, which has a higher resolution, and the 360-degree sweep captures a more complete view of the head and hair. Overall it achieves better novel view synthesis of the head model.


  1. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi and R. Ng. “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.” ECCV. 2020.
  2. A. Chen, Z. Xu, A. Geiger, J. Yu and H. Su. “TensoRF: Tensorial Radiance Fields.” ECCV. 2022.
  3. J. L. Schönberger and J.-M. Frahm. “Structure-from-Motion Revisited.” CVPR. 2016.
  4. J. L. Schönberger, E. Zheng, M. Pollefeys and J.-M. Frahm. “Pixelwise View Selection for Unstructured Multi-View Stereo.” ECCV. 2016.



With the development of VR/AR, users increasingly want to animate their own faces in virtual reality.

The 3D scanners commonly used in industry have several limitations:

  • Time-consuming
  • Expensive
3D scanner in industry

Our Goal

Users can capture their 3D face just using their mobile phone!

Our pipeline: RGB-D video from the selfie camera → 3D digital human heads

Proposed Pipeline


  • We face real-world, in-the-wild iPhone data
    • High variance in video quality
    • Unknown camera extrinsics
  • How to utilize the depth information

iOS Data Capture App

We implemented our own app for data collection.

  • It records RGB-D data: video + depth + camera calibration
  • Once recording stops, the app streams the data to a server over Wi-Fi (using the Multipeer Connectivity framework)
Example of data


2022 Fall



2022 Spring


first part

second part




Matthew P. O’Toole (Advisor)

Matthew P. O’Toole is an assistant professor at Carnegie Mellon University’s Robotics Institute.


Chen Cao (Sponsor)

I am a Research Scientist at Reality Labs Pittsburgh. I was previously a Senior Research Scientist at Snap. I obtained my Ph.D. from Zhejiang University (ZJU), supervised by Prof. Kun Zhou, where I was a member of the Graphics and Parallel Systems Lab. I received my B.Eng. degree from the College of Computer Science & Technology, Zhejiang University, in 2010. My research concentrates on computer graphics.

Yu Han

I am Yu Han, a student in the Master of Science in Computer Vision program at Carnegie Mellon University. I received my B.S. degree from Peking University, majoring in Computer Science and Technology. From 2019 to 2021, I worked as an intern in STRUCT, advised by Professor Jiaying Liu. From 2020 to 2021, I worked remotely with Prof. Jianbo Shi at the GRASP Lab, University of Pennsylvania. I have also spent some great time at Microsoft Research Asia.

My research interests include GANs, computer vision, and computer graphics.

Wenyu Xia

I am an M.S. in Computer Vision student at CMU. I graduated from Tsinghua University with a major in computer science. I am broadly interested in computer vision and computer graphics.





Key Points Detection

  • Face keypoint detection
    • Convolutional Pose Machine (CPM)
  • Ear keypoint detection
    • YOLO (ear localization) + Convolutional Pose Machine
Left: face keypoints; right: ear keypoints
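A CPM-style detector regresses one heatmap per keypoint; the 2D coordinates are then read off the heatmap maxima. A minimal sketch of that final extraction step, assuming the network has already produced the heatmaps (the network itself is omitted):

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """Convert per-keypoint heatmaps (K, H, W) into (K, 2) pixel
    coordinates (x, y) at each heatmap's maximum, plus a confidence
    score per keypoint taken from the peak value.
    """
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    idx = flat.argmax(axis=1)            # flat index of each peak
    ys, xs = np.unravel_index(idx, (H, W))
    conf = flat.max(axis=1)              # peak value as confidence
    return np.stack([xs, ys], axis=1), conf
```

For the ears, the same extraction would run inside the crop returned by the YOLO detector, with coordinates mapped back to the full frame.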

3DMM Fitting


  • Detect 2D keypoints K1 (including ears) on the RGB video with OpenPose and CPM, and record the ground-truth depth d1 from the sensor
  • Project the 3DMM (learned coefficients) with the learned R, T and the camera intrinsics to obtain 2D keypoints K2 and depth d2
  • Minimize the loss between K1 and K2, and between d1 and d2
  • Depth loss
    • Fix the correspondence between mesh vertices and pixels in the video
    • Iterate until the depth loss converges under the current correspondence, then update the correspondence → this increases the stability of the fitting
Blue key points: video, red key points: mesh
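The bullets above can be sketched as a single evaluation of the fitting objective in PyTorch. This is an illustrative reconstruction, not our exact implementation: the function name, the weight `w_d`, and the simple pinhole projection are assumptions.

```python
import torch

def fitting_loss(verts, R, T, K, kp_idx, kp2d_gt, depth_gt, w_d=0.5):
    """One evaluation of a 3DMM fitting objective.

    verts:    (V, 3) mesh vertices from the 3DMM (a function of the
              learned shape/expression coefficients)
    R, T:     learned rotation (3, 3) and translation (3,)
    K:        (3, 3) camera intrinsics
    kp_idx:   vertex indices corresponding to the detected keypoints
              (face and ears) under the fixed correspondence
    kp2d_gt:  (L, 2) detected 2D keypoints K1
    depth_gt: (L,)   sensor depth d1 sampled at those keypoints
    """
    cam = verts @ R.T + T                          # camera-frame vertices
    proj = cam @ K.T
    kp2d = proj[kp_idx, :2] / proj[kp_idx, 2:3]    # projected keypoints K2
    kp_loss = ((kp2d - kp2d_gt) ** 2).sum(dim=1).mean()
    depth_loss = ((cam[kp_idx, 2] - depth_gt) ** 2).mean()
    return kp_loss + w_d * depth_loss
```

In the alternating scheme described above, this loss is minimized to convergence with the vertex-pixel correspondence held fixed, after which the correspondence is re-estimated and the optimization repeats.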