With the development of VR/AR, users want to animate their own faces in virtual reality.

3D scanners commonly used in industry have several limitations:

  • Time-consuming
  • Expensive
3D scanner in industry

Our Goal

Users can capture their 3D face using just their mobile phone!

Our pipeline: RGB-D video from selfie camera → 3D digital human heads

Proposed Pipeline


  • We work with real-world, in-the-wild iPhone data
    • High variance in video quality
    • Unknown camera extrinsics
  • How to make use of the depth information

iOS Data Capture App

We implemented our own app for data collection.

  • Records RGB-D data: video + depth + calibration
  • Once recording stops, the app streams the data to a server over Wi-Fi using the Multipeer Connectivity framework
Example of data
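
One common way to use a recorded depth frame together with the calibration is to back-project it into a camera-space point map. The snippet below is a minimal sketch of that idea, not our capture or processing code: it assumes a pinhole camera, depth in meters, and hypothetical intrinsics `fx`, `fy`, `cx`, `cy` standing in for the values stored in the calibration.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H, W, 3) camera-space point map."""
    h, w = depth.shape
    # u holds the column index and v the row index of every pixel
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Toy example (assumed values): a flat surface 0.5 m in front of the camera
depth = np.full((4, 6), 0.5)
points = backproject_depth(depth, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
```

The pixel at the principal point maps to a point on the optical axis, and pixels further from it map to points further off-axis at the same depth.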


2022 Fall





2022 Spring


first part https://drive.google.com/file/d/1Xbs-ezn7TitUDXqXEToQ_mWrEB1h9iJZ/view?usp=sharing

second part https://drive.google.com/file/d/1-AGCj0WPuxjLUbBPJuNWeZUO5h4FQx_O/view?usp=sharing


first part https://drive.google.com/file/d/1gGnIqkoMi9k6cs7uBKqjUUauEwAnz5HG/view?usp=sharing

second part https://drive.google.com/file/d/1ztgZd-gsYUKlzJQoQ80bo_zWM6L2Hczt/view?usp=sharing


first part https://drive.google.com/file/d/1fkScNZkyj77WXBn5i371UVkwyxIuTdJM/view?usp=sharing

second part https://drive.google.com/file/d/1MISerWYcwqJPxOq0RnYTB0K1m97oOuo0/view?usp=sharing


Matthew P. O’Toole (Advisor)

Matthew P. O’Toole is an assistant professor at the Robotics Institute, Carnegie Mellon University.

Homepage: https://www.cs.cmu.edu/~motoole2/

Chen Cao (Sponsor)

I am a Research Scientist at Reality Labs Pittsburgh. Previously, I was a Senior Research Scientist at Snap. I obtained my PhD from Zhejiang University (ZJU), supervised by Prof. Kun Zhou. I was a student member of the Graphics and Parallel Systems Lab. I received my B.Eng. degree from the College of Computer Science & Technology, Zhejiang University, in 2010. My research concentrates on computer graphics.

Yu Han

I am Yu Han, a student in the Master of Science in Computer Vision program at Carnegie Mellon University. I received my B.S. degree from Peking University, majoring in Computer Science and Technology. From 2019 to 2021, I worked as an intern at STRUCT, advised by Professor Jiaying Liu. From 2020 to 2021, I worked remotely with Prof. Jianbo Shi at the GRASP Lab, University of Pennsylvania. I have also spent some great time at Microsoft Research Asia.

My research interests include GANs, computer vision and computer graphics. My homepage is https://victoriahy.github.io/

Wenyu Xia

I am an M.S. in Computer Vision student at CMU. I graduated from Tsinghua University with a major in computer science. I am broadly interested in computer vision and computer graphics.

LinkedIn: linkedin.com/in/wenyu-xia




Key Points Detection

  • Face key points detection
    • Convolutional Pose Machine
  • Ear key points detection
    • YOLO + Convolutional Pose Machine
Left: face key points, right: ear key points
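
Convolutional Pose Machines predict one heatmap per key point, and when a detector such as YOLO first crops the ear, the decoded points have to be mapped back into the full frame. The snippet below is a minimal sketch of those two steps, assuming heatmaps of shape (K, H, W) and a box given as (x0, y0, x1, y1). The function names and the box values are hypothetical and are not taken from our code.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Return (K, 2) peak locations as (x, y) and (K,) peak scores."""
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)                 # strongest response per key point
    ys, xs = np.unravel_index(idx, (h, w))
    scores = flat[np.arange(k), idx]
    return np.stack([xs, ys], axis=1).astype(np.float64), scores

def crop_to_image(keypoints, box, heatmap_size):
    """Map heatmap coordinates inside a detector box back to full-image pixels."""
    x0, y0, x1, y1 = box
    hm_h, hm_w = heatmap_size
    scale = np.array([(x1 - x0) / hm_w, (y1 - y0) / hm_h])
    return keypoints * scale + np.array([x0, y0])

# Toy example with two key points on an 8x8 heatmap
heatmaps = np.zeros((2, 8, 8))
heatmaps[0, 2, 5] = 1.0   # key point 0 peaks at x=5, y=2
heatmaps[1, 6, 1] = 0.7   # key point 1 peaks at x=1, y=6
kps, scores = decode_heatmaps(heatmaps)

ear_box = (100.0, 40.0, 132.0, 104.0)  # hypothetical ear box from the detector
kps_img = crop_to_image(kps, ear_box, heatmaps.shape[1:])
```

The peak score can serve as a per-point confidence, for example to down-weight unreliable detections in later stages.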

3DMM Fitting


  • 2D keypoints K1 (including ears) detected on the RGB video by OpenPose and CPM, and ground-truth depth d1
  • 3DMM parameters (learned) + rotation R and translation T (learned) + camera intrinsics → 2D keypoints K2 and depth d2
  • Loss between K1 and K2, and between d1 and d2
  • Depth loss
    • Fix the correspondence between mesh vertices and pixels in the video
    • Iterate until the depth loss under the fixed correspondence converges, then update the correspondence; this increases the stability of the model
Blue key points: video, red key points: mesh
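
The alternating scheme above can be illustrated with a small NumPy toy. It is only a sketch under strong assumptions and not our fitting code: the mesh is a fixed set of vertices, only the translation T is optimized (no 3DMM coefficients or rotation), gradients come from finite differences with a backtracking step size, and `w_depth`, `outer`, `inner` and `tol` are hypothetical parameters. A real pipeline would typically use automatic differentiation over all parameters.

```python
import numpy as np

def project(vertices, R, T, K):
    """Map (N, 3) mesh vertices to (N, 2) pixel coordinates and (N,) camera depths."""
    cam = vertices @ R.T + T            # model space -> camera space
    uvw = cam @ K.T                     # apply pinhole intrinsics
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]

def update_correspondence(pixels, depth_shape):
    """Pair every vertex with its nearest pixel (row, col), clipped to the image."""
    h, w = depth_shape
    cols = np.clip(np.rint(pixels[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(pixels[:, 1]).astype(int), 0, h - 1)
    return rows, cols

def total_loss(T, vertices, R, K, kp_idx, K1, d1, corr, w_depth=1.0):
    """Keypoint loss (K1 vs K2) plus depth loss (d1 vs d2) under a fixed correspondence."""
    K2, d2 = project(vertices, R, T, K)
    kp_loss = np.mean(np.sum((K2[kp_idx] - K1) ** 2, axis=1))
    rows, cols = corr
    depth_loss = np.mean((d2 - d1[rows, cols]) ** 2)
    return kp_loss + w_depth * depth_loss

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient (stand-in for autodiff in this toy)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

def fit_translation(vertices, R, K, kp_idx, K1, d1, T0, outer=3, inner=50, tol=1e-10):
    T = T0.copy()
    for _ in range(outer):
        # Outer step: refresh the vertex-to-pixel correspondence, then freeze it
        corr = update_correspondence(project(vertices, R, T, K)[0], d1.shape)
        f = lambda t: total_loss(t, vertices, R, K, kp_idx, K1, d1, corr)
        for _ in range(inner):
            loss, grad, lr = f(T), numeric_grad(f, T), 1.0
            while lr > 1e-12 and not f(T - lr * grad) < loss:
                lr *= 0.5               # backtrack until the step lowers the loss
            if lr <= 1e-12 or loss - f(T - lr * grad) < tol:
                break                   # converged under this correspondence
            T = T - lr * grad
    return T

# Toy data (all values assumed): a flat "mesh" 0.5 m from the camera
rng = np.random.default_rng(0)
vertices = np.column_stack([rng.uniform(-0.05, 0.05, (30, 2)), np.zeros(30)])
R = np.eye(3)
K = np.array([[100.0, 0.0, 8.0], [0.0, 100.0, 8.0], [0.0, 0.0, 1.0]])
kp_idx = np.arange(5)                                   # vertices acting as keypoints
K1 = project(vertices, R, np.array([0.0, 0.0, 0.5]), K)[0][kp_idx]
d1 = np.full((16, 16), 0.5)                             # observed depth map
T0 = np.array([0.01, -0.01, 0.6])                       # perturbed initial guess
T_fit = fit_translation(vertices, R, K, kp_idx, K1, d1, T0)
```

Freezing the correspondence during the inner loop keeps the depth target fixed while the parameters move, which is the property the bullet list above relies on for stability.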