Fall ’20 Semester

Dataset

We have one capture that we train on and four captures for testing. The first row of the table shows the factors that are common between the training capture and each of the test captures. The second and third rows show how the appearance of the subject differs between the captures.

Proposed Methods

Metric Learning

Diagram for Metric Learning

This semester, we explored metric learning. The idea is to learn to transform the input images and the predicted texture into a generic feature space, trained so that the distance in this space between the current (3-view) code and the 11-view result is minimized. At inference, we use this transformation to ‘refine’ the code via gradient descent. This relies heavily on the assumption that 3 views carry enough information to recover the 11-view code. The approach is illustrated in the diagram above. We also want the loss to be quadratic in this space so that gradient descent converges quickly.
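A minimal sketch of this inference-time refinement is below, assuming a trained feature network `metric_net` (applied here, for simplicity, to both the IR images and the rendered texture) and a renderer `render` that maps an expression code to a texture; these names, and the use of a single shared network, are illustrative assumptions rather than the actual pipeline.

```python
import torch

def refine_code(code_3view, ir_images, metric_net, render, steps=3, lr=0.1):
    """Refine a 3-view expression code by gradient descent in the
    learned feature space, where the loss is (ideally) quadratic."""
    # Features of the input IR views stay fixed during refinement.
    feat_img = metric_net(ir_images).detach()
    code = code_3view.clone().requires_grad_(True)
    optimizer = torch.optim.SGD([code], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        feat_tex = metric_net(render(code))      # features of predicted texture
        z2 = (feat_img - feat_tex).pow(2).sum()  # squared feature-space distance
        z2.backward()
        optimizer.step()
    return code.detach()
```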

Mimicking the 11 View Landscape

In this method, instead of constraining the loss landscape to be quadratic, we train it to mimic the 11-view loss landscape. Gradient descent consequently takes more steps to converge to the correct expression; the resulting losses are reported in the Results section.
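A minimal sketch of the mimicking objective follows, assuming a hypothetical network `landscape_net` that predicts z for a code and a callable `recon_loss_11v` that evaluates the 11-view reconstruction loss; the target is the gap l − l_gt between the 11-view loss at a sampled code and at the ground-truth code, so that z² tracks the shape of the 11-view landscape.

```python
import torch
import torch.nn.functional as F

def mimic_step(landscape_net, recon_loss_11v, code_sample, code_gt, optimizer):
    """One training step: fit z^2 to the 11-view loss gap l - l_gt."""
    with torch.no_grad():
        # Gap between the 11-view reconstruction loss at the sampled
        # code and at the ground-truth code.
        target = recon_loss_11v(code_sample) - recon_loss_11v(code_gt)
    z = landscape_net(code_sample)             # learned landscape value
    loss = F.mse_loss(z.pow(2).sum(), target)  # match z^2 to l - l_gt
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```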

Results

Metric Learning

In this method, we learn a feature space in which the loss landscape is quadratic, so that a few steps of gradient descent recover the correct expression. The results are as follows.

The first row is the set of 3 input IR images. The next row has 5 rendered avatars. The leftmost represents the prediction obtained by feeding in just the 3 views; the rightmost is the prediction obtained using all 11 views (which can be considered the ground truth). The 3 avatars between them represent the 3 steps of gradient descent on the learnt landscape. Notice the 3 values in white: z2 is the square of the expression loss, l is the value of the reconstruction loss, and l-lgt is the difference between the reconstruction loss at the current point and at the ground truth. Hence, if our landscape has formed the way we intended, z2 should ideally be close to l-lgt.

The graph on the right is a visualization of the terrain: we simply plot the values of the reconstruction loss at 30 points between the 3-view prediction and the 11-view ground truth. Ideally, we would like it to look like one half of a quadratic, with the ground truth lying at the minimum.
NOTE: If you do not see the video, please download it and play it locally.
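The terrain plot can be reproduced with a short sketch like the one below, assuming a callable `recon_loss` that renders a code and compares it against the captured views (the name is hypothetical).

```python
import torch
import matplotlib.pyplot as plt

def plot_terrain(code_3v, code_11v, recon_loss, n_points=30):
    """Plot the reconstruction loss along the line between the two codes."""
    alphas = torch.linspace(0.0, 1.0, n_points)
    losses = [recon_loss((1 - a) * code_3v + a * code_11v).item()
              for a in alphas]
    plt.plot(alphas.numpy(), losses)
    plt.xlabel("interpolation: 3-view prediction to 11-view ground truth")
    plt.ylabel("reconstruction loss")
    plt.show()
```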

Mimicking the 11 View Landscape

In this method, instead of constraining the loss landscape to be quadratic, we mimic the 11-view loss landscape. Gradient descent now takes more steps to reach the correct expression, and the results are as follows.

The first row is the set of 3 input IR images. The next row has 3 rendered avatars. The leftmost represents the prediction obtained by feeding in just the 3 views; the rightmost is the prediction obtained using all 11 views (which can be considered the ground truth). The avatar in between shows the refinement with each step of gradient descent. We can clearly see how our method refines the expression: the mouth goes from being less open to more open.