Fall 2020

Slides

Mid-Semester-1 Presentation Slides (sensitive; not published)

Mid-Semester-2 Presentation Slides (sensitive; not published)

Final Presentation Slides (sensitive; not published)

Videos

Final Presentation Video (sensitive; not published)

Method

Our proposed pipeline consists of two main modules: a geometric manipulation module and a semantic manipulation module. The geometric manipulation module warps the feature maps of the input image, guided by estimated 3D geometry. Moving the manipulation from image space to feature space reduces artifacts, and the 3D geometry introduced here helps handle extreme head poses and lighting conditions. The warped feature maps are then fed to the semantic manipulation module, which synthesizes information that is not present in the input image.

Geometric Manipulation Module

Figure 1. Geometric Manipulation Module.

As shown in Figure 1, given an input image and the FACS code of the driving image as the target expression, we first feed the input image into a CNN encoder to extract its feature map. We also estimate the 3D geometry of the input image, which is then sent to the 3D geometry module to produce an estimated 3D geometry of the target image conditioned on the target FACS code. Next, we calculate an optical flow between the input image and the target image from their estimated 3D geometries. Finally, this optical flow is used to warp the feature map of the input image.
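The flow-based warping step can be sketched in PyTorch as below. The function name, the pixel-displacement flow convention, and the use of `grid_sample` are assumptions for illustration; the authors' implementation details are not given here.

```python
import torch
import torch.nn.functional as F

def warp_feature_map(feat, flow):
    """Warp an encoder feature map with a dense optical flow field.

    feat: (B, C, H, W) feature map from the CNN encoder.
    flow: (B, 2, H, W) flow in pixels, giving where each output location
          samples from in the input feature map (x displacement first).
    """
    b, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates, shape (1, 2, H, W).
    xs = torch.arange(w).view(1, 1, w).expand(1, h, w)
    ys = torch.arange(h).view(1, h, 1).expand(1, h, w)
    grid = torch.stack((xs, ys), dim=1).float()
    # Displace the grid by the flow, then normalize to [-1, 1] for grid_sample.
    coords = grid + flow
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat, sample_grid, align_corners=True)
```

With zero flow this reduces to an identity warp; out-of-bounds samples fall back to `grid_sample`'s zero padding.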

Semantic Manipulation Module

Figure 2. Semantic Manipulation Module.

Besides the estimated 3D geometry of the target image, the 3D geometry module also outputs an inpainting mask that locates the areas whose content is not present in the input image and therefore needs to be synthesized (e.g. wrinkles and teeth). The semantic manipulation module first masks the warped feature map with the inpainting mask and then fuses the masked feature map with the target FACS code. After fusing, the feature map is fed to a CNN decoder to produce the generated image.
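The mask-then-fuse step might look like the following sketch, where the FACS code is broadcast spatially and fused with a 1x1 convolution. The class name, channel counts, and fusion layer are assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class SemanticManipulation(nn.Module):
    """Sketch of the mask-then-fuse step (layer choices are assumptions)."""

    def __init__(self, feat_ch=64, n_aus=17):
        super().__init__()
        # 1x1 conv fuses the masked features with the broadcast FACS code.
        self.fuse = nn.Conv2d(feat_ch + n_aus, feat_ch, kernel_size=1)

    def forward(self, warped_feat, inpaint_mask, facs):
        # warped_feat: (B, C, H, W); inpaint_mask: (B, 1, H, W) in [0, 1];
        # facs: (B, n_aus) target AU intensities.
        b, _, h, w = warped_feat.shape
        masked = warped_feat * (1.0 - inpaint_mask)  # zero out regions to inpaint
        # Tile the FACS code over the spatial dimensions and concatenate.
        facs_map = facs[:, :, None, None].expand(b, facs.shape[1], h, w)
        fused = self.fuse(torch.cat([masked, facs_map], dim=1))
        return fused  # fed to the CNN decoder in the full pipeline
```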

Supervision

Our final loss function is the combination of a perceptual loss, a geometry loss, and a GAN loss, as shown below. I denotes an image, P denotes 3D geometry, and D denotes the discriminator.
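The loss equation appears as an image in the original page. A plausible form consistent with the description, with the weights λ and the exact per-term definitions as assumptions, is:

```latex
\mathcal{L} \;=\; \lambda_{\mathrm{per}}\,\mathcal{L}_{\mathrm{per}}\!\left(I_{\mathrm{gen}}, I_{\mathrm{target}}\right)
\;+\; \lambda_{\mathrm{geo}}\,\mathcal{L}_{\mathrm{geo}}\!\left(P_{\mathrm{est}}, P_{\mathrm{target}}\right)
\;+\; \lambda_{\mathrm{GAN}}\,\mathcal{L}_{\mathrm{GAN}}\!\left(D, I_{\mathrm{gen}}\right)
```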

Qualitative illustration

Figure 3. An illustration of how each part of the pipeline works. The columns from left to right are: bar chart of source and target FACS, input image, estimated 3D geometry of the input, driving image, estimated 3D geometry of the driving image, manipulated (estimated target) 3D geometry, optical flow, image after warping, image after fusing, image after inpainting (final generated image), and inpainting mask.

Individual AU Editing result

Figure 4. Individual AU editing examples. Each time we change the intensity of only one AU in the target expression while keeping the others the same.

Qualitative comparison with GANimation

Figure 5. Qualitative comparison examples with GANimation. Our model generates images with fewer artifacts and is more robust to head pose changes.

Project Overview

Semantic Facial Image Manipulation is a conditional generation task. The goal is to synthesize a facial image conditioned on both identity and target expression (in our setting, the identity is provided as a 2D image and the target expression as a FACS code).

Figure 1. Facial Action Coding System (FACS) is a standard representation of facial expression in behavioral science [2]. It breaks down facial expressions into individual components of muscle movement, called Action Units (AUs). Each expression is encoded as a vector: each component of the vector corresponds to one AU, and its magnitude represents that AU's intensity. This way of encoding facial expressions is identity-agnostic and interpretable.
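As a concrete illustration, a FACS code can be represented as a fixed-length vector of AU intensities. The AU ordering below is a made-up assumption for this sketch; the combination AU6 (cheek raiser) + AU12 (lip corner puller) is the classic FACS encoding of a smile.

```python
import numpy as np

# Hypothetical ordering of a few common Action Units in the code vector.
AU_INDEX = {"AU1": 0, "AU2": 1, "AU4": 2, "AU6": 3, "AU12": 4, "AU25": 5}

def facs_vector(active_aus, n_aus=len(AU_INDEX)):
    """Build a FACS code vector from {AU name: intensity in [0, 1]}."""
    code = np.zeros(n_aus, dtype=np.float32)
    for au, intensity in active_aus.items():
        code[AU_INDEX[au]] = intensity
    return code

# A smile combines AU6 (cheek raiser) and AU12 (lip corner puller).
smile = facs_vector({"AU6": 0.6, "AU12": 0.9})
```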

Social Motivation

Fujitsu is interested in using FACS-based expression analysis to study viewers' behavior for advertising and to provide feedback for video conferences, remote training, etc. But FACS-annotated data is limited in amount and often suffers from a skewed distribution: some expressions and Action Units are much more frequent than others. Facial expression manipulation can generate FACS-annotated data while controlling the distribution of the generated data (i.e., we can generate the expressions and Action Units we want).

Technical Motivation

2D-based methods like GANimation [1] manipulate faces in 2D image space, which often results in artifacts in the generated images and failures under extreme head poses and lighting conditions.

Proposed Pipeline

We propose an encode-manipulate-decode pipeline consisting of two main modules: a geometric manipulation module and a semantic manipulation module. The geometric manipulation module warps the feature maps of the input image, guided by estimated 3D geometry. Moving the manipulation from image space to feature space reduces artifacts, and the 3D geometry introduced here helps handle extreme head poses and lighting conditions. The warped feature maps are then fed to the semantic manipulation module, which synthesizes information that is not present in the input image.

Figure 2. An illustration of our proposed pipeline. It consists of two main parts: the geometric manipulation module and the semantic manipulation module.
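The encode-manipulate-decode flow described above can be sketched as a single forward pass. All module names and signatures here are assumptions used to show the data flow, not the authors' actual interfaces; any callables with matching shapes can be plugged in.

```python
def encode_manipulate_decode(image, target_facs, modules):
    """High-level data flow of the proposed pipeline (module names are assumptions)."""
    feat = modules["encoder"](image)                     # CNN encoder feature map
    src_geo = modules["estimate_geometry"](image)        # 3D geometry of the input
    tgt_geo = modules["geometry"](src_geo, target_facs)  # target geometry from FACS code
    flow = modules["flow"](src_geo, tgt_geo)             # optical flow between geometries
    warped = modules["warp"](feat, flow)                 # geometric manipulation
    mask = modules["inpaint_mask"](tgt_geo)              # regions to synthesize
    fused = modules["semantic"](warped, mask, target_facs)
    return modules["decoder"](fused)                     # CNN decoder -> generated image
```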

References

[1] Pumarola et al. GANimation: Anatomically-aware Facial Animation from a Single Image. ECCV 2018.
[2] Ekman, Paul, Wallace V. Friesen, and Joseph C. Hager. "Facial Action Coding System: The Manual on CD ROM." A Human Face, Salt Lake City (2002): 77-254.