What is the motivation of our project?
People often unconsciously use their hands to express themselves during conversation, because hands are surprisingly important for conveying information in communication. Although Facebook Reality Lab can now generate realistic facial reconstructions for virtual communication, it would be very beneficial to reconstruct hands as well: a talking head accompanied by two hands looks much more natural than a floating head alone.
Hand tracking is also worth exploring because it is more technically challenging than body or face tracking, due to heavy occlusion and deformation.
Our task is defined as multiview 3D pose estimation. The internal dataset Facebook Reality Lab provided us is called InterHand, which consists of sequences of hand images captured by multiple cameras from different viewpoints. To further demonstrate that our method is effective, we also conducted experiments on Human3.6M.
We first extract features and detect keypoints in the 2D images. With the camera information, we lift the 2D keypoints to 3D. To further improve the 2D features, we propose the "Epipolar Transformer", which leverages epipolar geometry.
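To make the lifting step concrete: with known camera projection matrices, 2D keypoints from multiple views can be lifted to 3D by standard Direct Linear Transform (DLT) triangulation. Below is a minimal sketch of that idea (not our actual pipeline); the function names and the synthetic two-camera setup are purely illustrative.

```python
import numpy as np

def triangulate_point(projs, points_2d):
    """Triangulate one 3D point from its 2D detections in multiple
    calibrated views via the Direct Linear Transform (DLT).

    projs:     list of 3x4 projection matrices P = K [R | t]
    points_2d: list of (u, v) pixel coordinates, one per view
    """
    rows = []
    for P, (u, v) in zip(projs, points_2d):
        # Each view contributes two linear constraints on the
        # homogeneous 3D point X, e.g. u * (P_3 . X) = P_1 . X
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Least-squares solution: right singular vector of A with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize

def project(P, X):
    """Project a 3D point with a 3x4 camera matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Synthetic sanity check: two cameras observing a known 3D point.
X_true = np.array([0.2, -0.1, 3.0])
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
t2 = np.array([[-0.5], [0.0], [0.0]])  # second camera translated
P2 = K @ np.hstack([np.eye(3), t2])

uv1, uv2 = project(P1, X_true), project(P2, X_true)
X_est = triangulate_point([P1, P2], [uv1, uv2])
```

With noiseless detections the DLT recovers the point exactly; with noisy 2D keypoints it gives the algebraic least-squares estimate, which is why improving the 2D features (as the Epipolar Transformer does) directly improves the lifted 3D poses.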
Here is a 2-minute video summarizing our method and results.