Task Formulation

Given a video clip depicting a hand interacting with a rigid object, we aim to infer the underlying 3D structure of the interaction, i.e.:

  • 3D shape of the object (ϕ)
  • Shape of the hand (β)
  • Per-frame camera intrinsics (K_t)
  • Per-frame hand articulation (θ_t^A)
  • Per-frame object pose (T_h→o)
  • Per-frame camera pose (T_c→h)
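To make the roles of these variables concrete, the sketch below composes the listed transforms to project an object-frame point into the image. The direction convention is an assumption for illustration: T_h→o is taken to map hand-frame coordinates to object-frame coordinates, and T_c→h to map camera-frame coordinates to hand-frame coordinates, so both are inverted to go object → hand → camera; the function names are hypothetical.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 rigid transform from a 3x3 rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def project(K, T_c_to_h, T_h_to_o, x_obj):
    """Project a 3D point given in the object frame to pixel coordinates.

    Assumed convention: T_h_to_o maps hand -> object coordinates and
    T_c_to_h maps camera -> hand coordinates, so we invert both to map
    the point object -> hand -> camera before applying the intrinsics K.
    """
    x = np.append(x_obj, 1.0)                 # homogeneous coordinates
    x_hand = np.linalg.inv(T_h_to_o) @ x      # object frame -> hand frame
    x_cam = np.linalg.inv(T_c_to_h) @ x_hand  # hand frame -> camera frame
    uvw = K @ x_cam[:3]                       # pinhole projection
    return uvw[:2] / uvw[2]                   # perspective divide
```

With identity poses, a point one unit in front of the camera projects to the principal point, which is a quick sanity check on the conventions.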

Idea: The Hand as a Visual Cue for Object Shape

The hand can be thought of as a special kind of occluder, for two reasons:

  • Hand pose is predictive of the object’s shape. For example, when we pinch our fingers together, we are very likely holding a thin object such as a pen, nail, or brush.
  • Current hand pose reconstruction systems are quite robust to occlusion, so their predictions can often be trusted.

Given an image (a single frame), we therefore optimize two things jointly for a good reconstruction: i) estimate the underlying hand pose, and ii) infer the object shape in a normalized hand-centric coordinate frame. We anticipate that explicit articulation conditioning will help this inference. Extending to the dynamic setting, we model the scene as a neural field and fit a prior over the set of plausible object shapes.
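One minimal way to realize articulation conditioning is an implicit shape network that takes a query point in the hand-centric frame together with the articulation parameters θ and returns a signed distance. The sketch below is a toy version with random weights (a real system would optimize them against image evidence); the class name and the MANO-like 45-dimensional θ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

class HandConditionedSDF:
    """Toy implicit shape: (3D point in hand-centric frame, articulation
    theta) -> signed distance. Conditioning on theta lets the predicted
    shape vary with the grasp (e.g. a pinch suggests a thin object)."""

    def __init__(self, theta_dim=45, hidden=64):
        # Random initialization; in practice fit to observations.
        self.W1 = rng.standard_normal((hidden, 3 + theta_dim)) * 0.1
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((1, hidden)) * 0.1
        self.b2 = np.zeros(1)

    def __call__(self, x, theta):
        inp = np.concatenate([x, theta])          # condition on articulation
        h = np.maximum(self.W1 @ inp + self.b1, 0.0)  # ReLU hidden layer
        return (self.W2 @ h + self.b2)[0]         # scalar SDF value
```

Because the query point lives in a normalized hand-centric frame, the same network can be reused across frames as the hand and camera move.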

Given this model, we can obtain a per-frame prediction of the interaction, and since the NeRF corresponds to a single valid hand-object pair, we expect the per-frame predictions to be consistent.
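The consistency argument can be sketched as a loss in which one shared shape must explain the evidence from every frame. The function below is a hypothetical stand-in: each frame carries surface samples already mapped into the hand-centric frame (assumed preprocessing), and a single shared SDF is penalized for deviating from zero on all of them.

```python
import numpy as np

def multi_frame_loss(shape_fn, frames):
    """Sum a per-frame data term while sharing one object shape.

    shape_fn: the single shared object representation (e.g. an SDF).
    frames: list of dicts; each frame's "surface_points" are samples
    mapped into the hand-centric frame (hypothetical preprocessing).
    The shared shape must vanish on surface samples from every frame,
    which is what ties the per-frame predictions together.
    """
    loss = 0.0
    for frame in frames:
        for x in frame["surface_points"]:
            loss += shape_fn(x) ** 2  # one shape explains all frames
    return loss
```

For example, a unit-sphere SDF incurs zero loss on frames whose samples all lie on the unit sphere, while any frame inconsistent with that shape would raise the loss.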