Results

We demonstrate our GPT-4o-based language annotation of hand-object interaction (HOI) motion sequences. The following example shows a frame-by-frame annotation of a pouring motion involving a milk carton. For each key frame, the model predicts both the grasp type and a detailed natural language description of the interaction.

Figure 1. Frame-by-frame annotations generated by GPT-4o for a pouring sequence involving a milk carton. For each key frame, the model infers (1) the grasp type and (2) a detailed natural language description of the hand’s role and intent. Additionally, a summary description is generated to capture the overall hand-object interaction across the entire sequence, including grasp transitions and the temporal structure of the action.

The annotations capture how the right hand transitions from a stable power grasp to a relaxed, functional grasp as the pouring motion progresses, while the left hand remains uninvolved throughout the sequence. This example illustrates the model’s ability to infer nuanced intent, contact style, and the distribution of roles between the hands over time.
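To make the annotation structure concrete, the sketch below shows one way the per-key-frame outputs (grasp type and description) and the sequence-level summary could be stored, and how grasp transitions could be read off from them. It is a minimal illustration, not the authors' actual format: the class names, field names, and the `grasp_transitions` helper are all assumptions, and the example values are placeholders rather than real model outputs.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FrameAnnotation:
    """Annotation for one key frame (hypothetical schema)."""
    frame_index: int
    grasp_type: str    # e.g. "power grasp"
    description: str   # natural language description of the hand's role and intent


@dataclass
class SequenceAnnotation:
    """Key-frame annotations plus a sequence-level summary (hypothetical schema)."""
    frames: List[FrameAnnotation] = field(default_factory=list)
    summary: str = ""

    def grasp_transitions(self) -> List[Tuple[int, str, str]]:
        """Return (frame_index, previous_grasp, new_grasp) wherever the grasp type changes."""
        ordered = sorted(self.frames, key=lambda f: f.frame_index)
        return [
            (cur.frame_index, prev.grasp_type, cur.grasp_type)
            for prev, cur in zip(ordered, ordered[1:])
            if cur.grasp_type != prev.grasp_type
        ]


# Placeholder values for illustration only; these are not actual model outputs.
example = SequenceAnnotation(
    frames=[
        FrameAnnotation(0, "power grasp", "Right hand holds the carton firmly."),
        FrameAnnotation(1, "power grasp", "Right hand begins to tilt the carton."),
        FrameAnnotation(2, "functional grasp", "Right hand relaxes while pouring."),
    ],
    summary="Right hand pours from a milk carton; left hand is not involved.",
)

print(example.grasp_transitions())
```

Keeping the grasp type as a separate field from the free-text description makes transitions easy to detect programmatically, while the summary field holds the sequence-level description.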