We demonstrate our GPT-4o-based Language Annotation for HOI Motion Sequences on an example: a frame-by-frame annotation of a hand pouring from a milk carton. For each key frame, our model predicts both the grasp type and a detailed natural-language description of the interaction.

The sequence captures how the right hand transitions from a stable power grasp to a relaxed, functional grasp as the pouring motion progresses, while the left hand remains uninvolved throughout. This example illustrates the model’s ability to infer intent, contact style, and the division of roles between the hands over time.
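
To make the structure of such an annotation concrete, the sketch below shows one way the per-key-frame output could be stored and how a grasp transition like the one described above could be located. It is an illustration under stated assumptions, not the authors' data format: the `KeyFrameAnnotation` class, its field names, the `grasp_transitions` helper, the label strings, and the frame indices are all hypothetical, and the descriptions are placeholders rather than model output.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class KeyFrameAnnotation:
    """Hypothetical record for one annotated key frame."""
    frame_index: int
    right_grasp: Optional[str]  # None means the hand is not involved
    left_grasp: Optional[str]
    description: str            # free-text description of the interaction


def grasp_transitions(
    annotations: List[KeyFrameAnnotation], hand: str = "right"
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Return (frame_index, previous_grasp, new_grasp) for each label change."""
    ordered = sorted(annotations, key=lambda a: a.frame_index)
    transitions = []
    for prev, curr in zip(ordered, ordered[1:]):
        before = getattr(prev, f"{hand}_grasp")
        after = getattr(curr, f"{hand}_grasp")
        if before != after:
            transitions.append((curr.frame_index, before, after))
    return transitions


# Placeholder values for illustration only; not taken from the actual results.
example = [
    KeyFrameAnnotation(0, "power", None, "<description for key frame 0>"),
    KeyFrameAnnotation(12, "power", None, "<description for key frame 12>"),
    KeyFrameAnnotation(24, "functional", None, "<description for key frame 24>"),
]

print(grasp_transitions(example))          # [(24, 'power', 'functional')]
print(grasp_transitions(example, "left"))  # []
```

Keeping a separate grasp field per hand, with `None` for an uninvolved hand, lets the role of each hand be read directly from the record instead of being parsed out of the free-text description.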
