Multimodal Autoregressive Model
We aim to fine-tune a language model to generate hand-object interaction (HOI) motion sequences by learning a shared token vocabulary across modalities. Given a textual prompt, the model predicts HOI tokens autoregressively, operating over a unified sequence of text and motion tokens. This process is illustrated in Fig. 1: the model consumes both text and HOI tokens and generates plausible interaction sequences token by token.
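To make the unified-vocabulary formulation concrete, the following is a minimal sketch of autoregressive decoding over a shared text/HOI token space. The vocabulary sizes, special tokens (BOS_MOTION, EOS_MOTION), and the toy decoder-only transformer are illustrative assumptions, not the architecture or hyperparameters used in our model.

```python
# Sketch: greedy autoregressive generation over a shared text + HOI vocabulary.
# All sizes and the stand-in model are assumptions for illustration only.
import torch
import torch.nn as nn

V_TEXT, V_MOTION = 32000, 512           # assumed text / HOI codebook sizes
BOS_MOTION = V_TEXT + V_MOTION          # assumed special token: start of motion
EOS_MOTION = BOS_MOTION + 1             # assumed special token: end of motion
VOCAB = EOS_MOTION + 1                  # unified vocabulary size

class TinyCausalLM(nn.Module):
    """Stand-in decoder-only transformer over the unified vocabulary."""
    def __init__(self, d=256, n_layers=4, n_heads=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d)
        self.pos = nn.Embedding(max_len, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, ids):                           # ids: (B, T)
        T = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(ids.device)
        return self.head(self.blocks(x, mask=mask))   # (B, T, VOCAB)

@torch.no_grad()
def generate_hoi(model, prompt_ids, max_new=128):
    """Greedy decoding: start from the text prompt, emit HOI tokens until EOS_MOTION."""
    ids = torch.cat([prompt_ids, torch.tensor([[BOS_MOTION]])], dim=1)
    for _ in range(max_new):
        logits = model(ids)[:, -1]                    # next-token logits
        nxt = logits.argmax(-1, keepdim=True)
        ids = torch.cat([ids, nxt], dim=1)
        if nxt.item() == EOS_MOTION:
            break
    return ids[:, prompt_ids.size(1) + 1:]            # generated HOI tokens only
```

In practice the HOI token ids produced here would be mapped back to motion by the tokenizer's decoder described next.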

HOI Tokenizer
To represent HOI motion as discrete tokens, we design a VQ-based tokenizer that compresses hand-object sequences into a learnable codebook. The encoder extracts latent features from the input, which are then quantized using the codebook. The decoder reconstructs the motion from these discrete codes. As shown in Fig. 2, this enables us to transform complex spatiotemporal motion data into a compact tokenized form suitable for autoregressive generation.
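The sketch below illustrates the encode-quantize-decode pattern described above with a standard VQ-VAE-style module (nearest-neighbor codebook lookup, straight-through estimator, codebook and commitment losses). The per-frame feature dimension, codebook size, and convolutional layers are assumptions; they are not our tokenizer's actual configuration.

```python
# Sketch: VQ-based HOI tokenizer (encoder -> codebook lookup -> decoder).
# Dimensions and architecture are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HOIVQTokenizer(nn.Module):
    def __init__(self, in_dim=99, latent_dim=256, codebook_size=512, beta=0.25):
        super().__init__()
        # Temporal encoder/decoder over per-frame hand-object features (B, in_dim, T).
        self.encoder = nn.Sequential(
            nn.Conv1d(in_dim, latent_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(latent_dim, latent_dim, 3, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(latent_dim, latent_dim, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(latent_dim, in_dim, 4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.beta = beta  # commitment loss weight

    def quantize(self, z):
        # z: (B, C, T') -> nearest codebook entries -> token ids + quantized latents.
        z_flat = z.permute(0, 2, 1).reshape(-1, z.size(1))       # (B*T', C)
        dists = torch.cdist(z_flat, self.codebook.weight)        # (B*T', K)
        ids = dists.argmin(dim=1)                                # discrete HOI tokens
        z_q = self.codebook(ids).view(z.size(0), z.size(2), -1).permute(0, 2, 1)
        return ids.view(z.size(0), -1), z_q

    def forward(self, x):
        z = self.encoder(x)
        ids, z_q = self.quantize(z)
        # Straight-through estimator so gradients flow back to the encoder.
        z_st = z + (z_q - z).detach()
        recon = self.decoder(z_st)
        loss = (F.mse_loss(recon, x)
                + F.mse_loss(z_q, z.detach())                # codebook loss
                + self.beta * F.mse_loss(z, z_q.detach()))   # commitment loss
        return recon, ids, loss
```

The `ids` returned here play the role of the HOI tokens consumed and produced by the autoregressive model in the previous subsection.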

Language Annotation for HOI Motion Sequences
Since existing HOI datasets lack rich textual descriptions and explicit HOI annotations, we introduce a language annotation pipeline. We leverage a multimodal LLM to infer frame-level descriptions, grasp types, and contact details. As shown in Fig. 3, the LLM takes hand-object motion as input and generates detailed annotations, including object category, contact fingers, and intent. This allows us to construct aligned text-motion pairs for training and evaluation.
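For concreteness, the following sketch outlines the shape of such an annotation step: serialize precomputed cues from a clip into a prompt, query an LLM, and parse its structured output into a text-motion pair. The prompt wording, JSON schema, and the `query_llm` stand-in are hypothetical and not the exact pipeline or model we use.

```python
# Sketch: structured annotation of an HOI clip via an LLM.
# Schema, prompt, and `query_llm` are illustrative assumptions.
import json
from dataclasses import dataclass

@dataclass
class HOIAnnotation:
    description: str        # natural-language description of the interaction
    object_category: str    # e.g. "mug"
    grasp_type: str         # e.g. "power grasp"
    contact_fingers: list   # e.g. ["thumb", "index", "middle"]
    intent: str             # e.g. "lift the mug to drink"

def build_prompt(clip_summary: dict) -> str:
    """clip_summary holds precomputed cues (object label, per-frame contact flags, ...)."""
    return (
        "You are given a hand-object interaction clip.\n"
        f"Object label: {clip_summary['object_label']}\n"
        f"Per-frame contact flags: {clip_summary['contact_flags']}\n"
        "Return a JSON object with keys: description, object_category, "
        "grasp_type, contact_fingers, intent."
    )

def query_llm(prompt: str) -> str:
    """Stand-in for a call to a multimodal LLM API (hypothetical)."""
    raise NotImplementedError("Plug in your LLM client here.")

def annotate_clip(clip_summary: dict) -> HOIAnnotation:
    raw = query_llm(build_prompt(clip_summary))
    fields = json.loads(raw)            # assumes the LLM returns valid JSON
    return HOIAnnotation(
        description=fields["description"],
        object_category=fields["object_category"],
        grasp_type=fields["grasp_type"],
        contact_fingers=fields["contact_fingers"],
        intent=fields["intent"],
    )
```

Each resulting annotation is paired with the corresponding tokenized motion clip to form the aligned text-motion data used for training and evaluation.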

