Building on the success of autoregressive models in text-driven human motion generation [1, 2], we aim to develop an autoregressive model for hand-object interaction (HOI) motion generation. Generating HOI motion requires multimodal modeling, and recent work [3] has shown that fine-tuned LLMs can effectively learn shared vocabularies across modalities. Inspired by [4], we propose a language annotation pipeline for HOI motion sequences, addressing the lack of existing datasets that provide both rich textual descriptions and explicit hand-object interaction pairs.
