Building on recent advances in motion representation learning, we study hand–object interaction (HOI) motion generation through continuous latent modeling. To this end, we propose a cascaded autoencoder that jointly encodes object motion and articulated hand dynamics into structured continuous latent spaces, and we demonstrate that these continuous motion latents yield smoother and more faithful reconstructions of both hand and object trajectories than discrete VQ-based representations. Leveraging these representations, we introduce the first latent diffusion framework for HOI motion generation and completion under multimodal conditioning, enabling flexible synthesis and partial motion inference across diverse interaction scenarios.
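
To make the cascaded design concrete, the sketch below shows one plausible PyTorch realization: an object branch encodes the object trajectory into a continuous latent, and a hand branch encodes articulated hand motion conditioned on that object latent, producing the structured latents (`z_obj`, `z_hand`) on which a latent diffusion model could then operate. All dimensions, module names, and the choice of simple MLP encoders are hypothetical illustrations, not the paper's actual architecture.

```python
# Minimal sketch of a cascaded HOI autoencoder, assuming PyTorch.
# Dimensions are illustrative: 9-D object pose (6D rotation + 3D translation)
# and a 99-D MANO-style hand pose per frame are assumptions, not paper specs.
import torch
import torch.nn as nn

class CascadedHOIAutoencoder(nn.Module):
    """Stage 1 encodes object motion; stage 2 encodes articulated hand
    dynamics conditioned on the object latent. Both latents are continuous
    (no vector quantization), matching the abstract's design choice."""

    def __init__(self, obj_dim=9, hand_dim=99, latent_dim=64):
        super().__init__()
        # Object branch: per-frame object pose -> continuous object latent.
        self.obj_enc = nn.Sequential(
            nn.Linear(obj_dim, 256), nn.GELU(), nn.Linear(256, latent_dim))
        self.obj_dec = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.GELU(), nn.Linear(256, obj_dim))
        # Hand branch: hand pose concatenated with the object latent, so the
        # hand representation is structured around the object's motion.
        self.hand_enc = nn.Sequential(
            nn.Linear(hand_dim + latent_dim, 256), nn.GELU(),
            nn.Linear(256, latent_dim))
        self.hand_dec = nn.Sequential(
            nn.Linear(latent_dim + latent_dim, 256), nn.GELU(),
            nn.Linear(256, hand_dim))

    def forward(self, obj_motion, hand_motion):
        # obj_motion: (B, T, obj_dim); hand_motion: (B, T, hand_dim)
        z_obj = self.obj_enc(obj_motion)                        # (B, T, latent)
        obj_rec = self.obj_dec(z_obj)
        z_hand = self.hand_enc(torch.cat([hand_motion, z_obj], dim=-1))
        hand_rec = self.hand_dec(torch.cat([z_hand, z_obj], dim=-1))
        return obj_rec, hand_rec, z_obj, z_hand

# Usage on dummy data: two clips of 120 frames each.
model = CascadedHOIAutoencoder()
obj = torch.randn(2, 120, 9)
hand = torch.randn(2, 120, 99)
obj_rec, hand_rec, z_obj, z_hand = model(obj, hand)
```

Under these assumptions, reconstruction losses on `obj_rec` and `hand_rec` train the autoencoder, after which a diffusion model would be fit in the `(z_obj, z_hand)` space with multimodal conditioning; for completion, known latent frames can be kept fixed while missing ones are denoised.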
