Method

Cascaded Autoencoder for Smooth HOI Latents

As illustrated in Figure 1, we introduce a cascaded autoencoder that learns continuous latent representations for hand–object interaction motion by separately encoding object motion and articulated hand trajectories while preserving their interaction structure. This design enables the latent space to capture fine-grained temporal dynamics and avoids the quantization artifacts commonly introduced by discrete VQ-based models. To evaluate motion smoothness, we perform a jerk analysis (Figure 2), where jerk is defined as the third derivative of position with respect to time. As shown across multiple temporal window lengths, our autoencoder consistently yields lower jerk values than VQ-based baselines, indicating smoother, more natural, and temporally stable HOI reconstructions.
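The jerk metric above can be estimated from sampled trajectories by repeated finite differencing. The following minimal sketch (function name, frame rate, and trajectory layout are illustrative assumptions, not details from the paper) computes the mean jerk magnitude of a `(T, D)` position sequence:

```python
# Sketch: finite-difference jerk estimation for a sampled motion trajectory.
# Assumptions (not from the paper): 30 fps sampling, trajectory shaped (T, D).
import numpy as np

def mean_jerk(positions: np.ndarray, dt: float = 1.0 / 30.0) -> float:
    """Mean magnitude of jerk, the third time derivative of position."""
    velocity = np.gradient(positions, dt, axis=0)       # first derivative
    acceleration = np.gradient(velocity, dt, axis=0)    # second derivative
    jerk = np.gradient(acceleration, dt, axis=0)        # third derivative
    return float(np.linalg.norm(jerk, axis=1).mean())
```

Lower values of this statistic over a temporal window indicate smoother motion; the analysis in Figure 2 reports it across several window lengths.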

Figure 1. Our cascaded autoencoder separately encodes object motion and articulated hand trajectories into continuous latent spaces, which are jointly decoded to reconstruct coherent hand–object interaction motions.
Figure 2. We compare jerk profiles (third derivative of position) between our autoencoder and VQ-based models across different temporal window lengths, showing consistently lower jerk values for our approach, indicating smoother and more temporally stable reconstructions.

Latent Diffusion for HOI Motion Generation

We introduce a latent diffusion framework for hand–object interaction motion generation that combines the strengths of diffusion and autoregressive modeling. As shown in Figure 3, an autoregressive transformer models the compositional structure of HOI motion in latent space, while a diffusion module refines per-token distributions to produce smooth and expressive motion trajectories. This design preserves the flexibility of autoregressive generation for arbitrary-length sequences while enabling continuous, high-fidelity motion synthesis under multimodal conditioning such as language.
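The per-token refinement step can be illustrated with a toy DDPM-style reverse process: the autoregressive transformer supplies a context vector for the next token, and a small denoising loop samples that token's continuous latent. This is a generic sketch of the idea, not the paper's implementation; the function names, noise schedule, and step count are all illustrative assumptions.

```python
# Sketch: DDPM-style reverse process sampling one continuous latent token,
# conditioned on an autoregressive context vector. All names hypothetical.
import numpy as np

def sample_token(denoise_fn, cond, dim, steps=50, rng=None):
    """Sample one latent token by iteratively denoising Gaussian noise.

    denoise_fn(x, t, cond) -> predicted noise for latent x at step t.
    """
    rng = rng or np.random.default_rng()
    betas = np.linspace(1e-4, 0.02, steps)   # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(dim)             # start from pure noise
    for t in reversed(range(steps)):
        eps_hat = denoise_fn(x, t, cond)     # noise predicted by the module
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                            # add noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(dim)
    return x
```

In the full generation loop, the transformer would recompute `cond` from the tokens sampled so far, so sequences of arbitrary length can be produced one continuous token at a time.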

Figure 3. An autoregressive transformer predicts latent tokens conditioned on multimodal inputs, while a diffusion module predicts masked tokens to produce smooth, continuous HOI motion representations.
Table 1. Our latent diffusion approach combines arbitrary-length generation from autoregressive models with the smoothness and continuous representations of diffusion, enabling high-fidelity HOI motion synthesis.