Cascaded Autoencoder for Smooth HOI Latents
As illustrated in Figure 1, we introduce a cascaded autoencoder that learns continuous latent representations for hand–object interaction (HOI) motion by separately encoding object motion and articulated hand trajectories while preserving their interaction structure. This design enables the latent space to capture fine-grained temporal dynamics and avoids the quantization artifacts commonly introduced by discrete vector-quantized (VQ) models. To evaluate motion smoothness, we perform a jerk analysis (Figure 2), where jerk is defined as the third derivative of position with respect to time. Across multiple temporal window lengths, our autoencoder consistently yields lower jerk values than VQ-based baselines, indicating smoother, more natural, and temporally stable HOI reconstructions.
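The jerk metric above can be sketched with finite differences. This is a minimal illustration, not the paper's exact evaluation code: the function name `mean_jerk`, the sampling rate, and the use of three successive first-order differences are our assumptions.

```python
import numpy as np

def mean_jerk(positions, dt=1.0 / 30.0):
    """Mean jerk magnitude of a trajectory.

    positions: (T, D) array of positions sampled at a fixed interval dt.
    Jerk is the third time derivative of position, approximated here by
    applying a first-order finite difference three times.
    """
    vel = np.diff(positions, axis=0) / dt   # velocity,     shape (T-1, D)
    acc = np.diff(vel, axis=0) / dt         # acceleration, shape (T-2, D)
    jerk = np.diff(acc, axis=0) / dt        # jerk,         shape (T-3, D)
    return np.linalg.norm(jerk, axis=1).mean()
```

A quantized trajectory moves in discrete steps, so its finite-difference jerk spikes at each step; a continuous reconstruction of the same motion yields a much lower value, which is the behavior Figure 2 summarizes.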


Latent Diffusion for HOI Motion Generation
We introduce a latent diffusion framework for hand–object interaction motion generation that combines the strengths of diffusion and autoregressive modeling. As shown in Figure 3, an autoregressive transformer models the compositional structure of HOI motion in latent space, while a diffusion module refines per-token distributions to produce smooth and expressive motion trajectories. This design preserves the flexibility of autoregressive generation for arbitrary-length sequences while enabling continuous, high-fidelity motion synthesis under multimodal conditioning signals such as language.
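The per-token refinement step can be sketched as DDPM-style ancestral sampling of one continuous latent token, conditioned on the autoregressive transformer's context vector. This is a schematic sketch under our own assumptions: the tiny random-weight MLP `denoise_eps` stands in for the learned denoiser, and the dimensions, step count, and noise schedule are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the learned noise-prediction network: a tiny
# random-weight MLP taking the noisy token, the AR context vector, and a
# normalized diffusion step, and returning a predicted noise vector.
W1 = rng.normal(0.0, 0.1, (8 + 16 + 1, 64))
W2 = rng.normal(0.0, 0.1, (64, 8))

def denoise_eps(z_t, cond, t):
    h = np.tanh(np.concatenate([z_t, cond, t], axis=-1) @ W1)
    return h @ W2

def sample_token(cond, steps=20, latent_dim=8):
    """Sample one continuous latent token by iteratively denoising
    Gaussian noise under the autoregressive context `cond` (B, C)."""
    betas = np.linspace(1e-4, 0.02, steps)   # illustrative linear schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    z = rng.normal(size=(cond.shape[0], latent_dim))
    for i in reversed(range(steps)):
        t = np.full((cond.shape[0], 1), i / steps)
        eps = denoise_eps(z, cond, t)
        # DDPM posterior mean: (z - beta/sqrt(1-alpha_bar) * eps) / sqrt(alpha)
        z = (z - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps) / np.sqrt(alphas[i])
        if i > 0:  # inject noise at every step except the last
            z = z + np.sqrt(betas[i]) * rng.normal(size=z.shape)
    return z
```

Because each token is drawn from a continuous distribution rather than a discrete codebook, generation stays autoregressive over the sequence while avoiding quantization of the latent space.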


