Introduction

Video editing remains a fundamental challenge in generative AI, requiring both precise spatial control and seamless temporal consistency. While recent diffusion-based approaches have demonstrated impressive capabilities, they typically demand extensive per-video optimization or sacrifice editing precision for temporal coherence. We present a novel training-free framework that addresses these limitations through three key innovations:

Zero-Shot Efficiency. Our method performs complex video edits through inference-time techniques alone, eliminating the computational overhead of time-consuming per-video fine-tuning. This training-free paradigm enables immediate application to new videos without any adaptation.

VLM-Guided Semantic Adherence. We introduce a vision-language model (VLM) guided refinement mechanism that improves semantic fidelity and enforces faithful adherence to editing instructions. This integration leverages the rich multimodal understanding of VLMs to bridge the gap between user intent and generated content.
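
As a rough illustration only, a refinement loop of this kind might score each result with a VLM and re-run the edit when adherence is low. The names below (`refine_with_vlm`, `edit_fn`, `vlm_score_fn`), the scoring scale, and the guidance-scale feedback are illustrative assumptions, not the mechanism described in this paper.

```python
from typing import Any, Callable


def refine_with_vlm(
    edit_fn: Callable[[Any, str, float], Any],   # runs one training-free edit pass
    vlm_score_fn: Callable[[Any, str], float],   # VLM rates instruction adherence in [0, 1]
    source_video: Any,
    instruction: str,
    guidance: float = 7.5,
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Any:
    """Re-run the edit until the VLM judges the result faithful enough."""
    best_video, best_score = None, -1.0
    for _ in range(max_rounds):
        video = edit_fn(source_video, instruction, guidance)
        score = vlm_score_fn(video, instruction)   # semantic adherence check
        if score > best_score:
            best_video, best_score = video, score
        if score >= threshold:                     # faithful enough: stop early
            break
        guidance *= 1.15                           # push the next pass to follow the prompt harder
    return best_video
```

Raising the guidance scale is only one possible feedback signal; the same loop structure accommodates other corrections, such as rewriting the prompt or adjusting masks.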

Modular Composability. The framework is architected for plug-and-play integration with diverse existing diffusion models, seamlessly bridging state-of-the-art image editing and video generation backbones. This modular design establishes a robust, flexible platform that can readily incorporate future advances in generative modeling.
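
One hypothetical way such a plug-and-play interface could look is sketched below, with an image editor producing an edited keyframe that a video backbone then propagates. The protocols and the keyframe-propagation wiring are illustrative assumptions rather than the framework's actual API.

```python
from typing import Any, Protocol


class ImageEditor(Protocol):
    """Any instruction-following image-editing model."""
    def edit(self, frame: Any, instruction: str) -> Any: ...


class VideoBackbone(Protocol):
    """Any video-generation model that can propagate an edited keyframe."""
    def generate(self, keyframe: Any, num_frames: int) -> list[Any]: ...


def compose(editor: ImageEditor, backbone: VideoBackbone,
            keyframe: Any, instruction: str, num_frames: int = 16) -> list[Any]:
    """Chain an arbitrary image editor with an arbitrary video backbone."""
    edited_keyframe = editor.edit(keyframe, instruction)    # spatially precise edit
    return backbone.generate(edited_keyframe, num_frames)   # temporally consistent propagation
```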

We realize these contributions through two core technical components: (1) dual-path diffusion sampling and (2) VLM-guided semantic refinement.
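
Both components are detailed in later sections. As a loose mental model only, dual-path sampling can be pictured as two parallel denoising trajectories, one reconstructing the source video and one following the edit prompt, whose latents are combined at every step. The sketch below encodes that picture with hypothetical names and a simple linear blend; the actual fusion rule is not the one assumed here.

```python
import torch


def dual_path_sample(denoise, latent, src_cond, edit_cond, timesteps, mix=0.3):
    """denoise(latent, t, cond) -> less-noisy latent.

    `mix` controls how strongly the reconstruction path constrains the edit path.
    """
    recon, edit = latent.clone(), latent.clone()
    for t in timesteps:
        recon = denoise(recon, t, src_cond)        # path 1: reconstruct the source video
        edit = denoise(edit, t, edit_cond)         # path 2: follow the editing instruction
        edit = (1.0 - mix) * edit + mix * recon    # inject source structure into the edit
    return edit


if __name__ == "__main__":
    # Toy check with a stand-in "denoiser" on random latents.
    fake_denoise = lambda z, t, c: 0.9 * z
    z0 = torch.randn(1, 4, 8, 32, 32)              # (batch, channels, frames, height, width)
    print(dual_path_sample(fake_denoise, z0, None, None, timesteps=range(10)).shape)
```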