Our approach to high-fidelity video editing rests on two components: Dual-Path Diffusion Sampling for consistent generation, and VLM-Guided Semantic Refinement to enforce fidelity to the edit instruction.
1. Dual-Path Diffusion Sampling
We address the challenge of maintaining semantic consistency between the edited initial image and the resulting video sequence with a Dual-Path Diffusion Sampling scheme (Algorithm 2). Inspired by Coupled Diffusion Sampling (Alzayer et al.), which introduces a coupling function that pulls two independent diffusion sampling trajectories closer to one another, we extend this idea to bridge image editing and video generation models.
Concretely, during the reverse diffusion process of two concurrent diffusion models, an Image Editing model ($\theta_I$) and a Text-to-Video model ($\theta_V$), we introduce a Dual-Path Guidance Step. This step aligns the two sampling paths by computing a cross-guidance term from the discrepancy between their respective $\hat{x}_0$ predictions, $\lambda(\hat{x}_{0,I} - \hat{x}_{0,V})$, which is applied iteratively to the intermediate latent states. This coupling ensures that the generated video frames are not only temporally coherent but also semantically consistent with the final edited image, yielding a unified, high-quality output.
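To make the mechanism concrete, the following is a minimal sketch of one coupled reverse-diffusion step, assuming a DDIM-style sampler and hypothetical denoisers `eps_image` and `eps_video`; the latent shapes and the choice to couple the video's first frame to the image latent are illustrative assumptions, not the exact procedure of Algorithm 2.

```python
import torch

def predict_x0(z_t, eps, a_t):
    # DDIM x0 prediction: x0_hat = (z_t - sqrt(1 - a_t) * eps) / sqrt(a_t)
    return (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()

@torch.no_grad()
def dual_path_step(z_img, z_vid, t, eps_image, eps_video, alpha_bar, lam=0.1):
    """One coupled reverse step over the image latent and the video latents."""
    a_t = alpha_bar[t]

    # Independent x0 predictions from each path.
    x0_img = predict_x0(z_img, eps_image(z_img, t), a_t)   # (C, H, W)
    x0_vid = predict_x0(z_vid, eps_video(z_vid, t), a_t)   # (F, C, H, W)

    # Cross-guidance term lambda * (x0_I - x0_V), computed against the
    # frame that should match the edited image (frame 0, an assumption).
    delta = lam * (x0_img - x0_vid[0])
    x0_img = x0_img - delta   # pull the image path toward the video path
    x0_vid[0] += delta        # pull the video path toward the image path

    # Re-noise the corrected predictions to the next timestep (DDIM, eta=0),
    # i.e., apply the coupling to the intermediate latent states.
    a_prev = alpha_bar[t - 1] if t > 0 else torch.ones_like(a_t)

    def renoise(z, x0):
        eps_hat = (z - a_t.sqrt() * x0) / (1 - a_t).sqrt()
        return a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps_hat

    return renoise(z_img, x0_img), renoise(z_vid, x0_vid)
```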

2. VLM-Guided Semantic Refinement
To ensure that the generated video faithfully executes the desired edit while preserving unedited content, we incorporate a Vision-Language Model (VLM) into a refinement loop.
The VLM serves as a semantic critic, evaluating the generated output against two key criteria (a prompting sketch follows the list):
- Edit-Verification: The VLM is prompted to judge whether the editing instruction has been successfully executed in the output image/video.
- Identity-Preservation: The VLM verifies that, apart from the changes specified by the edit instruction, the generated content remains identical to the input.
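Below is a minimal sketch of the two critic queries, assuming a hypothetical `query_vlm(frames, prompt)` helper that returns a scalar score in [0, 1]; the prompt wording and the interface are illustrative assumptions, not the exact prompts we use.

```python
# Hypothetical prompts for the two criteria; the wording is illustrative.
EDIT_VERIFICATION_PROMPT = (
    "Edit instruction: '{instruction}'. Rate from 0 to 1 how well the "
    "output frames execute this edit relative to the input frames."
)
IDENTITY_PRESERVATION_PROMPT = (
    "Ignoring only the changes required by the edit instruction "
    "'{instruction}', rate from 0 to 1 how faithfully the output frames "
    "preserve the input frames."
)

def semantic_scores(query_vlm, input_frames, output_frames, instruction):
    """Return (edit_score, identity_score) for a generated clip."""
    frames = list(input_frames) + list(output_frames)  # the VLM sees both
    edit = query_vlm(frames, EDIT_VERIFICATION_PROMPT.format(instruction=instruction))
    identity = query_vlm(frames, IDENTITY_PRESERVATION_PROMPT.format(instruction=instruction))
    return edit, identity
```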
The VLM’s quantitative semantic evaluation is converted into a refinement signal (e.g., a VLM edit loss) that guides the optimization of the generation models (e.g., via LoRA fine-tuning). This closed-loop semantic refinement ensures that the final generated video is both visually compelling and strictly compliant with the user’s intent.
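The loop below sketches this refinement, reusing the `semantic_scores` helper from the previous sketch and assuming the VLM scores are differentiable with respect to the LoRA parameters (e.g., derived from answer-token logits); with a non-differentiable critic, a reward-based update would replace the backward pass.

```python
import torch

def vlm_refinement_loss(edit_score, identity_score, w_edit=1.0, w_id=1.0):
    # Both scores lie in [0, 1] with higher being better, so minimize their
    # complements; the weights trade off edit fidelity vs. preservation.
    return w_edit * (1.0 - edit_score) + w_id * (1.0 - identity_score)

def refine(generate, score_fn, lora_params, steps=10, lr=1e-4):
    """Illustrative closed loop: generate, score with the VLM, update LoRA.

    `generate()` runs the generation models with LoRA-adapted weights and
    `score_fn(frames)` returns (edit_score, identity_score) as tensors,
    e.g. via the `semantic_scores` helper above.
    """
    opt = torch.optim.AdamW(lora_params, lr=lr)
    for _ in range(steps):
        frames = generate()
        edit, identity = score_fn(frames)
        loss = vlm_refinement_loss(edit, identity)
        opt.zero_grad()
        loss.backward()   # assumes scores carry gradients to the LoRA params
        opt.step()
    return lora_params
```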

