Related Work

Training-Based Video Editing Methods

Training-based approaches require additional training or fine-tuning but enable capabilities that zero-shot methods cannot provide. Tune-A-Video [Wu et al., ICCV 2023] pioneered one-shot fine-tuning with 300-500 steps on a single video-text pair, while VideoComposer [Wang et al., NeurIPS 2023] trained a Spatio-Temporal Condition encoder on WebVid-10M for compositional video synthesis. InstructVid2Vid [Qin et al., IEEE 2024] enables natural-language instruction following by training a 3D U-Net on synthesized video-instruction triplets. CCEdit [Feng et al., CVPR 2024] disentangles structure and appearance control via a trident network architecture, and VMC [Lee et al., CVPR 2024] achieves motion-appearance disentanglement by fine-tuning only the temporal attention layers with a motion distillation objective. I2VEdit [Liu et al., SIGGRAPH Asia 2024] trains motion LoRAs for first-frame-guided long video editing, while VACE [Zhang et al., 2025] provides an all-in-one framework whose unified Video Condition Unit supports reference-to-video, video-to-video, and masked editing operations.
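
To make the selective-tuning idea concrete, the sketch below freezes a pretrained video diffusion U-Net and updates only its temporal attention parameters on a single video-text pair, in the spirit of Tune-A-Video and VMC. It is a minimal illustrative sketch, not any released implementation: the module-name filter ("temporal", "temp_attn"), the diffusers-style unet/scheduler interfaces, and the epsilon-prediction loss are assumptions.

import torch
import torch.nn.functional as F

def select_temporal_attention_params(unet, keywords=("temporal", "temp_attn")):
    # Freeze every parameter, then re-enable gradients only for modules whose
    # names suggest temporal attention (the naming convention is an assumption).
    for p in unet.parameters():
        p.requires_grad_(False)
    for name, module in unet.named_modules():
        if any(key in name.lower() for key in keywords):
            for p in module.parameters():
                p.requires_grad_(True)
    return [p for p in unet.parameters() if p.requires_grad]

def one_shot_finetune_step(unet, scheduler, latents, text_emb, optimizer):
    # One denoising-loss step on the single video-text pair being edited
    # (epsilon-prediction objective; diffusers-style unet/scheduler assumed).
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

The optimizer would be constructed over the parameter list returned by select_temporal_attention_params, so that the frozen spatial layers retain the pretrained appearance prior while only the temporal pathway adapts to the edited clip.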

Training-Free Video Editing Methods

Training-free methods leverage pre-trained text-to-image diffusion models through attention manipulation and temporal correspondence strategies. Early works include Text2Video-Zero [Khachatryan et al., ICCV 2023], which introduces cross-frame attention, and FateZero [Qi et al., ICCV 2023], which fuses attention maps captured during inversion. TokenFlow [Geyer et al., ICLR 2024] propagates diffusion features across frames via nearest-neighbor matching in the latent space, while FRESCO [Yang et al., CVPR 2024] combines spatial-temporal correspondences with EbSynth propagation for robustness to fast motion. FLATTEN [Cong et al., ICLR 2024] integrates optical-flow guidance using RAFT to enforce attention along motion paths. Recent methods achieve temporal consistency by restructuring how frames interact during denoising: RAVE [Kara et al., CVPR 2024 Highlight] introduces randomized noise shuffling that reorganizes frames into grids at each denoising step, enabling implicit global spatio-temporal attention while remaining 25% faster than baselines, and VidToMe [Li et al., CVPR 2024] merges similar self-attention tokens across frames according to temporal correspondence, achieving a 50% memory reduction and a 10× latency improvement while enforcing feature alignment across frames. Additional methods include Rerender-A-Video [Yang et al., SIGGRAPH Asia 2023] with hierarchical cross-frame constraints, Ground-A-Video [Jeong et al., ICLR 2024] for grounding-driven multi-attribute editing, and Object-Centric Diffusion [Jeong et al., ECCV 2024] for efficient object-focused editing.
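
The attention-manipulation idea underlying many of these training-free methods can be illustrated with a cross-frame attention sketch in the spirit of Text2Video-Zero: each frame's self-attention queries attend to the keys and values of a shared anchor frame so that appearance stays consistent across the clip. The tensor shapes, the single-anchor choice, and the plain softmax attention below are illustrative assumptions rather than any particular released implementation.

import torch

def cross_frame_attention(q, k, v, anchor=0):
    # q, k, v: (num_frames, num_tokens, dim) self-attention projections per frame.
    # Keys/values are taken from the anchor frame only, so every frame is
    # rendered against the same reference appearance.
    num_frames = q.shape[0]
    k_a = k[anchor].unsqueeze(0).expand(num_frames, -1, -1)
    v_a = v[anchor].unsqueeze(0).expand(num_frames, -1, -1)
    attn = torch.softmax(q @ k_a.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v_a

# Example: 8 frames, 64 spatial tokens, 320-dim features (shapes are illustrative).
q, k, v = (torch.randn(8, 64, 320) for _ in range(3))
out = cross_frame_attention(q, k, v)  # (8, 64, 320), all frames keyed to frame 0

Variants of this mechanism differ mainly in where the keys and values come from: an anchor or previous frame (cross-frame attention), flow-matched positions (FLATTEN), nearest-neighbor correspondences (TokenFlow), or merged token sets (VidToMe).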