Motivation
Lighting plays a fundamental role in visual storytelling, shaping mood, focus, and narrative across cinematic and digital media. While significant progress has been made in single-image relighting using diffusion-based methods, these methods often fall short when applied to videos due to temporal inconsistencies and the lack of motion-aware modeling. Yet extending relighting to video opens the door to a broad range of applications, from post-production enhancement in filmmaking to immersive augmented reality experiences and dynamic scene editing in virtual environments. This project aims to develop a diffusion-based video relighting framework that preserves temporal coherence while offering high-quality, controllable lighting transformations across frames.
Related Works
A myriad of diffusion-based image relighting methods has emerged, and they generally follow the same paradigm: lighting conditioning is injected into the diffusion model through attention, optionally with a lighting consistency loss enforced during training. Below, we highlight two seminal papers on this task.
IC-Light introduces a diffusion-based framework for single-image relighting that conditions the generation process on target lighting environments. By integrating lighting information through attention mechanisms and enforcing a lighting consistency loss during training, IC-Light achieves realistic and coherent relighting results across various scenes. This approach laid the groundwork for subsequent methods by demonstrating the effectiveness of combining diffusion models with lighting-aware conditioning.


Neural Gaffer builds upon and extends these concepts by proposing an end-to-end 2D relighting diffusion model capable of relighting any object from a single image under novel environmental lighting conditions. Unlike previous methods that often require explicit scene decomposition or are limited to specific object categories, Neural Gaffer conditions a pre-trained diffusion model directly on target environment maps, eliminating the need for intrinsic component estimation. Trained on the extensive RelitObjaverse dataset, which comprises approximately 18.4 million rendered images of 90,000 high-quality 3D objects under diverse lighting conditions, Neural Gaffer demonstrates superior generalization and accuracy on both synthetic and real-world images. Furthermore, it extends to 3D tasks by serving as a strong relighting prior for neural radiance fields, enabling efficient and high-fidelity 3D relighting without traditional inverse rendering pipelines.


Method
Inspired by these methods, we adopt a similar architecture and extend it to video diffusion models. Specifically, we modify the architecture of Stable Video Diffusion by adding cross-attention layers to the diffusion U-Net that attend to the lighting condition, which we provide in the form of environment maps. While environment maps are not the most accessible form of input for users in downstream applications, they allow us to systematically evaluate the model's outputs against ground-truth renderings. The resulting model is expected to generate videos that adhere to both the given text prompt and the specified environment map.
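To make this conditioning pathway concrete, below is a minimal PyTorch sketch of the kind of environment-map cross-attention block we add to the U-Net; the module and argument names (e.g., EnvMapCrossAttention, env_tokens, env_dim) are illustrative placeholders rather than the exact layers in our codebase.

```python
import torch
import torch.nn as nn

class EnvMapCrossAttention(nn.Module):
    """Cross-attention block in which U-Net features attend to environment-map tokens.

    `dim` is the channel width of the U-Net block and `env_dim` the width of the
    encoded environment-map tokens; both names are placeholders.
    """

    def __init__(self, dim: int, env_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=dim, kdim=env_dim, vdim=env_dim,
            num_heads=num_heads, batch_first=True,
        )

    def forward(self, x: torch.Tensor, env_tokens: torch.Tensor) -> torch.Tensor:
        # x:          (batch, sequence, dim)      flattened spatio-temporal features
        # env_tokens: (batch, n_tokens, env_dim)  encoded environment map
        attn_out, _ = self.attn(self.norm(x), env_tokens, env_tokens)
        # Residual connection: the frozen backbone's features pass through unchanged,
        # plus a learned lighting-dependent correction.
        return x + attn_out
```

In practice, the output projection of such a block can be zero-initialized so that training starts from the unmodified Stable Video Diffusion behavior.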
The figure below shows our preliminary architecture: we keep the self-attention and (textual) cross-attention layers of the original diffusion U-Net frozen, and train only the added environment-map cross-attention layers (in purple).

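A corresponding training-setup sketch, assuming the added layers are registered under module names containing a marker such as env_attn (a naming convention chosen purely for illustration):

```python
import torch.nn as nn

def freeze_all_but_env_attention(unet: nn.Module, marker: str = "env_attn"):
    """Freeze the pretrained U-Net and leave only the added environment-map
    cross-attention layers trainable. Assumes those layers were registered
    with `marker` in their module names."""
    for name, param in unet.named_parameters():
        param.requires_grad = marker in name
    # Return only the trainable parameters so the optimizer never touches the backbone.
    return [p for p in unet.parameters() if p.requires_grad]

# Hypothetical usage:
# trainable = freeze_all_but_env_attention(unet)
# optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```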
We encode the environment map using a pretrained VAE, employing three complementary formats. First, we tonemap the environment map from HDR to LDR so that it remains within the distribution expected by the VAE. To mitigate the information loss introduced by tonemapping, particularly in extreme highlights and shadow regions, we also include a logarithmic representation of the environment map that preserves detail at these intensities. Finally, to ensure that the environment map informs the video generation in a view-specific manner, we provide directional embeddings corresponding to the camera angle of each frame.
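The sketch below shows one way these three signals could be assembled; the Reinhard-style tonemapping operator, the log normalization, and the sinusoidal azimuth embedding are illustrative choices, and the VAE's encode interface is an assumption rather than a specific library API.

```python
import torch

def encode_environment_map(env_hdr: torch.Tensor, cam_azimuth: torch.Tensor, vae):
    """Build the three conditioning signals for one video clip.

    env_hdr:     (3, H, W) linear HDR environment map.
    cam_azimuth: (num_frames,) per-frame camera azimuth in radians.
    vae:         a pretrained image VAE exposing an `encode` method (assumed interface).
    """
    # 1) Tonemapped LDR copy (Reinhard operator as an illustrative choice),
    #    keeping the input inside the VAE's expected [0, 1] range.
    ldr = env_hdr / (1.0 + env_hdr)

    # 2) Log-scaled copy to retain highlight and shadow detail that
    #    tonemapping compresses, normalized back to [0, 1].
    log_map = torch.log1p(env_hdr)
    log_map = log_map / log_map.amax().clamp(min=1e-6)

    ldr_latent = vae.encode(ldr.unsqueeze(0))
    log_latent = vae.encode(log_map.unsqueeze(0))

    # 3) Per-frame directional embedding: a simple sinusoidal encoding of the
    #    camera azimuth so each frame conditions on the environment map
    #    in a view-dependent way.
    freqs = 2.0 ** torch.arange(4, dtype=torch.float32)          # (4,)
    angles = cam_azimuth[:, None] * freqs[None, :]               # (frames, 4)
    dir_embed = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (frames, 8)

    return ldr_latent, log_latent, dir_embed
```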