Problem Formulation

Given a user-provided text prompt, our task is to generate a seamlessly looping animated video (also known as a cinemagraph) that accurately depicts the scene described in the prompt. Within the scope of this task, we mainly deal with scenes containing repeating fluid textures, such as waterfalls, rivers, lakes, and oceans.

Input: Text prompt describing a scene.

Example Input: “A large river flowing in front of the mountain in the style of starry nights painting”

Output: A seamlessly looping video that matches the description.

Example Output:

Datasets

For this task, we do not have a dataset of text prompts paired with ground-truth looping videos. Instead, we only have a small dataset of around 1000 unique training videos and around 30 unique testing videos, each with its ground-truth average optical flow (the mean optical flow over the entire video). The dataset was collected and open-sourced by Holynski et al. 2021. Each video is divided into clips of 60 frames, giving a total of 4750 training clips and 162 testing clips, each at 30 frames per second (2 seconds per clip) and a resolution of 720×1280.
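
As an illustration of the preprocessing described above, the following is a minimal sketch of how a video could be cut into 60-frame clips and how an average optical flow field could be computed from per-frame flows. The function names (split_into_clips, average_flow) are hypothetical, and the choice of non-overlapping clips with the trailing remainder dropped is an assumption; the dataset's actual splitting procedure may differ.

```python
import numpy as np

CLIP_LEN = 60  # frames per clip, as stated in the text


def split_into_clips(num_frames, clip_len=CLIP_LEN):
    """Return (start, end) frame index pairs for consecutive clips.

    Assumption: clips do not overlap and any trailing frames that do not
    fill a whole clip are dropped.
    """
    last_start = num_frames - clip_len
    return [(s, s + clip_len) for s in range(0, last_start + 1, clip_len)]


def average_flow(flows):
    """Average a stack of per-frame flow fields over time.

    flows: array of shape (T, H, W, 2) holding (dx, dy) per pixel.
    Returns an array of shape (H, W, 2), the mean flow for the video.
    """
    flows = np.asarray(flows, dtype=np.float32)
    return flows.mean(axis=0)
```

For example, under these assumptions a 150-frame video yields two clips, covering frames 0 to 59 and 60 to 119, and the last 30 frames are discarded.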

References

Holynski, Aleksander, et al. “Animating pictures with eulerian motion fields.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.