Motivation

There has been a recent boom in text-to-image generation, with Stable Diffusion, DALL·E 2, and Imagen allowing users to generate photorealistic, highly detailed images from text prompts alone. There has also been a lot of recent work on the more challenging problem of text-to-video synthesis, notably Imagen Video and Make-A-Video. However, text-to-cinemagraph generation, that is, producing seamlessly looping aesthetic videos (cinemagraphs) from text, remains relatively unexplored.

Text-to-image models can already generate high-fidelity images that are coherent with text prompts. If they could be extended to generate high-quality animations directly from a text query, professionals would have a very easy way to create video assets for movies, advertisements, games, and similar media, rather than spending enormous amounts of time creating them manually. In this work, we focus mainly on creating seamlessly, infinitely looping videos of scenes such as waterfalls, rivers, and oceans, which can serve as background assets.

Additionally, with the rapid rise in popularity of short-video social media platforms such as Instagram and TikTok, text-to-cinemagraph generation also looks like a very appealing tool for amateurs.

Figure 1: Cinemagraph (looping video) generated from a text caption.

Adapting Stable Diffusion Models for Video Synthesis
