Text-to-Trajectory Model - Computer Vision for Cinematographic Motion Control

We evaluated two architectures before settling on our final text-to-trajectory model: a diffusion transformer and an autoregressive transformer.

DIRECTOR

The first model we tested was DIRECTOR [1], a diffusion-transformer model that generates a full camera trajectory, with the text prompt conditioning the diffusion process.

DIRECTOR was originally trained on the E.T. (Exceptional Trajectories) dataset, which was created by the model’s authors.
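To make the idea of text-conditioned diffusion concrete, here is a minimal sketch of a deterministic denoising loop that refines every frame of the trajectory together. It is not DIRECTOR's actual sampler or network: `toy_denoiser`, `sample_trajectory`, the cosine noise schedule, and the use of a 3-D direction vector in place of a text embedding are all assumptions made for illustration.

```python
import math
import random

def toy_denoiser(noisy_traj, step, text_condition):
    """Hypothetical stand-in for the diffusion transformer.

    A trained network would predict the clean trajectory from the noisy one,
    the step index, and the text embedding. This toy ignores the first two
    and returns a straight-line path along the direction in text_condition.
    """
    num_frames = len(noisy_traj)
    return [[text_condition[d] * i / (num_frames - 1) for d in range(3)]
            for i in range(num_frames)]

def sample_trajectory(text_condition, num_frames=8, num_steps=10, seed=0):
    rng = random.Random(seed)
    # Assumed cosine schedule: signal level 1.0 at step 0, about 0 at the last step.
    alpha_bar = [math.cos(0.5 * math.pi * t / num_steps) ** 2
                 for t in range(num_steps + 1)]
    # Start from pure Gaussian noise for the whole trajectory (all frames at once).
    traj = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(num_frames)]
    for t in range(num_steps, 0, -1):
        clean = toy_denoiser(traj, t, text_condition)
        a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
        for i in range(num_frames):
            for d in range(3):
                # Noise implied by the current sample and the predicted clean value.
                noise = (traj[i][d] - math.sqrt(a_t) * clean[i][d]) / math.sqrt(1.0 - a_t)
                # Deterministic (DDIM-style) move to the next, less noisy step.
                traj[i][d] = math.sqrt(a_prev) * clean[i][d] + math.sqrt(1.0 - a_prev) * noise
    return traj

# Hypothetical condition: a downward direction standing in for a text embedding.
trajectory = sample_trajectory(text_condition=[0.0, -1.0, 0.0])
```

The point of the sketch is the control flow: every denoising step updates the entire trajectory, and the text condition is passed to the denoiser at every step.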

GenDoP

The second model we tested was GenDoP [2]. Unlike the diffusion-based DIRECTOR, it is autoregressive: the encoded text prompt is the initial input, and each new trajectory token is predicted from that encoding together with all previously generated trajectory tokens.

GenDoP was originally trained on the DataDoP dataset, which was released by the paper’s authors.
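The autoregressive loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not GenDoP's implementation: `toy_next_token_scores` is a hypothetical placeholder for the trained transformer, and the vocabulary size, end-of-sequence token, and greedy decoding are all chosen only for the example.

```python
def toy_next_token_scores(text_encoding, generated, vocab_size):
    """Hypothetical stand-in for the trained decoder.

    Returns one score per candidate token. The scores depend on the text
    encoding and on every trajectory token generated so far.
    """
    context = sum(text_encoding) + sum(generated) + len(generated)
    return [(context * 31 + token * 17) % 13 for token in range(vocab_size)]

def generate_trajectory_tokens(text_encoding, max_tokens=16, vocab_size=8, eos_token=7):
    generated = []
    for _ in range(max_tokens):
        # Condition on the text encoding plus all previously generated tokens.
        scores = toy_next_token_scores(text_encoding, generated, vocab_size)
        # Greedy decoding (an assumption): take the highest-scoring token.
        next_token = max(range(vocab_size), key=lambda token: scores[token])
        if next_token == eos_token:
            break  # the model signals that the trajectory is complete
        generated.append(next_token)
    return generated

# Hypothetical integer "encoding" standing in for the text encoder's output.
tokens = generate_trajectory_tokens(text_encoding=[2, 5, 1])
```

In contrast to a diffusion model, the trajectory here grows one token at a time, so each prediction can depend on the motion generated so far.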

Results Comparison

We trained both models further, pruning their datasets and fine-tuning each model to get the best results from it. Below are some comparisons between the two.

“Camera tilts down”

Fig. 3 – DIRECTOR Output

Fig. 4 – GenDoP Output

The results above show that GenDoP’s output is much more stable and follows the prompt to tilt the camera down more accurately. DIRECTOR’s output, by contrast, is unstable and does not follow the prompt well.

“Camera orbits left”

Fig. 5 – DIRECTOR Output

Fig. 6 – GenDoP Output

These results show that neither model understands the word ‘orbit’, because it appears only rarely in either dataset. However, GenDoP’s output is still much more stable, which suggests the model holds up better even in edge cases like this.

Based on these comparisons, we selected GenDoP as our final text-to-trajectory model. We are still manually generating additional training data for terms such as ‘orbit’, computing the trajectories from mathematical equations, to improve the dataset and, in turn, the model’s outputs.
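As an illustration of what equation-based data for a term like ‘orbit’ could look like, the sketch below places the camera on a circular arc around a subject at the origin and points it at the subject in every frame. The function name `orbit_trajectory`, the y-up coordinate frame, the convention that "left" means the camera moves toward negative x, and the caption/trajectory pairing are assumptions for this example, not our actual data format.

```python
import math

def orbit_trajectory(radius=2.0, height=0.5, sweep_deg=90.0,
                     num_frames=30, direction="left"):
    """Camera poses on a circular arc around a subject at the origin.

    Assumed conventions: y is up, the camera starts on the +z axis looking
    at the subject, and "left" moves the camera toward negative x.
    """
    sign = -1.0 if direction == "left" else 1.0
    frames = []
    for i in range(num_frames):
        theta = math.radians(sweep_deg) * i / (num_frames - 1)
        # Position on a circle of the given radius at a fixed height.
        position = (sign * radius * math.sin(theta), height, radius * math.cos(theta))
        # Unit vector from the camera toward the subject at the origin.
        dist = math.sqrt(sum(c * c for c in position))
        forward = tuple(-c / dist for c in position)
        frames.append({"position": position, "forward": forward})
    return frames

# Hypothetical training sample pairing a caption with the generated path.
sample = {"caption": "Camera orbits left",
          "trajectory": orbit_trajectory(direction="left")}
```

Varying `radius`, `height`, and `sweep_deg` would give many distinct trajectories that all share the same caption.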

References

[1] Courant, R., Dufour, N., Wang, X., Christie, M., & Kalogeiton, V. (2024). E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness. arXiv. https://arxiv.org/abs/2407.01516

[2] Zhang, M., Wu, T., Tan, J., Liu, Z., Wetzstein, G., & Lin, D. (2025). GenDoP: Auto-regressive camera trajectory generation as a director of photography. arXiv. https://arxiv.org/abs/2504.07083