Training Data

One roadblock to training any intelligent system is the availability of training data. For our project, we have decided to use a mix of existing and generated datasets, since neither source alone meets our requirements for data modalities and volume. By designing an architecture around the data that is actually available, we can produce the best possible final result.

Existing Datasets

The Exceptional Trajectories

One dataset that fits our use case well is The Exceptional Trajectories [3], which contains 115,000 samples pairing text prompts with camera trajectories and tracked human SMPL bodies. This lets us train a text-to-trajectory model on the text and camera trajectories, and then fine-tune our LLM agent to combine these trajectories with anchors, which stand in for the humans tracked in the dataset.

Fig. 1 – Sample Data from The Exceptional Trajectories
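To make the intended training pairs concrete, the Python sketch below shows one possible in-memory representation of a dataset sample. The field names, array shapes, and the assumption that the SMPL root translation occupies the first three parameters are ours for illustration, not the dataset's actual schema.

# Minimal sketch of one The-Exceptional-Trajectories-style sample used to
# build a text-to-trajectory training example. Field names and shapes are
# assumptions, not the dataset's real format.
from dataclasses import dataclass
import numpy as np


@dataclass
class ETSample:
    caption: str                 # natural-language description of the shot
    camera_traj: np.ndarray      # (T, 4, 4) camera-to-world pose per frame (assumed layout)
    smpl_params: np.ndarray      # (T, D) tracked SMPL body parameters (assumed layout)


def to_training_example(sample: ETSample):
    """Build a (text, trajectory, anchor) triple, substituting the tracked
    human with an abstract anchor (here: the assumed SMPL root translation)."""
    anchor = sample.smpl_params[:, :3]   # assumption: first three params are root xyz
    return sample.caption, sample.camera_traj, anchor


# Usage with dummy data for an 80-frame clip:
dummy = ETSample("slow push-in on the actor",
                 np.tile(np.eye(4), (80, 1, 1)),
                 np.zeros((80, 75)))
text, traj, anchor = to_training_example(dummy)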

Condensed Movies

The original dataset used to generate The Exceptional Trajectories is Condensed Movies [1], which contains over 30,000 captioned clips drawn from more than 3,000 different movies. These captions can be leveraged to fine-tune our LLM agent to better understand how to stitch together anchors and trajectories.
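As a rough illustration of how the captions could be turned into fine-tuning records for the LLM agent, the sketch below wraps caption entries in an instruction-style prompt. The field names and prompt template are placeholders we chose for illustration, and the matching target anchor/trajectory sequences would still need to be paired in separately.

# Sketch: convert captioned clips into instruction-style fine-tuning records.
# The "videoid"/"caption" keys and the prompt wording are assumptions.
import json


def make_finetune_record(clip: dict) -> dict:
    """Wrap one captioned clip into an instruction-style fine-tuning record."""
    prompt = ("Plan how to film the following scene as a sequence of anchors "
              "and camera trajectories:\n" + clip["caption"])
    return {"id": clip["videoid"], "prompt": prompt}


# Usage with a made-up entry:
clips = [{"videoid": "abc123",
          "caption": "Two characters argue across a kitchen table."}]
records = [make_finetune_record(c) for c in clips]
print(json.dumps(records, indent=2))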

Generated Data

Another avenue for creating training data pairs is to generate them from video clips. While this method will likely yield lower accuracy, it allows us to generate data for exactly the type of footage we want.

Fig. 2 – Sample Outputs from SLAHMR

One key package used in this pipeline is SLAHMR [4], a SLAM-based method that recovers both camera and human motion from a video. This gives us camera motion estimated directly from the 2D video, while also providing tracked humans that can serve as anchors when training our models.
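The sketch below shows one way SLAHMR-style outputs could be post-processed into (camera trajectory, anchor) pairs. We assume the results have already been exported to an .npz file containing "cam_poses" and "human_root" arrays; these names and shapes are our assumptions and do not reflect SLAHMR's actual output format.

# Sketch: turn exported per-frame camera poses and human root positions into
# training pairs. File layout and array names are assumptions.
import numpy as np


def load_trajectory_and_anchors(npz_path: str):
    """Load exported camera poses and human root positions, and express the
    anchor in camera-relative coordinates so the pair does not depend on the
    arbitrary world frame chosen by the SLAM solve."""
    data = np.load(npz_path)
    cam_poses = data["cam_poses"]      # (T, 4, 4) camera-to-world matrices (assumed)
    human_root = data["human_root"]    # (T, 3) human root positions (assumed)
    cam_inv = np.linalg.inv(cam_poses)                                   # (T, 4, 4)
    root_h = np.concatenate([human_root, np.ones((len(human_root), 1))], axis=1)
    anchor_cam = np.einsum("tij,tj->ti", cam_inv, root_h)[:, :3]
    return cam_poses, anchor_cam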

Fig. 3 – Sample Outputs from Video-Depth-Anything

Additionally, to fine-tune our anchor detection pipeline, we can bolster this data with Video-Depth-Anything [2], a model capable of generating consistent depth from RGB video. This yields samples containing video, depth, and camera trajectories, to which we manually add a text prompt. Such samples cover the full use case of our models and can be used to fine-tune the pipeline end-to-end.
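A minimal sketch of the resulting generated sample is shown below, combining RGB frames, depth maps, the camera trajectory from the SLAM stage, and the manually written prompt. The container class, field names, and array shapes are assumptions for illustration only.

# Sketch: one generated training sample assembled from RGB video, estimated
# depth, and the recovered camera trajectory, plus a manual text label.
from dataclasses import dataclass

import numpy as np


@dataclass
class GeneratedSample:
    frames: np.ndarray       # (T, H, W, 3) RGB video frames
    depth: np.ndarray        # (T, H, W) per-frame depth maps (assumed shape)
    camera_traj: np.ndarray  # (T, 4, 4) camera poses from the SLAM stage
    prompt: str = ""         # filled in during manual labeling


def attach_prompt(sample: GeneratedSample, prompt: str) -> GeneratedSample:
    """Manual labeling step: a human reviews the clip and writes the prompt."""
    sample.prompt = prompt
    return sample


# Usage with dummy arrays for a short 30-frame clip:
T, H, W = 30, 64, 64
sample = GeneratedSample(np.zeros((T, H, W, 3), dtype=np.uint8),
                         np.zeros((T, H, W), dtype=np.float32),
                         np.tile(np.eye(4), (T, 1, 1)))
sample = attach_prompt(sample, "handheld tracking shot following a runner")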

References

[1] Max Bain, Arsha Nagrani, Andrew Brown, and Andrew Zisserman. Condensed movies: Story based retrieval with contextual embeddings, 2020.

[2] Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos, 2025.

[3] Robin Courant, Nicolas Dufour, Xi Wang, Marc Christie, and Vicky Kalogeiton. E.T. the exceptional trajectories: Text-to-camera-trajectory generation with character awareness, 2024.

[4] Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Decoupling human and camera motion from videos in the wild, 2023.