Methodology

Fig. 1 – System Overview For Camera Motion Control

Our pipeline (Figure 1) is designed to interpret, generate, and refine user-defined motion trajectories, enabling intuitive control of a robotic camera rig via smartphone-based gestures and natural language. It allows a user to create a new trajectory (green pathway) or modify an existing trajectory (blue pathway). The pipeline has three key modules:

1. Phone Trajectory Extractor: Localizes the phone to construct an initial motion path using various sensors available on an iPhone.

We use RGB-D and IMU data from the phone to track its trajectory offline with ORB-SLAM3. This gives us the 3D position and orientation quaternion of the camera at each frame.
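As a minimal sketch, assuming the trajectory is exported in the TUM format that ORB-SLAM3 can save (one "timestamp tx ty tz qx qy qz qw" line per frame), the per-frame poses could be loaded as follows; the file name is illustrative:

import numpy as np

def load_tum_trajectory(path):
    """Load an ORB-SLAM3 trajectory saved in TUM format.

    Each line: timestamp tx ty tz qx qy qz qw
    Returns (timestamps, positions [N,3], quaternions [N,4]).
    """
    data = np.loadtxt(path, comments="#")
    timestamps = data[:, 0]
    positions = data[:, 1:4]     # camera position in the world frame
    quaternions = data[:, 4:8]   # orientation as (qx, qy, qz, qw)
    return timestamps, positions, quaternions

# Example usage (file name is hypothetical):
# t, p, q = load_tum_trajectory("phone_trajectory.txt")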


2. Intent-to-Motion Module: Maps high-level user prompts into coarse motion trajectories. It generates new trajectories from scratch or modifies an existing trajectory.
Fig. 2 – Intent-to-Motion Module Architecture

Our module builds upon the methodology introduced by Liu et al. in 2024 [2]. It has three key components: an LLM Agent, a Text-To-Trajectory Model, and an Anchor Detector.

LLM Agent

The LLM Agent is responsible for understanding user intent by breaking down user instructions into a series of trajectory descriptions using a customized prompt. After analyzing each description for relevant attributes, it calls the downstream modules, the Text-To-Trajectory Model and the Anchor Detector, to generate atomic trajectories. The LLM Agent also defines the order of these trajectories, which is used at the end to combine the atomic trajectories into a single trajectory.
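As a minimal sketch, the agent's output could be structured as a small plan that the downstream modules consume; the schema and field names below are illustrative assumptions on our part, not something prescribed by ChatCam [2]:

import json

# Hypothetical plan emitted by the LLM Agent for the prompt
# "orbit around the actor, then dolly in towards the doorway".
plan = json.loads("""
{
  "segments": [
    {"order": 1, "text": "slow orbit around the actor", "anchor": "actor"},
    {"order": 2, "text": "dolly in towards the doorway", "anchor": "doorway"}
  ]
}
""")

for segment in sorted(plan["segments"], key=lambda s: s["order"]):
    # Each segment's text goes to the Text-To-Trajectory Model,
    # and its anchor name goes to the Anchor Detector.
    print(segment["order"], segment["text"], "->", segment["anchor"])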

Text-To-Trajectory Model

This model is responsible for producing a camera trajectory based on the text prompt from the LLM Agent.

Similar to ChatCam [2], we plan to use a cross-modal transformer decoder for text-conditioned trajectory generation.

Fig. 3 – Text-to-Trajectory Tokenizer and Transformer Models

Adapting from TEMOS [4], we aim to implement a VAE-based tokenizer to map trajectories and text descriptions into a shared latent space. This tokenizer would be built on the VQ-VAE [3] architecture, consisting of a discrete codebook Z of K latent vectors, each of dimension d. Each trajectory token z can then be quantized by mapping it to the nearest latent vector in the codebook, as follows:
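$$\hat{z} = \arg\min_{z_k \in Z} \lVert z - z_k \rVert_2, \qquad k = 1, \dots, K,$$

i.e., the standard nearest-neighbour assignment used in VQ-VAE [3].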

Given these latent embeddings, following ChatCam [2], we plan to fine-tune a transformer to predict trajectories from text as well as text descriptions from trajectories. The model would be trained using text-trajectory pairs from existing and generated datasets.
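As a minimal sketch of one way to train a single decoder in both directions (this detail is our assumption and may differ from ChatCam [2]), each text-trajectory pair can be turned into two examples distinguished by a direction tag:

# Hypothetical construction of bidirectional training examples from one
# text-trajectory pair; the special tags and token names are illustrative.
def make_training_examples(text_tokens, traj_tokens):
    return [
        (["<text2traj>"] + text_tokens, traj_tokens),   # text -> trajectory
        (["<traj2text>"] + traj_tokens, text_tokens),   # trajectory -> text
    ]

# Example:
# make_training_examples(["a", "slow", "orbit"], ["<z_17>", "<z_3>", "<z_42>"])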

Anchor Detector

This model takes text prompts and the corresponding scene and detects the anchors from the user prompt that the trajectories will connect. Examples of anchors include actors within the scene, or specific objects or locations in the scene that are specified by the user.

Fig. 4 – Anchor Detections Seen from Birds-Eye-View

To perform anchor detection, a grounded LLM, such as 3D-LLM [1], can be used to obtain specific anchor locations from the text prompts provided by the LLM Agent.

These anchor locations can then be used by the LLM Agent to create affine transformations that join together the atomic trajectories from the previous module.
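As a minimal sketch, assuming each affine transformation reduces to a rigid rotation plus translation of an atomic segment (the exact form of the transformations is not fixed yet), a segment can be placed so that it starts at an anchor or at the endpoint of the previous segment:

import numpy as np

def place_segment(positions, start_point, heading_deg=0.0):
    """Rigidly transform an atomic trajectory so it begins at start_point.

    positions: (N, 3) array of camera positions for one atomic segment.
    start_point: (3,) world-space start, e.g. an anchor location or the
                 endpoint of the previous segment.
    heading_deg: optional yaw about the vertical (z) axis.
    """
    theta = np.deg2rad(heading_deg)
    # Rotate about z, then translate so the first point lands on start_point.
    R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
    rotated = positions @ R.T
    return rotated + (np.asarray(start_point) - rotated[0])

# Chain two hypothetical segments: the second starts where the first ends.
# seg1 = place_segment(seg1_raw, anchor_positions["actor"])
# seg2 = place_segment(seg2_raw, seg1[-1], heading_deg=30.0)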


3. Trajectory Refinement and Execution Layer: This module is responsible for generating a feasible, dynamically consistent trajectory and converting it into a format that the robotic rig understands.

Trajectory Refinement and Execution Layer

Fig. 5 – Overview of modules in Trajectory Refinement and Execution Layer

Trajectory generation may result in a non-smooth trajectory with rapid changes in angle and speed. This module is responsible for generating a smooth trajectory based on predefined heuristics.
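The heuristics themselves are still to be defined; as a minimal sketch, assuming a moving-average filter over positions plus a cap on per-frame displacement, the smoothing step could look like this:

import numpy as np

def smooth_positions(positions, window=9, max_step=0.05):
    """Illustrative smoothing heuristic for an (N, 3) array of camera positions.

    1. Moving-average filter over the 3D positions to remove jitter.
    2. Clamp per-frame displacement to max_step (metres) to limit speed spikes.
    """
    kernel = np.ones(window) / window
    # Pad at both ends so the output keeps the input length (window must be odd).
    padded = np.pad(positions, ((window // 2, window // 2), (0, 0)), mode="edge")
    smoothed = np.stack(
        [np.convolve(padded[:, i], kernel, mode="valid") for i in range(3)], axis=1
    )
    # Enforce a maximum per-frame step length.
    out = [smoothed[0]]
    for p in smoothed[1:]:
        step = p - out[-1]
        dist = np.linalg.norm(step)
        if dist > max_step:
            step = step / dist * max_step
        out.append(out[-1] + step)
    return np.array(out)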

Inverse Kinematics

Fig. 6 – Control Names for Various Parts of the Camera Rig

Once we have a trajectory, we need to convert it to control commands that can be understood by the robotic rig.

The inverse kinematics module solves for the joint control values given the desired 3D camera position in the world coordinate system.

We assume a 7-DOF cinematic robot arm as shown in Figure 6.

Assuming the kinematics of a 7-DOF cinematic robotic arm, we solve for the values of the four control joints (Arm, Lift, Rotate, and Track) by solving the rig's forward-kinematics equations in a least-squares manner.
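The exact equations depend on the rig geometry in Figure 6; as a rough sketch of the least-squares solve, assuming a simplified, illustrative forward-kinematics model (the link lengths and joint conventions here are placeholders), a numerical solver can be run per waypoint:

import numpy as np
from scipy.optimize import least_squares

# Simplified, illustrative forward kinematics: "track" slides the base along x,
# "rotate" spins the column about z, "lift" sets height, and "arm" sets reach.
# The real mapping follows from the rig geometry shown in Figure 6.
def forward_kinematics(joints, arm_length=1.0):
    track, rotate, lift, arm = joints
    reach = arm * arm_length
    return np.array([track + reach * np.cos(rotate),
                     reach * np.sin(rotate),
                     lift])

def solve_ik(target_xyz, initial_guess=(0.0, 0.0, 0.5, 0.5)):
    """Solve for (Track, Rotate, Lift, Arm) reaching target_xyz in the world frame."""
    x0 = np.asarray(initial_guess, dtype=float)
    target = np.asarray(target_xyz, dtype=float)

    def residual(q):
        pos_err = forward_kinematics(q) - target
        reg = 1e-3 * (q - x0)  # prefer solutions close to the initial guess
        return np.concatenate([pos_err, reg])

    return least_squares(residual, x0=x0).x

# Example: joint values for one waypoint of the refined trajectory.
# joints = solve_ik([1.2, 0.4, 1.0])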

References

[1] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: Injecting the 3D world into large language models, 2023.

[2] Xinhang Liu, Yu-Wing Tai, and Chi-Keung Tang. ChatCam: Empowering camera control through conversational AI, 2024.

[3] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018.

[4] Mathis Petrovich, Michael J. Black, and Gül Varol. TEMOS: Generating diverse human motions from textual descriptions, 2022.