Anchor Detection

The anchor detection pipeline is responsible for generating 3D grounded points: the anchors that the camera must pass through. Because the available datasets lack 3D scene data, we adapted off-the-shelf models and built a pipeline around them.

Sensor Choice

One key consideration was how to obtain 3D information about the scene at hand. To get depth accuracy that learned monocular estimation models cannot reliably offer, we opted for an Intel RealSense D435, pictured below, which pairs a stereo depth module with an RGB sensor and captures color and depth information simultaneously.

Fig. 1 – Intel RealSense D435
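A minimal capture sketch using the pyrealsense2 SDK, assuming a connected D435, looks like the following; the stream resolution and frame rate are illustrative choices, not necessarily the ones used in our setup.

```python
import numpy as np
import pyrealsense2 as rs

# Configure depth + color streams (resolution/frame rate are illustrative).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
profile = pipeline.start(config)

# Align the depth frame to the color frame so pixels correspond 1:1.
align = rs.align(rs.stream.color)
frames = align.process(pipeline.wait_for_frames())
depth_frame = frames.get_depth_frame()
color_frame = frames.get_color_frame()

# Convert to numpy arrays; raw depth units become metres via the depth scale.
depth_scale = profile.get_device().first_depth_sensor().get_depth_scale()
depth_m = np.asanyarray(depth_frame.get_data()) * depth_scale
color_bgr = np.asanyarray(color_frame.get_data())

pipeline.stop()
```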

Pipeline Overview

The pipeline consists of 4 key steps:

  1. Object Detection – Find the objects in the scene that correspond to each anchor prompt.
  2. Object Centroid Calculation – Determine the 3D pose of each object.
  3. Cinematographic Constraint Generation – Figure out how the object should look in each frame.
  4. Anchor Position Calculation – Determine the camera position for each object; these positions are the anchors.
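At a high level, the pipeline can be read as the driver loop below. All function names are hypothetical placeholders for the four steps; the real implementations are described in the following sections.

```python
def compute_anchors(prompt, color_bgr, depth_m, intrinsics, pointcloud):
    """Hypothetical end-to-end driver mirroring the four pipeline steps."""
    # 1. Object detection: LLM extracts anchor prompts, YOLO-World finds boxes.
    anchor_prompts = extract_anchor_prompts(prompt)           # LLM call
    detections = detect_objects(color_bgr, anchor_prompts)    # YOLO-World

    # 2. Object centroid calculation: SAM2 mask + depth -> 3D point.
    centroids = [object_centroid(det, color_bgr, depth_m, intrinsics)
                 for det in detections]

    # 3. Cinematographic constraint generation (direction, coverage, offset).
    constraints = [generate_constraints(prompt, det) for det in detections]

    # 4. Anchor position calculation + collision check against the point cloud.
    return [anchor_position(c, k, pointcloud)
            for c, k in zip(centroids, constraints)]
```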

Object Detection

For the object detection step, we decided to forgo traditional object detection models like YOLOv11, as they are limited to the classes they were trained on. Instead, we used YOLO-World [1], an open-vocabulary model that accepts a natural-language prompt and detects everything in the scene that matches it.

For the prompt:

“Zoom in from the actor’s shoulder to his face, then dolly left to the doorway”

The LLM extracts the following anchor prompts:

[“actor’s shoulder”, “actor’s face”, “doorway”]

A second LLM call is used to expand this into a list of similar words:

[“door”, “doorway”, “person”, “face”, “actor”, “shoulder”, “actor’s face”]
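The expansion call might look like the sketch below. It assumes an OpenAI-style chat API and an illustrative model name; the exact model and prompt wording used in our pipeline are not reproduced here.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumption: an OpenAI-style chat API; any capable LLM works

def expand_anchor_prompts(anchor_prompts: list[str]) -> list[str]:
    """Second LLM call: expand anchor prompts into related class names."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Return only a JSON list of short object class names "
                        "that are synonyms or parts of the given phrases."},
            {"role": "user", "content": json.dumps(anchor_prompts)},
        ],
    )
    # Assumes the model complies and returns a bare JSON list.
    return json.loads(response.choices[0].message.content)

# Example from the walkthrough above:
# expand_anchor_prompts(["actor's shoulder", "actor's face", "doorway"])
# -> ["door", "doorway", "person", "face", "actor", "shoulder", "actor's face"]
```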

This expanded class list allows YOLO-World to detect a superset of the objects we need, reducing the chance of missing an anchor.

Fig. 2 – Object Detection Results

The YOLO-World detections are then matched to the initial anchors using another LLM call, resulting in the final bounding boxes shown in Figure 2.
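Using the Ultralytics implementation of YOLO-World, the detection call might look like this sketch; the checkpoint name, input image path, and confidence threshold are illustrative.

```python
import cv2
from ultralytics import YOLOWorld

# Aligned RealSense color image (a file path stands in so the sketch runs alone).
color_bgr = cv2.imread("frame.png")

# Load an open-vocabulary YOLO-World checkpoint (weights name is illustrative).
model = YOLOWorld("yolov8l-worldv2.pt")

# Restrict detection to the expanded class list from the second LLM call.
model.set_classes(["door", "doorway", "person", "face",
                   "actor", "shoulder", "actor's face"])

# Detect and keep boxes above a confidence threshold.
results = model.predict(color_bgr, conf=0.25)[0]
boxes = results.boxes.xyxy.cpu().numpy()                    # (N, 4) pixel boxes
labels = [results.names[int(c)] for c in results.boxes.cls]
```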

Object Centroids

Next, the object centroids have to be calculated. First, SAM2 [2] is used to obtain a segmentation mask for each detected object. Then, the depth values from the RealSense depth image are gathered over the pixels of that mask and averaged to obtain a Z value.
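Given a SAM2 mask aligned to the depth image, the averaging step reduces to a masked mean; the minimal numpy sketch below omits the SAM2 predictor call itself.

```python
import numpy as np

def mask_depth(depth_m: np.ndarray, mask: np.ndarray) -> float:
    """Average the depth values under one object's segmentation mask.

    depth_m: (H, W) depth image in metres, aligned to the color image.
    mask:    (H, W) boolean mask from SAM2 for the detected object.
    """
    # Ignore pixels where the sensor returned no depth (zero / invalid).
    values = depth_m[mask & (depth_m > 0)]
    if values.size == 0:
        raise ValueError("No valid depth readings under this mask")
    return float(values.mean())
```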

The X and Y pixel coordinates are simply taken from the center of the bounding box and then back-projected into 3D using the camera intrinsics and the object's Z value.
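With the averaged Z value and the bounding-box centre (u, v) in pixels, this back-projection follows the standard pinhole model with the color camera intrinsics (fx, fy, cx, cy); pyrealsense2 also provides rs2_deproject_pixel_to_point for the same computation. A minimal sketch of the explicit form:

```python
import numpy as np

def centroid_3d(u: float, v: float, z: float,
                fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project the bounding-box centre (u, v) at depth z (metres)
    into camera coordinates using the pinhole intrinsics."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```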

Cinematographic Constraints

Once the objects' 3D points have been determined, the camera position relative to each object has to be calculated. There are countless positions the camera could take; the choice depends on how we want the object to appear in the frame. Therefore, we prompt an LLM for three key cinematographic constraints:

  1. Direction – What direction should the camera look at the object from? The front, left, or right?
  2. Frame Coverage – How much of the frame should the object take up? The closer the camera, the higher the frame coverage.
  3. Offset – Should the object be centered in the frame, or off to a side?

Based on this triplet of constraints, the anchor positions can finally be determined, subject to a final collision-avoidance check against the point cloud from the Intel RealSense camera.
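The sketch below shows one plausible way to turn the (direction, coverage, offset) triplet into a camera position. The direction vectors, the assumed object width, the roughly 69° horizontal FOV of the D435 RGB camera, and the offset sign convention are all illustrative assumptions rather than our exact method, and the collision-avoidance check is omitted.

```python
import numpy as np

# Illustrative unit vectors pointing from the object toward the camera for
# each direction constraint (the pipeline's exact convention is assumed here).
DIRECTIONS = {
    "front": np.array([0.0, 0.0, 1.0]),
    "left":  np.array([-1.0, 0.0, 0.0]),
    "right": np.array([1.0, 0.0, 0.0]),
}
UP = np.array([0.0, 1.0, 0.0])

def anchor_position(centroid, direction, coverage, offset,
                    object_width=0.5, hfov_deg=69.0):
    """One plausible mapping from the constraint triplet to a camera position.

    centroid:     (3,) object centroid in metres.
    direction:    'front' | 'left' | 'right'               (constraint 1)
    coverage:     fraction of frame width the object fills (constraint 2)
    offset:       horizontal frame placement, -1 .. 1      (constraint 3)
    object_width: assumed physical width of the object, metres.
    hfov_deg:     horizontal field of view; ~69 deg for the D435 RGB camera.
    """
    half_fov = np.radians(hfov_deg) / 2.0
    # Higher coverage -> closer camera: the object spans `coverage` of the FOV.
    distance = object_width / (2.0 * coverage * np.tan(half_fov))

    towards_camera = DIRECTIONS[direction]
    # Lateral axis of the frame, perpendicular to the viewing axis and up.
    right = np.cross(UP, towards_camera)
    right /= np.linalg.norm(right)

    # Step back along the viewing direction, then slide sideways so the object
    # lands off-centre by `offset` of the half-frame width (sign is a convention).
    position = centroid + towards_camera * distance
    position -= right * offset * distance * np.tan(half_fov)
    return position
```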

Example

Using the pipeline described above, we can walk through an example.

Input Prompt: “Start facing the TV, then dolly right to the actor, then dolly back to the TV but not as close.”

Anchors: [TV, Actor, TV]

Object Detections: We built a GUI for our pipeline, shown below; it displays the detections of the TV on the wall and the back of the person's head, identified as the ‘actor’ in the scene.

Fig. 3 – Object Detection Example

Anchor Results: Following the rest of the pipeline, we obtain the final anchor positions, ordered red-green-blue, showing how the camera starts by looking at the TV, moves behind the actor's head, and then returns to the TV.

Fig. 4 – Final Anchor Sample

References

[1] Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. YOLO-World: Real-Time Open-Vocabulary Object Detection, 2024. arXiv. https://arxiv.org/abs/2401.17270

[2] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment Anything in Images and Videos, 2024. arXiv. https://arxiv.org/abs/2408.00714