Unlike most existing works, we associate information across cameras at the detection stage to obtain accurate detection results in the top-down view. Next, we track the detected points and project them back to each camera view. Finally, a single-view detector refines the projected bounding boxes to yield the final results.

Multi-View Detection

Based on recent research in multi-view detection and 3D detection [1] [2], we build the following architecture. The model runs a modified ResNet backbone on each view and then projects the extracted features into the top-down view using camera calibration data. Concatenating these projected features aggregates information across camera views. For the spatial aggregation module, there are two options: a large-kernel convolution and a deformable transformer. The latter allows a larger receptive field.
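The projection step can be illustrated with a minimal sketch. It assumes a pinhole camera described by a 3x4 projection matrix P and a ground plane at z = 0, so that the ground-to-image mapping reduces to a 3x3 homography. The function names and the grid layout are hypothetical and are not taken from our implementation.

```python
def ground_homography(P):
    """Drop the z column of a 3x4 projection matrix (valid for points with z = 0)."""
    return [[row[0], row[1], row[3]] for row in P]


def ground_to_pixel(H, x, y):
    """Map a ground-plane point (x, y) to image coordinates (u, v)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point projects to infinity")
    return u / w, v / w


def sampling_grid(H, rows, cols, cell_size):
    """For every top-down grid cell, return the image location of its centre.

    A feature map can then be sampled at these locations to build the
    top-down feature map for one camera (assumed layout: x along columns,
    y along rows, origin at the grid corner).
    """
    return [
        [ground_to_pixel(H, (c + 0.5) * cell_size, (r + 0.5) * cell_size)
         for c in range(cols)]
        for r in range(rows)
    ]
```

In practice the sampling locations would be computed once per camera from the calibration data and reused for every frame, since the cameras are static.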

A single-view detection head is also trained, and we use its bounding box regression loss as an additional loss term for the feature extractor.

Diagram from


Tracking

For simplicity, we adopt SORT [3] as the tracking algorithm. We will experiment with more advanced tracking algorithms and design methods to incorporate appearance features in the multi-view setting.
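To make the association step concrete, the sketch below matches existing tracks to new top-down detections. It is a deliberately simplified stand-in and not SORT itself: SORT predicts track states with a Kalman filter and solves the assignment with the Hungarian algorithm on bounding box IoU, whereas this sketch uses greedy nearest-neighbour matching on point distance. The function name and the `max_dist` gate are hypothetical.

```python
import math


def associate(tracks, detections, max_dist):
    """Greedily match tracks to detections by Euclidean distance.

    tracks: dict mapping track id -> (x, y) last known top-down position.
    detections: list of (x, y) top-down detections in the current frame.
    Returns (matches, unmatched), where matches maps track id -> detection
    index and unmatched lists detection indices that may start new tracks.
    """
    pairs = sorted(
        (math.dist(pos, det), tid, j)
        for tid, pos in tracks.items()
        for j, det in enumerate(detections)
    )
    matches, used_dets = {}, set()
    for dist, tid, j in pairs:
        if dist > max_dist:
            break  # all remaining pairs are even farther apart
        if tid in matches or j in used_dets:
            continue
        matches[tid] = j
        used_dets.add(j)
    unmatched = [j for j in range(len(detections)) if j not in used_dets]
    return matches, unmatched
```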

Projection and Refinement

We assume a fixed human body size, which allows us to project 3D detection results back to the camera views. The projected bounding boxes can have an inaccurate shape, but they can be refined by matching them with the outputs of a standard single-view detector. Another benefit of this refinement is that we can identify and remove false positives: a box that is not matched in any of the views is discarded.
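The projection and matching described above can be sketched as follows. This is an illustration under stated assumptions: a 3x4 pinhole projection matrix P, a person standing upright on the ground plane z = 0, and placeholder values for the body height, the box aspect ratio, and the IoU threshold. None of these names or values come from our implementation.

```python
def project_point(P, X):
    """Project a 3D world point X = (x, y, z) to pixel coordinates."""
    x, y, z = X
    u = P[0][0] * x + P[0][1] * y + P[0][2] * z + P[0][3]
    v = P[1][0] * x + P[1][1] * y + P[1][2] * z + P[1][3]
    w = P[2][0] * x + P[2][1] * y + P[2][2] * z + P[2][3]
    return u / w, v / w


def fixed_size_box(P, x, y, height=1.8, aspect=0.4):
    """Box (x1, y1, x2, y2) for a person of assumed height at ground point (x, y)."""
    foot_u, foot_v = project_point(P, (x, y, 0.0))
    head_u, head_v = project_point(P, (x, y, height))
    box_h = abs(foot_v - head_v)
    box_w = aspect * box_h  # width derived from an assumed aspect ratio
    centre_u = 0.5 * (foot_u + head_u)
    top = min(foot_v, head_v)
    return (centre_u - box_w / 2, top, centre_u + box_w / 2, top + box_h)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def refine(projected, detections, iou_thr=0.5):
    """Return the best-overlapping detector box, or None if nothing matches."""
    best = max(detections, key=lambda d: iou(projected, d), default=None)
    if best is None or iou(projected, best) < iou_thr:
        return None
    return best
```

A track whose projected box gets `None` from `refine` in every camera view would be a candidate false positive.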


Dataset and Evaluation

We train and evaluate our model on the MMPTRACK dataset [4], a large-scale video dataset for multi-camera multi-object tracking. The dataset contains about 5 hours of video for training and 1.5 hours for validation. Annotations include per-frame bounding boxes, the corresponding person IDs, and camera calibration data. The dataset challenges our tracking system with cluttered and crowded environments and with varying human poses and appearances.

For quantitative evaluation, we mainly use the MOTA and IDF1 scores. Following the rules of the MMPTRACK challenge, we set the matching threshold to IoU > 0.5 in the camera views and to a pixel distance < 25 in the top-down view.
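For reference, both scores can be computed from accumulated counts using their standard definitions. The sketch below assumes the counts (false negatives, false positives, identity switches, and identity-level true/false positives and false negatives) have already been accumulated over all frames by an evaluation tool; the function names are hypothetical.

```python
def mota(num_fn, num_fp, num_idsw, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - (num_fn + num_fp + num_idsw) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom > 0 else 0.0


def is_match_camera(iou_value):
    """Camera-view match rule stated above: IoU > 0.5."""
    return iou_value > 0.5


def is_match_topdown(pixel_distance):
    """Top-down match rule stated above: pixel distance < 25."""
    return pixel_distance < 25
```

Note that MOTA can be negative when the number of errors exceeds the number of ground-truth objects.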


[1] Hou, Yunzhong, Liang Zheng, and Stephen Gould. “Multiview detection with feature perspective transformation.” Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16. Springer International Publishing, 2020.

[2] Hou, Yunzhong, and Liang Zheng. “Multiview detection with shadow transformer (and view-coherent data augmentation).” Proceedings of the 29th ACM International Conference on Multimedia. 2021.

[3] Bewley, Alex, et al. “Simple online and realtime tracking.” 2016 IEEE international conference on image processing (ICIP). IEEE, 2016.

[4] Han, Xiaotian, et al. “MMPTRACK: Large-scale densely annotated multi-camera multiple people tracking benchmark.” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023.