Datasets
Some of the datasets we experimented with are as follows:
(I) 2D Keypoints datasets
- COCO keypoint detection[1]
- MPII Human Pose dataset [2]
(II) 3D human mesh datasets
- Human3.6M [3]
- 3D Poses in the Wild [4]
- AGORA [5]
- Relative Human [6]
However, since these datasets aren’t relevant to our target domain, we worked on creating our own Synthetic dataset using GTA 5 similar to the JTA [7] dataset video below:
Creating our Synthetic dataset gives us a lot of control, in a lot of aspects like lighting conditions (day or night), camera angles, camera location, number of pedestrians and vehicles, etc. Once we have trained our pose estimation model on the synthetic dataset, we will apply domain adaptation techniques to transfer knowledge to estimate pose in real-world traffic intersections.
Sample video from the dataset we are creating:It took us about 15 minutes to produce 1 video in a single view
However, the challenges with synthetic datasets are
- It is very time-consuming to build a dataset. For example, it took 15 minutes to obtain a 1-minute single-view labeled video sequence.
- Getting data from multiple views requires adding extensive mods to the game.
- Domain transfer to real-world scenarios would be a challenge.
Metrics
To evaluate our pose estimation model, we use the following metrics:
(a) Mean Per Joint Position Error (MPJPE): Mean of per joint position error for all k joints (Typically, k = 16) Calculated after aligning the root joints (typically the pelvis) of the estimated and ground truth 3D pose.
(b) Percentage of Correct Keypoints (PCK): PCK is used as an accuracy metric that measures if the predicted keypoint and the true joint are within a certain distance threshold. The PCK is usually set with respect to the scale of the subject, which is enclosed within the bounding box.
(c) Per Vertex Error (PVE): It is possible to use MPJPE to evaluate all the 6890 predicted vertices of the SMPL model. In this case, the ground truth SMPL parameters need to be provided and the only dataset available for this purpose is the 3DPW. This type of metric is called Per Vertex Error (PVE).
(d) Percentage of Correct Parts (PCP): PCP is used to measure the correct detection of limbs. If the distance between the two predicted joint locations and the true limb joint locations is almost less than half of the limb length then the limb is considered detected. However, sometimes it penalizes shorter limbs, for example, a lower arm.
References
[1] Lin, Tsung-Yi, et al. “Microsoft coco: Common objects in context.” ECCV 2014.
[2] Andriluka et al. “2D Human Pose Estimation: New Benchmark and State of the Art Analysis” CVPR 2014
[3] Ionescu, Catalin, et al. “Human3. 6m: Large scale datasets and predictive methods for 3d human sensing in natural environments.” PAMI 2013.
[4] von Marcard, Timo, et al. “Recovering accurate 3d human pose in the wild using imus and a moving camera.” ECCV 2018.
[5] Patel, Priyanka, et al. “AGORA: Avatars in geography optimized for regression analysis.” CVPR 2021.
[6] Sun, Yu, et al. “Putting People in their Place: Monocular Regression of 3D People in Depth.” CVPR 2022.
[7] Fabbri et al. “Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World” ECCV 2018