MIoU: 71.5

Top 3 per-part IoU:
- Hips: 90.94
- L. thigh: 88.98
- R. thigh: 88.81

Worst 3 per-part IoU:
- L. fingers: 22.33
- R. fingers: 22.77
- R. wrist: 48.01

As the sampled images show, a keypoint segmentation consistency acquisition function mostly selects actions involving occlusions, which is expected. However, a key limitation of keypoint-based strategies is that segmentations are very coarse in early stages, when little data is available, so keypoints can be off even for simple poses. Conversely, in later stages most mistakes are fine-grained, and keypoints do not provide enough signal to catch them.
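As an illustration, the keypoint-based score could be computed along these lines. This is a minimal sketch with hypothetical names: it assumes each predicted keypoint has a known corresponding part label, and that a higher disagreement fraction marks a more informative sample to acquire.

```python
import numpy as np

def keypoint_segmentation_consistency(keypoints, part_ids, seg_mask):
    """Score how often predicted keypoints disagree with the predicted
    part segmentation (hypothetical sketch, not the exact method used).

    keypoints: (K, 2) array of (row, col) pixel coordinates.
    part_ids:  (K,) array mapping each keypoint to its expected part label.
    seg_mask:  (H, W) integer array of predicted part labels.

    Returns the fraction of keypoints that land outside their part.
    """
    h, w = seg_mask.shape
    disagreements = 0
    for (r, c), part in zip(keypoints.astype(int), part_ids):
        # Keypoints projected outside the image also count as disagreements.
        if not (0 <= r < h and 0 <= c < w) or seg_mask[r, c] != part:
            disagreements += 1
    return disagreements / len(keypoints)
```

Under this sketch, images would be ranked by descending score and the top-scoring ones sent for annotation.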

We also experimented with an n-1 holdout multi-view consistency acquisition function, in which we reproject the segmentations from the remaining n-1 views onto the current view and measure pixel-wise disagreement. In the figure below, our metric ranks four examples from most complex (top) to least complex (bottom). Although this metric surfaces visually complex actions on inspection, it is time-consuming to compute. Our next steps include building a more efficient version of multi-view consistency and incorporating lessons from the ViewAL paper into our baselines and approach.
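The n-1 holdout score could be sketched as follows. This is an assumption-laden illustration: it takes the other views' segmentations as already warped into the current view's image plane (the actual reprojection would need camera intrinsics, extrinsics, and depth, which is where the cost comes from), with a sentinel label marking pixels that have no valid reprojection.

```python
import numpy as np

def multiview_disagreement(current_mask, reprojected_masks, invalid=-1):
    """n-1 holdout consistency score for one view (hypothetical sketch).

    current_mask:      (H, W) integer part labels predicted in this view.
    reprojected_masks: list of (H, W) arrays, each another view's predicted
                       labels warped into this view; `invalid` marks pixels
                       with no valid reprojection (e.g. occluded/out of view).

    Returns the mean fraction of valid pixels where a reprojected mask
    disagrees with the current view's prediction.
    """
    scores = []
    for warped in reprojected_masks:
        valid = warped != invalid
        if not valid.any():
            continue  # this holdout view sees none of the current pixels
        scores.append(np.mean(warped[valid] != current_mask[valid]))
    return float(np.mean(scores)) if scores else 0.0
```

Samples would then be ranked by this score, with high disagreement across views flagging the visually complex actions.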