Stage 1:

We start with an initial set of labeled multi-view data: 3000 images (300 actions, ten views each), sampled at random. This corresponds to roughly 5% of the total training data. We fix this set across all future experiments for consistency.
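The sampling above can be sketched as follows. This is a minimal illustration, not the actual pipeline code; `all_action_ids` and `views_for_action` are assumed placeholders for however actions and their views are indexed.

```python
import random

def sample_seed_set(all_action_ids, views_for_action, n_actions=300, seed=0):
    """Randomly pick actions and keep all of their views, so multi-view
    groups stay intact (300 actions x 10 views = 3000 images)."""
    rng = random.Random(seed)  # fixed seed so the set is reproducible
    chosen = rng.sample(sorted(all_action_ids), n_actions)
    return [img for a in chosen for img in views_for_action(a)]
```

Sampling at the action level (rather than the image level) is what keeps every view of a selected action in the labeled set, which the later multi-view consistency check depends on.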

Stage 2:

In this stage, we train a body part segmentation model, adapting the architecture of CDGNet, the current state of the art for this task. We replace the ResNet101 backbone with a ResNet50 backbone to reduce training and inference time, since active learning requires multiple training iterations. For the same reason, we downsample the images to a resolution of 360×640. We do not use multi-view supervision during training; every image is passed through the model independently.

Stage 3:

We save the trained model from the stage above and run inference on the multi-view data. We then score the predictions for multi-view consistency and pose diversity according to acquisition strategies (currently a work in progress), and use these scores to sample hard examples for the next round of active learning.
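Since the acquisition strategies are still in progress, the following is only one hypothetical instantiation of the multi-view consistency idea: score each action by how much its per-view predicted part-label maps disagree, and treat high-disagreement actions as hard. The function assumes the per-view predictions have already been made comparable (e.g. projected into a common reference view).

```python
import numpy as np

def multiview_disagreement(preds, num_classes):
    """preds: list of (H, W) integer label maps, one per view.
    Returns the mean per-pixel fraction of views that disagree with the
    majority label: 0.0 means all views agree everywhere."""
    stack = np.stack(preds)                              # (V, H, W)
    onehot = np.eye(num_classes, dtype=np.int64)[stack]  # (V, H, W, C)
    counts = onehot.sum(axis=0)                          # (H, W, C) votes
    majority = counts.max(axis=-1)                       # (H, W)
    v = stack.shape[0]
    return float(np.mean(1.0 - majority / v))
```

Ranking actions by this score in descending order would then give the hard examples to label in the next iteration.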

Stage 4:

To exploit the model’s notion of easy and hard, we plan to give the hard examples higher loss weights in the next training iteration and to use the easy examples as pseudo labels. This is motivated by impressive results from Link