We propose adding an auxiliary loss to the online fine-tuning process. The loss encourages the tracking network (e.g. MDNet) to capture features relevant to the target rather than to the background. The intuition is that, as the target moves through the scene, background-relevant features visible in previous frames may disappear in new frames, so filters trained to capture them contribute nothing to inference, whereas filters trained to capture target-relevant features can still provide critical cues for target-background classification.
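The exact form of the auxiliary loss is not spelled out here (see the slides referenced below); the snippet that follows is only a minimal sketch of one plausible instantiation, assuming MDNet-style online fine-tuning with positive (target) and negative (background) candidate patches. It adds a penalty on feature activations produced by background samples on top of the usual classification loss, so filters are discouraged from encoding background content. All names (`feat_extractor`, `classifier`, `lambda_bg`, `finetune_step`) are hypothetical.

```python
# Sketch: MDNet-style online fine-tuning step with an auxiliary
# background-suppression term (one possible reading of the proposed loss).
import torch
import torch.nn.functional as F

def finetune_step(feat_extractor, classifier, optimizer,
                  pos_patches, neg_patches, lambda_bg=0.1):
    """One fine-tuning step.

    pos_patches: target (positive) candidate patches, shape (Np, C, H, W)
    neg_patches: background (negative) patches,       shape (Nn, C, H, W)
    """
    pos_feat = feat_extractor(pos_patches)   # (Np, D) shared features
    neg_feat = feat_extractor(neg_patches)   # (Nn, D)

    scores = classifier(torch.cat([pos_feat, neg_feat], dim=0))  # (Np+Nn, 2)
    labels = torch.cat([
        torch.ones(pos_feat.size(0), dtype=torch.long),
        torch.zeros(neg_feat.size(0), dtype=torch.long),
    ])

    # Standard target-vs-background classification loss (as in MDNet).
    cls_loss = F.cross_entropy(scores, labels)

    # Auxiliary term (assumed form): suppress feature energy on background
    # samples so the learned filters favour target-relevant patterns.
    bg_loss = neg_feat.pow(2).mean()

    loss = cls_loss + lambda_bg * bg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, `lambda_bg` trades off the classification objective against how strongly background-driven activations are suppressed; the actual loss used in the project may differ.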
Applying this method to MDNet, we observed an improvement in accuracy (0.5319 -> 0.5388; higher is better) and in robustness (31.4510 -> 27.5917; lower is better) on the VOT dataset. This matches our expectation: the method only marginally improves prediction overlap with the ground truth, but it makes the base method noticeably more robust to background changes.
A more detailed description can be found in our final presentation slides.