Results

The tables below compare accuracy on the balanced test set across different methods.

Table 1. Comparison of class-balanced and bias-conflicting accuracies of different baseline methods on standard benchmarks.
Table 2. Comparison of class-balanced and bias-conflicting accuracies of different baseline methods on challenging benchmarks.
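For readers who want to reproduce the two metrics named in the table captions, the following is a minimal sketch, not the authors' evaluation code. It assumes that class-balanced accuracy is the mean of per-class accuracies, and that a sample is bias-conflicting when its bias attribute differs from its target label (both encoded on the same label set). The function name `debiasing_accuracies` and its arguments are hypothetical.

```python
from collections import defaultdict


def debiasing_accuracies(preds, targets, biases):
    """Return (class_balanced_acc, bias_conflicting_acc).

    Assumptions (not taken from the source):
      * class-balanced accuracy = mean of per-class accuracies;
      * a sample is bias-conflicting when bias label != target label.
    """
    class_hits = defaultdict(list)  # target class -> list of 0/1 hits
    conflict_hits = []              # hits on bias-conflicting samples only

    for pred, target, bias in zip(preds, targets, biases):
        hit = 1.0 if pred == target else 0.0
        class_hits[target].append(hit)
        if target != bias:
            conflict_hits.append(hit)

    per_class = [sum(h) / len(h) for h in class_hits.values()]
    class_balanced = sum(per_class) / len(per_class)

    # NaN signals that the split contains no bias-conflicting samples.
    if conflict_hits:
        bias_conflicting = sum(conflict_hits) / len(conflict_hits)
    else:
        bias_conflicting = float("nan")

    return class_balanced, bias_conflicting


if __name__ == "__main__":
    # A toy model that predicts the bias attribute instead of the target.
    targets = [0, 0, 1, 1, 1, 0]
    biases = [0, 0, 1, 1, 0, 1]
    preds = list(biases)
    print(debiasing_accuracies(preds, targets, biases))
```

In the toy example, a model that simply follows the bias attribute is correct on every bias-aligned sample and wrong on every bias-conflicting one, which is why the two metrics are reported separately.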

Visualizations

Figure 1. Illustration of the NMS debiasing process. The top row shows the original images, while the bottom row shows the context-swapped images generated using the NMS strategy. NMS effectively swaps contextual attributes while preserving the target attribute, facilitating the removal of spurious correlations.

Ablation Studies

Figure 2. Effect of target subspace dimension k on debiasing performance across three datasets.

The ablation study examines how the choice of k (the target subspace dimension) affects performance. The results are presented in Figure 2, where each column corresponds to a dataset: CelebA-G&B-0%, CelebA-H&E-0%, and Dogs&Cats-Fur-0%. The blue line shows performance on class-balanced samples, while the red line shows performance on bias-conflicting samples. The yellow bars in each subplot show the distribution of singular values of the Pairwise Difference Matrix.

The figure shows that the choice of k significantly affects debiasing performance. For bias-conflicting samples, the success of debiasing, measured by bias-conflicting accuracy, gradually declines as k increases. This occurs because a larger k retains more nuisance attributes in the target subspace, which NMS then fails to marginalize. For class-balanced samples, however, performance does not always decrease as k increases. In CelebA-G&B, accuracy steadily declines with increasing k, whereas in CelebA-H&E it gradually improves. In Dogs&Cats-Fur, accuracy initially rises before dropping. This suggests that although increasing k can introduce more bias, the added bias may in some cases improve overall accuracy by boosting performance on bias-aligned samples, depending on the dataset’s characteristics.
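The singular-value distribution plotted in Figure 2 can be inspected with a few lines of NumPy. The sketch below is illustrative only. It assumes the Pairwise Difference Matrix stacks the differences between feature vectors of paired samples, one pair per row, which may differ from the construction used in this work. The cumulative-energy threshold is a hypothetical heuristic for shortlisting values of k, not the selection rule used in the ablation.

```python
import numpy as np


def singular_value_spectrum(features_a, features_b):
    """Singular values of a pairwise difference matrix and their cumulative energy.

    Assumption: row i of the matrix is features_a[i] - features_b[i].
    The inputs must contain at least one non-zero difference.
    """
    diffs = np.asarray(features_a, dtype=float) - np.asarray(features_b, dtype=float)
    # With compute_uv=False, np.linalg.svd returns only the singular values,
    # sorted in descending order.
    singular_values = np.linalg.svd(diffs, compute_uv=False)
    energy = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    return singular_values, energy


def smallest_k_for_energy(energy, threshold=0.9):
    """Smallest k whose leading singular values reach `threshold` of the energy.

    Hypothetical heuristic for choosing candidate k values to sweep.
    """
    k = int(np.searchsorted(energy, threshold)) + 1
    return min(k, len(energy))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats_a = rng.normal(size=(200, 16))  # stand-in features, not real data
    feats_b = rng.normal(size=(200, 16))
    s, energy = singular_value_spectrum(feats_a, feats_b)
    print("candidate k:", smallest_k_for_energy(energy, threshold=0.9))
```

A sharply decaying spectrum suggests that a small k already captures most of the variation between pairs, which is in line with the observation above that larger k mainly adds nuisance directions.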