
Our method begins by defining the target attribute through the construction of two sets of diverse prompts, each describing the attribute in a different context. These prompts are then encoded with the CLIP text encoder to obtain the corresponding text embeddings, as illustrated in Figure 2.
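A minimal sketch of this step. The prompt sets and the `encode_text_stub` function below are hypothetical: real pipelines would call a pretrained CLIP text encoder, which is stubbed here with deterministic pseudo-random unit vectors so the example stays self-contained.

```python
import zlib

import numpy as np

# Hypothetical prompt sets describing an attribute (e.g. "age") in
# varying contexts; the actual prompts are not specified in the text.
PROMPTS_A = ["a photo of a young person", "a portrait of a young adult"]
PROMPTS_B = ["a photo of an old person", "a portrait of an elderly adult"]

def encode_text_stub(prompts, dim=512):
    """Stand-in for the CLIP text encoder: maps each prompt to a
    deterministic pseudo-random unit vector of dimension `dim`."""
    embs = []
    for p in prompts:
        # Seed from a stable checksum so the same prompt always maps
        # to the same vector across runs.
        rng = np.random.default_rng(zlib.crc32(p.encode()))
        v = rng.standard_normal(dim)
        embs.append(v / np.linalg.norm(v))
    return np.stack(embs)
```

Each set is encoded independently, yielding two embedding matrices of shape `(num_prompts, dim)` that the later subspace construction consumes.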

Leveraging the discovered subspaces, we decompose each image embedding into target and nuisance components, as shown in Figure 3.
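This decomposition can be sketched as an orthogonal projection, assuming the target subspace is given as an orthonormal basis matrix `U_target` (a plausible reading, not a detail the text states):

```python
import numpy as np

def decompose(embedding, U_target):
    """Split an image embedding (d,) into a target component (its
    projection onto the subspace spanned by the columns of U_target,
    shape (d, k)) and a nuisance component (the orthogonal residual)."""
    z_target = U_target @ (U_target.T @ embedding)
    z_nuisance = embedding - z_target
    return z_target, z_nuisance
```

When `U_target` is orthonormal, the two components are orthogonal and sum back to the original embedding.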

To construct an unbiased embedding set for an image dataset, we uniformly sample embeddings within each class. This effectively marginalizes the nuisance attributes by decoupling the selection of target and nuisance attributes, ensuring that context information is sampled independently of the target variable. We then construct a pairwise difference matrix between the two embedding sets and apply Singular Value Decomposition to it; the top-k singular vectors span a subspace useful for classification, while its orthogonal complement forms the bias subspace.
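The subspace construction can be sketched as follows. The function name and the choice to read difference directions off the right singular vectors are assumptions for illustration; the inputs are the two prompt-embedding matrices from the earlier encoding step.

```python
import numpy as np

def attribute_subspaces(E_a, E_b, k):
    """Given two prompt-embedding sets of shapes (n_a, d) and (n_b, d),
    form all pairwise differences, decompose them with SVD, and return
    an orthonormal basis (d, k) for the top-k target subspace together
    with its orthogonal complement (d, d-k), the bias subspace."""
    d = E_a.shape[1]
    # Pairwise difference matrix: one row per (a, b) prompt pair.
    D = (E_a[:, None, :] - E_b[None, :, :]).reshape(-1, d)
    # Right singular vectors give the principal difference directions.
    _, _, Vt = np.linalg.svd(D, full_matrices=True)
    U_target = Vt[:k].T   # top-k directions: classification subspace
    U_bias = Vt[k:].T     # orthogonal complement: bias subspace
    return U_target, U_bias
```

The two returned bases are mutually orthogonal by construction, so projecting onto `U_target` discards exactly the directions captured by the bias subspace.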
