Quantifying Spuriousness of Biased Datasets Using
Partial Information Decomposition
Abstract
Spurious patterns refer to a mathematical association between two or more variables in a dataset that are not causally related. However, this notion of spuriousness, which is usually introduced due to sampling biases in the dataset, has classically lacked a formal definition. To address this gap, this work presents the first information-theoretic formalization of spuriousness in a dataset (given a split of spurious and core features) using a mathematical framework called Partial Information Decomposition (PID). Specifically, we disentangle the joint information content that the spurious and core features share about another target variable (e.g., the prediction label) into distinct components, namely unique, redundant, and synergistic information. We propose the use of unique information, with roots in Blackwell Sufficiency, as a novel metric to formally quantify dataset spuriousness and derive its desirable properties. We empirically demonstrate how higher unique information in the spurious features in a dataset could lead a model into choosing the spurious features over the core features for inference, often having low worst-group-accuracy. We also propose a novel autoencoder-based estimator for computing unique information that is able to handle high-dimensional image data. Finally, we also show how this unique information in the spurious feature is reduced across several dataset-based spurious-pattern-mitigation techniques such as data reweighting and varying levels of background mixing, demonstrating a novel tradeoff between unique information (spuriousness) and worst-group-accuracy.
Accepted at ICML 2024 Workshop on Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models.
Correspondence to: Barproda Halder [email protected].
1 University of Maryland College Park 2 Google Research 3 Princeton University
1 Introduction
Spurious patterns (Haig, 2003) arise when two or more variables are correlated in a dataset even though they do not have any causal relationship. For example, in the Waterbird dataset (Wah et al., 2011), most waterbirds have water backgrounds, and landbirds have land backgrounds (see Fig. 1). This correlation in the dataset essentially misleads a machine learning model into creating a spurious link between background and bird type, since it often finds the background to be “more informative” than the foreground for predicting the bird type. Learning such spurious links from the data may result in high performance on the training and in-distribution datasets, but results in reduced performance on out-of-distribution datasets and affects worst-group-accuracy (Lynch et al., 2023; Sagawa et al., 2019), i.e., the accuracy on the minority groups like waterbirds with land background or vice versa.
Several existing works (Kirichenko et al., 2022; Izmailov et al., 2022; Wu et al., 2023; Ye et al., 2023; Liu et al., 2023) focus on different dataset-based and model-training-based approaches to mitigate spurious patterns and evaluate the empirical performance over out-of-distribution datasets (or, to improve worst-group-accuracy). However, this notion of spuriousness in any given dataset lacks a formal definition. This work addresses this gap by asking the question: Given a split between core and spurious features, how do we formally quantify the spuriousness in any given dataset?
To answer this question, we present an information-theoretic formalization of spurious patterns, by leveraging a body of work in information theory called Partial Information Decomposition (PID) (Bertschinger et al., 2014; Banerjee et al., 2018). We note that classical information-theoretic measures such as mutual information (Cover & Thomas, 2012) capture the entire statistical dependency between two random variables but fail to capture how this dependency is distributed among those variables, i.e., the structure of the multivariate information. PID addresses this issue by providing a formal way of disentangling the joint information content between the core and spurious features into non-negative terms, namely, unique, redundant, and synergistic information (see Equation (1) in Section 2.1).
![Refer to caption](extracted/5699032/data_exam.png)
Our proposition is to use the unique information about the target variable $Y$ that is in the spurious features $B$ but not in the core features $F$, denoted $\mathrm{Uni}(Y{:}B|F)$, as a measure of spuriousness in the dataset. To justify our proposition, we discuss how unique information is connected to Blackwell Sufficiency (Blackwell, 1953), a notable concept in statistical decision theory. Blackwell Sufficiency provides a partial ordering on when one random variable can be more “informative” (less noisy) than another for inference. Unique information captures the departure from Blackwell Sufficiency: it goes to zero if and only if one random variable is Blackwell Sufficient over another for a prediction task (see Theorem 1). Thus, unique information intuitively quantifies when one variable can be more informative than another, which we leverage to explain when the spurious feature can be more informative than the core feature for the model prediction. Additionally, we show several desirable properties of unique information as a measure of spuriousness in the dataset in Theorem 2. Though Partial Information Decomposition (PID) has recently been applied to a few other areas in machine learning (Tax et al., 2017; Dutta et al., 2020, 2021; Hamman & Dutta, 2024a; Liang et al., 2023; Dutta & Hamman, 2023) (also see Related Works), we are pioneering its use to decompose information in spurious and core features and quantify spuriousness, supported by desirable properties and empirical validation. Our main contributions can be concisely listed as follows:
- Novel information-theoretic formalization to explain spurious patterns: Though many works attempt to prevent a model from learning spurious patterns, there is a lack of theoretical understanding of the “amount” of spuriousness in a dataset, and of how to quantify and measure it given a split of spurious and core features. Novel to this work, we investigate spuriousness through the lens of partial information decomposition (PID) and provide a fundamental understanding of when a model finds the spurious features to be “more informative” than the core features. We leverage PID to disentangle the joint information content between the core and spurious features into unique, redundant, and synergistic information.
- Demystifying unique information as a measure of spuriousness: Next, we propose the unique information in the spurious features, $\mathrm{Uni}(Y{:}B|F)$, as a measure of the spuriousness in a dataset. To justify our proposition, we first establish how unique information quantifies the informativeness of the spurious features $B$ compared to the core features $F$ for predicting the target $Y$ (see Theorem 1 for a motivation from Blackwell Sufficiency). Depending on whether $\mathrm{Uni}(Y{:}B|F)$ increases or decreases, one can then anticipate to what extent a model is going to leverage $B$ over $F$ for prediction. Additionally, we show several desirable properties of unique information as a measure of spuriousness in Theorem 2. Our measure can identify which features are more likely to be predictive for a classification task, paving a pathway for dataset quality assessment and explaining feature-based informativeness.
- Spuriousness Disentangler: An autoencoder-based estimator for computing unique information: We propose a novel autoencoder-based framework, which we call Spuriousness Disentangler, to compute the PID values for high-dimensional image data. The estimator consists of three main parts: (i) First, an autoencoder reduces the dimension of the image data and gives a one-dimensional array of clusters, which serves as a lower-dimensional, discrete feature representation of the image data. Along the lines of Guo et al. (2017), the dimensionality reduction and clustering are performed together by minimizing a joint loss function; (ii) Next, the joint probability distribution of this lower-dimensional representation is computed; and (iii) Finally, the partial information decomposition (PID) values are calculated by solving a convex optimization problem using the Discrete Information Theory (DIT) package (James et al., 2018).
- Experimental Results and Novel Tradeoff: Our experimental results are in agreement with our theoretical postulations, demonstrating an empirical tradeoff between our proposed measure of spuriousness, i.e., $\mathrm{Uni}(Y{:}B|F)$, and empirical evaluation metrics known to be affected by spurious patterns, i.e., worst-group-accuracy. We show that for real-world unbalanced datasets, e.g., the Waterbirds dataset (Wah et al., 2011), the unique information in the spurious feature is the most prominent and is significantly higher than any information in the core features. This helps explain why a model trained on this dataset readily uses the spurious feature rather than the core feature for prediction. Additionally, when a dataset-based spurious-correlation-mitigation method such as data reweighting is applied, the unique information in the spurious features reduces drastically (again explaining why a model might now be more likely to use the core feature $F$). We also observe a novel tradeoff between unique information (the proposed measure of spuriousness) and worst-group-accuracy for varying degrees of background mixing (a form of noise), i.e., the worst-group-accuracy improves as the unique information in the spurious features decreases. We also study Grad-CAM (Selvaraju et al., 2017) visualizations (a technique to generate “visual explanations” for decisions made by Convolutional Neural Network (CNN)-based models) for many of the trained models to further confirm when the core or spurious feature is actually being emphasized by the model in different experimental setups.
Related Works: There are several perspectives on spurious correlation (see Haig (2003); Kirichenko et al. (2022); Izmailov et al. (2022); Wu et al. (2023); Ye et al. (2023); Liu et al. (2023); Stromberg et al. (2024); Singla & Feizi (2021); Moayeri et al. (2023) and the references therein; also see surveys (Ye et al., 2024; Srivastava, 2023; Ghouse et al., 2024)). Spuriousness mitigation techniques are broadly divided into two groups: (i) Dataset-based techniques (Kirichenko et al., 2022; Wu et al., 2023) and (ii) Learning-based techniques (Liu et al., 2023; Yang et al., 2023; Ye et al., 2023). Kirichenko et al. (2022) show that last-layer fine-tuning of a pre-trained model with a group-balanced subset of data is sufficient to mitigate spurious correlation. Wu et al. (2023) propose a concept-aware spurious correlation mitigation technique. Ye et al. (2023) introduce a Freeze and Train approach that learns salient features in an unsupervised way and freezes them before training the rest of the features via supervised learning. Yang et al. (2023) explore the effect of different regularization techniques on spurious correlation, and Liu et al. (2023) examine a logit correction loss. Our novelty lies in formalizing the spuriousness of datasets using the PID framework, and explaining how effective a dataset-based spurious-correlation mitigation will be for regular model training.
Partial information decomposition (PID) (Williams & Beer, 2010; Bertschinger et al., 2014; Dutta et al., 2021; Venkatesh & Schamberg, 2022) is an active area of research. PID measures are beginning to be used in different domains of neuroscience and machine learning (Tax et al., 2017; Dutta et al., 2020, 2021; Hamman & Dutta, 2024a; Ehrlich et al., 2022; Liang et al., 2024; Wollstadt et al., 2023; Mohamadi et al., 2023; Venkatesh et al., 2024; Hamman & Dutta, 2024b). However, examining spurious correlation through the lens of PID and observing empirical tradeoffs between the spurious pattern and worst-group-accuracy is unexplored. Additionally, there is limited work on calculating PID values for high-dimensional multivariate continuous data. Some existing works (Dutta et al., 2021; Venkatesh & Schamberg, 2022; Venkatesh et al., 2024) handle continuous data with Gaussian assumptions, while Pakman et al. (2021) consider the one-dimensional multivariate case. Hence, estimating PID for high-dimensional data by proper dimensionality reduction and discretization is unexplored.
For dimensionality reduction, different learning based methods exist (Hotelling, 1933; Law & Jain, 2006; Lee & Verleysen, 2005; Wang et al., 2015, 2014). Similarly, for discretization, different clustering algorithms exist, e.g., k-means clustering (MacQueen et al., 1967; Bradley et al., 2000), deep embedded clustering (Xie et al., 2016). Along the lines of an autoencoder-based clustering setup in (Guo et al., 2017), our proposed Spuriousness Disentangler trains a network to jointly learn a good representation of the input image data in a self-supervised way ensuring low representation error while also clustering simultaneously to deal with the challenge of high dimensional and continuous image data.
2 Preliminaries and Background
Let $X = (X_1, X_2, \ldots, X_d)$ be the random variable denoting the input (e.g., an image), where each $X_i \in \mathcal{X}$, which denotes a finite set of values. The core features (e.g., the foreground) will be denoted by $F$, and the spurious features (e.g., the background) will be denoted by $B$. We typically use the notation $\mathcal{B}$ and $\mathcal{F}$ to denote the range of values for the spurious and core features. Let $Y$ denote the target random variable, e.g., the true labels which lie in the set $\mathcal{Y}$, and the model predictions are given by $\hat{Y} = f_\theta(X)$ (parameterized by $\theta$). Generally, we use the notation $P_A$ to denote the distribution of a random variable $A$, and $P_{A|C}$ to denote the conditional distribution of $A$ conditioned on $C$. Depending on the context, we also use more than one random variable as subscript, e.g., $P_{ACY}$ denotes the joint distribution of $(A, C, Y)$. Whenever necessary, we also use the notation $Q_A$ to denote an alternate distribution on the random variable $A$ that is different from $P_A$. We also use the notation $P_{C|A} \circ P_{A|Y}$ to denote a composition of two conditional distributions given by: $(P_{C|A} \circ P_{A|Y})(c|y) = \sum_{a \in \mathcal{A}} P_{C|A}(c|a)\, P_{A|Y}(a|y)$, where $\mathcal{C}$, $\mathcal{A}$, and $\mathcal{Y}$ denote the range of values that can be taken by random variables $C$, $A$, and $Y$.
2.1 Background on Partial Information Decomposition
We provide a brief background on PID that is relevant for the rest of the paper. The classical information-theoretic quantification of the total information that two random variables $B$ and $F$ together hold about $Y$ is given by the mutual information $I(Y; B, F)$ (see (Cover & Thomas, 2012) for a background on mutual information). Mutual information is defined as the KL divergence (Cover & Thomas, 2012) between the joint distribution $P_{Y,(B,F)}$ and the product of the marginal distributions $P_Y \otimes P_{(B,F)}$, and goes to zero if and only if $(B, F)$ is independent of $Y$. Intuitively, this mutual information captures the total predictive power about $Y$ that is present jointly in $(B, F)$, i.e., how well one can learn $Y$ from $(B, F)$ together. However, $I(Y; B, F)$ only captures the total information content about $Y$ jointly in $(B, F)$ and does not reveal what is unique and what is shared between $B$ and $F$.
![Refer to caption](extracted/5699032/PID_3.png)
PID (Bertschinger et al., 2014; Banerjee et al., 2018) provides a mathematical framework that decomposes the total information content $I(Y; B, F)$ into four nonnegative terms (also see Fig. 2):
$$I(Y; B, F) = \mathrm{Uni}(Y{:}B|F) + \mathrm{Uni}(Y{:}F|B) + \mathrm{Red}(Y{:}B,F) + \mathrm{Syn}(Y{:}B,F) \tag{1}$$
Here, $\mathrm{Uni}(Y{:}B|F)$ denotes the unique information about $Y$ that is only in $B$ but not in $F$ (and $\mathrm{Uni}(Y{:}F|B)$ the reverse). Next, $\mathrm{Red}(Y{:}B,F)$ denotes the redundant information (common knowledge) about $Y$ in both $B$ and $F$. Lastly, $\mathrm{Syn}(Y{:}B,F)$ denotes the synergistic information that is present only jointly in $(B, F)$ but not in either of them individually, e.g., a public and private key can jointly reveal information not in any one of them alone.
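Motivational Example. Let $Y = (Y_1, Y_2, Y_3)$ with each $Y_i \sim$ i.i.d. Bern(1/2). Let $B = (Y_1, Y_2, Y_3 \oplus N)$, $F = (Y_2, N)$, and $N \sim$ Bern(1/2) which is independent of $Y$. Here, $I(Y; B, F) = 3$ bits. The unique information about $Y$ that is contained only in $B$ and not in $F$ is effectively in $Y_1$, and is given by $\mathrm{Uni}(Y{:}B|F) = I(Y; Y_1) = 1$ bit. The redundant information about $Y$ that is contained in both $B$ and $F$ is effectively in $Y_2$ and is given by $\mathrm{Red}(Y{:}B,F) = I(Y; Y_2) = 1$ bit. Lastly, the synergistic information about $Y$ that is not contained in either $B$ or $F$ alone, but is contained in both of them together, is effectively in the tuple $(Y_3 \oplus N, N)$, and is given by $\mathrm{Syn}(Y{:}B,F) = I(Y; (Y_3 \oplus N, N)) = 1$ bit. This accounts for the $3$ bits in $I(Y; B, F)$.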
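The bit counts in this example can be checked by brute-force enumeration. The sketch below assumes the example is instantiated as $Y = (Y_1, Y_2, Y_3)$, $B = (Y_1, Y_2, Y_3 \oplus N)$, and $F = (Y_2, N)$ with all bits fair and independent; the helper `mutual_information` is our own illustrative function, not part of any package used in the paper. It only verifies the three mutual-information totals, which must equal the sums of the PID terms quoted above.

```python
from collections import defaultdict
from itertools import product
from math import log2


def mutual_information(pairs):
    """I(U;V) in bits, given a list of equally likely (u, v) outcomes."""
    n = len(pairs)
    p_uv, p_u, p_v = defaultdict(float), defaultdict(float), defaultdict(float)
    for u, v in pairs:
        p_uv[(u, v)] += 1 / n
        p_u[u] += 1 / n
        p_v[v] += 1 / n
    return sum(p * log2(p / (p_u[u] * p_v[v])) for (u, v), p in p_uv.items())


# Enumerate the 16 equally likely outcomes of (Y1, Y2, Y3, N).
outcomes = []
for y1, y2, y3, n in product([0, 1], repeat=4):
    y = (y1, y2, y3)
    b = (y1, y2, y3 ^ n)  # assumed form of B
    f = (y2, n)           # assumed form of F
    outcomes.append((y, b, f))

i_joint = mutual_information([(y, (b, f)) for y, b, f in outcomes])
i_b = mutual_information([(y, b) for y, b, f in outcomes])
i_f = mutual_information([(y, f) for y, b, f in outcomes])

print(i_joint, i_b, i_f)
```

Under these assumptions the totals are consistent with the decomposition: $I(Y;B,F) = 3$ bits is the sum of all four terms, $I(Y;B) = 2$ bits is unique plus redundant information in $B$, and $I(Y;F) = 1$ bit is the redundant bit alone, since $F$ carries no unique information here.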
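We also note that defining any one of the PID terms suffices for obtaining the others. This is because of another relationship among the PID terms as follows (Bertschinger et al., 2014): $I(Y; B) = \mathrm{Uni}(Y{:}B|F) + \mathrm{Red}(Y{:}B,F)$. Essentially, $\mathrm{Red}(Y{:}B,F)$ is viewed as the sub-volume between $I(Y; B)$ and $I(Y; F)$ (see Fig. 2). Hence, $\mathrm{Red}(Y{:}B,F) = I(Y; B) - \mathrm{Uni}(Y{:}B|F)$. Lastly, $\mathrm{Syn}(Y{:}B,F) = I(Y; B, F) - \mathrm{Uni}(Y{:}B|F) - \mathrm{Uni}(Y{:}F|B) - \mathrm{Red}(Y{:}B,F)$ (this can be obtained from (1) once both unique and redundant information have been defined). Here, we include a popular definition of $\mathrm{Uni}(Y{:}B|F)$ from (Bertschinger et al., 2014) which is computable using convex optimization.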
Definition 1 (Unique Information (Bertschinger et al., 2014)).
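Let $\Delta$ be the set of all joint distributions on $(Y, B, F)$ and $\Delta_P$ be the set of joint distributions with the same marginals on $(Y, B)$ and $(Y, F)$ as the true distribution $P_{YBF}$, i.e., $\Delta_P = \{Q \in \Delta : Q_{YB} = P_{YB} \text{ and } Q_{YF} = P_{YF}\}$. Then,
$$\mathrm{Uni}(Y{:}B|F) = \min_{Q \in \Delta_P} I_Q(Y; B|F).$$
Here $I_Q(Y; B|F)$ denotes the conditional mutual information when $(Y, B, F)$ have joint distribution $Q$ instead of $P_{YBF}$.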
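To make the constraint set in Definition 1 concrete, the following sketch (writing $Y$ for the target, $B$ for the spurious features, and $F$ for the core features) builds one particular member of that set: the distribution $Q(y,b,f) = P(y,b)\,P(y,f)/P(y)$, which makes $B$ and $F$ conditionally independent given $Y$ while preserving the $(Y,B)$ and $(Y,F)$ marginals. Because the unique information is a minimum over all such $Q$, the conditional mutual information evaluated at this $Q$ is only an upper bound on it; the actual minimization is a convex program and is not attempted here. The toy distribution and all function names are our own illustrative choices.

```python
from collections import defaultdict
from math import log2


def marginal(joint, keep):
    """Marginalize a dict keyed by (y, b, f) onto the index positions in `keep`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return dict(out)


def cond_mutual_info(joint):
    """I(Y;B|F) in bits for a joint distribution keyed by (y, b, f)."""
    p_f = marginal(joint, [2])
    p_yf = marginal(joint, [0, 2])
    p_bf = marginal(joint, [1, 2])
    total = 0.0
    for (y, b, f), p in joint.items():
        if p > 0:
            total += p * log2(p * p_f[(f,)] / (p_yf[(y, f)] * p_bf[(b, f)]))
    return total


def conditionally_independent_q(joint):
    """Q(y,b,f) = P(y,b) P(y,f) / P(y); keeps the (Y,B) and (Y,F) marginals."""
    p_y = marginal(joint, [0])
    p_yb = marginal(joint, [0, 1])
    p_yf = marginal(joint, [0, 2])
    q = {}
    for (y, b), pyb in p_yb.items():
        for (y2, f), pyf in p_yf.items():
            if y2 == y:
                q[(y, b, f)] = pyb * pyf / p_y[(y,)]
    return q


# Toy data: Y is a fair bit, B copies Y exactly, F equals Y with probability 0.75.
p = {}
for y in (0, 1):
    for f in (0, 1):
        p[(y, y, f)] = 0.5 * (0.75 if f == y else 0.25)

q = conditionally_independent_q(p)
upper_bound = cond_mutual_info(q)  # I_Q(Y;B|F), an upper bound on the minimum
print(round(upper_bound, 4))
```

In this toy case $B$ is an exact copy of $Y$, so every distribution with the required marginals yields the same value $H(Y|F) \approx 0.811$ bits, and the bound happens to be tight. In general a solver is needed to find the minimizing $Q$.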
3 Main Results
In this work, we first present an information-theoretic formalization of spurious patterns using the mathematical framework of Partial Information Decomposition (PID).
Proposition 1 (Unique Information as a Measure of Spuriousness).
For a given data distribution, the unique information $\mathrm{Uni}(Y{:}B|F)$ is a measure of spuriousness given a split of the spurious features $B$ and core features $F$.
![Refer to caption](extracted/5699032/blackwell_new.png)
To justify our proposition, we first establish that the unique information $\mathrm{Uni}(Y{:}B|F)$ is a measure of informativeness of the spurious feature $B$ over the core feature $F$. We draw upon a concept in statistical decision theory called Blackwell Sufficiency (Blackwell, 1953), which investigates when a random variable is “more informative” (or “less noisy”) than another for inference (it also relates to stochastic degradation of channels (Venkatesh et al., 2023; Raginsky, 2011)). Let us first discuss this notion intuitively when trying to infer $Y$ using two random variables $B$ and $F$. Suppose there exists a transformation on $F$ that gives a new random variable $B'$ which is always equivalent to $B$ for predicting $Y$. We note that $B'$ and $B$ do not necessarily have to be the same since we only care about inferring $Y$. In fact, $B'$ and $B$ can have additional irrelevant information that does not pertain to $Y$, but solely for the purpose of inferring $Y$, they need to be equivalent. Then, the feature set $F$ will be regarded as “sufficient” with respect to $B$ for predicting $Y$, since $F$ can itself provide all the information that $B$ has about $Y$ (see Fig. 3). This intuition is formalized as:
Definition 2 (Blackwell Sufficiency (Blackwell, 1953)).
A conditional distribution $P_{F|Y}$ is Blackwell sufficient with respect to another conditional distribution $P_{B|Y}$ if and only if there exists a stochastic transformation (equivalently another conditional distribution $P_{B'|F}$ with both $B'$ and $B \in \mathcal{B}$) such that $P_{B'|F} \circ P_{F|Y} = P_{B|Y}$.
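Definition 2 can be illustrated with two binary symmetric channels, writing $Y$ for the target, $F$ for the core feature, and $B$ for the spurious feature. In the sketch below, $F$ is assumed to be $Y$ flipped with probability 0.1 and $B$ to be $Y$ flipped with probability 0.2; these numbers, and the helper names `compose` and `bsc`, are our own illustrative choices. A further flip with probability 0.125 applied to $F$ reproduces the channel from $Y$ to $B$, so in this toy case the core feature is Blackwell sufficient with respect to the spurious one.

```python
def compose(t_b_given_f, p_f_given_y):
    """Composition (T o P_{F|Y})(b|y) = sum_f T(b|f) P_{F|Y}(f|y).

    Each channel is a row-stochastic matrix: rows index the conditioning
    value, columns index the output value.
    """
    n_y = len(p_f_given_y)
    n_f = len(t_b_given_f)
    n_b = len(t_b_given_f[0])
    return [
        [sum(p_f_given_y[y][f] * t_b_given_f[f][b] for f in range(n_f))
         for b in range(n_b)]
        for y in range(n_y)
    ]


def bsc(eps):
    """Binary symmetric channel with flip probability eps."""
    return [[1 - eps, eps], [eps, 1 - eps]]


p_f_given_y = bsc(0.1)    # core feature: mildly noisy view of Y (assumed)
p_b_given_y = bsc(0.2)    # spurious feature: noisier view of Y (assumed)
transform = bsc(0.125)    # candidate stochastic transformation applied to F

composed = compose(transform, p_f_given_y)
matches = all(
    abs(composed[y][b] - p_b_given_y[y][b]) < 1e-12
    for y in range(2) for b in range(2)
)
print(composed, matches)
```

The flip probability 0.125 solves $0.1(1 - t) + 0.9t = 0.2$. No such transformation exists in the reverse direction, because composing channels can only add noise.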
![Refer to caption](extracted/5699032/flow.png)
Now we demonstrate how our proposed unique information is closely tethered to Blackwell Sufficiency, thus justifying our Proposition 1. In fact, the unique information $\mathrm{Uni}(Y{:}B|F)$ is $0$ if and only if $P_{F|Y}$ is Blackwell sufficient with respect to $P_{B|Y}$ (see Theorem 1).
Theorem 1 (Spuriousness and Blackwell Sufficiency).
$\mathrm{Uni}(Y{:}B|F) = 0$ if and only if the conditional distribution $P_{F|Y}$ is Blackwell sufficient with respect to $P_{B|Y}$.
Since the spuriousness (unique information) $\mathrm{Uni}(Y{:}B|F) = 0$ if and only if $P_{F|Y}$ is Blackwell Sufficient with respect to $P_{B|Y}$, we note that $\mathrm{Uni}(Y{:}B|F) > 0$ captures the “departure” from Blackwell Sufficiency, and thus quantifies relative informativeness. Intuitively, $\mathrm{Uni}(Y{:}B|F) > 0$ means that, for the given data distribution, there is no transformation on the core feature $F$ that is equivalent to the spurious feature $B$ for the purpose of predicting $Y$. This essentially makes the spurious feature $B$ indispensable to the model for predicting $Y$, forcing the model to use or emphasize it in decision-making.
Next, we discuss some desirable properties of the unique information $\mathrm{Uni}(Y{:}B|F)$.
Theorem 2.
The measure $\mathrm{Uni}(Y{:}B|F)$ satisfies the following desirable properties:
- $\mathrm{Uni}(Y{:}B|F) \ge 0$ and is $0$ if $I(Y; B) = 0$ (spurious feature has no information about $Y$).
- $\mathrm{Uni}(Y{:}B|F)$ is non-decreasing if more features are added to $B$, i.e., if the set of spurious features grows, its unique information over the core features does not decrease.
- $\mathrm{Uni}(Y{:}B|F)$ is non-increasing if more features are added to $F$, i.e., if the set of core features grows, the unique information in the spurious features does not increase.
Spuriousness Disentangler (Autoencoder-based estimator): Next, we propose an autoencoder-based estimation framework – that we call Spuriousness Disentangler – to calculate the PID values. The motivation to use this estimator is that since the model learns the features to reconstruct the input image, the encoding of the image should have minimal information loss and hence should be a good low-dimensional representation of the input image. The framework mainly consists of three aspects: clustering, estimation of joint distribution and estimation of PID.
Since we are dealing with high-dimensional data, dimensionality reduction is a necessary first step (Bellman, 1966). Traditionally, the clustering step is done by PCA followed by k-means clustering. However, in our setting, we can do these two steps together using an autoencoder, which is a deep neural network consisting of an encoder and a decoder, as shown in Fig. 4. The output of the encoder is the embedding of the input image, i.e., a low-dimensional representation of it. The weights of this output layer, defined as the clustering layer, are used as the cluster centers, initialized by the k-means clustering algorithm. The clustering layer is optimized using the weighted sum of the representation loss $L_r$ and the clustering loss $L_c$. The overall loss function is defined as $L = L_r + \gamma L_c$, where $\gamma$ is a non-negative constant. The clustering loss is the KL divergence, which measures the dissimilarity between distributions (Xie et al., 2016; Guo et al., 2017). For cluster centers $\mu_j$ and embedded point $z_i$ (output of the encoder), the soft assignment $q_{ij}$ is defined as follows (Van der Maaten & Hinton, 2008):
$$q_{ij} = \frac{\left(1 + \|z_i - \mu_j\|^2\right)^{-1}}{\sum_{j'} \left(1 + \|z_i - \mu_{j'}\|^2\right)^{-1}} \tag{2}$$
where $q_{ij}$ is the $j$th entry of the soft label $q_i$, denoting the probability of $z_i$ belonging to cluster $j$. The loss $L_c = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$, where $P$ is the target distribution.
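A minimal sketch of the soft cluster assignment and the KL clustering loss follows. The Student's t kernel matches the soft label described above. The sharpened target distribution, $p_{ij} \propto q_{ij}^2 / \sum_i q_{ij}$, is an assumption borrowed from deep embedded clustering (Xie et al., 2016), since the text here does not spell it out, and the toy embeddings and centers are made up for illustration.

```python
from math import log


def soft_assignment(z, mu):
    """q[i][j] proportional to (1 + ||z_i - mu_j||^2)^(-1); each row sums to 1."""
    q = []
    for z_i in z:
        kernel = [
            1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(z_i, mu_j)))
            for mu_j in mu
        ]
        norm = sum(kernel)
        q.append([k / norm for k in kernel])
    return q


def target_distribution(q):
    """Assumed DEC-style target: p[i][j] proportional to q[i][j]^2 / sum_i q[i][j]."""
    freq = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(len(row))]
        norm = sum(weights)
        p.append([w / norm for w in weights])
    return p


def clustering_loss(p, q):
    """KL(P || Q) summed over all points and clusters."""
    return sum(
        p_ij * log(p_ij / q_ij)
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
    )


# Toy 2-D embeddings and two cluster centers (illustrative values only).
z = [[-1.2, 0.1], [-0.8, -0.3], [0.9, 0.2], [1.1, -0.1]]
mu = [[-1.0, 0.0], [1.0, 0.0]]

q = soft_assignment(z, mu)
p = target_distribution(q)
loss = clustering_loss(p, q)
hard_labels = [row.index(max(row)) for row in q]  # discrete cluster per point
print(hard_labels, loss)
```

Taking the arg-max of each soft label gives the discrete cluster index, which is the kind of low-dimensional, discrete representation the later PID estimation step needs.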
![Refer to caption](extracted/5699032/balanced_unbalanced_V3.0.png)
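The representation loss $L_r$ is the mean square error between the input of the encoder and the output of the decoder, defined as $L_r = \frac{1}{n} \sum_{i=1}^{n} \|x_i - \hat{x}_i\|_2^2$, where $x_i$ is the $i$th input image, $\hat{x}_i$ is its reconstruction, and $n$ is the number of images.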
The next step is to estimate the PID values. For this, the joint distribution of three random variables (e.g. the clusters of foreground, background and the binary label) is calculated using histograms, and then the PID values are obtained from the DIT package (James et al., 2018).
4 Experiments
We demonstrate experimental results to provide evidence in support of Proposition 1 for different experimental setups, i.e., unbalanced, balanced, and mixed background datasets. We illustrate how unique information in the spurious features has a tradeoff with the worst-group-accuracy, thus justifying its use as a measure of the spuriousness of a dataset. We also show a comparative analysis for PCA-based and autoencoder-based PID measurements.
Datasets: We conduct experiments on two datasets: Waterbird (Wah et al., 2011) and Dominoes (Shah et al., 2020), both framed as binary classification tasks.
The Waterbird dataset (Wah et al., 2011) is a popular spurious correlation benchmark. The task is to classify the type of bird (waterbird or landbird). However, there exists a spurious correlation between the backgrounds (water or land) and the labels (bird type). The four groups are, in order, landbirds on land backgrounds, landbirds on water backgrounds, waterbirds on land backgrounds, and waterbirds on water backgrounds. We refer to the bird as the foreground of the image.
Dominoes is a synthetic dataset created by combining handwritten digits (zero and one) from MNIST (Deng, 2012) and images of cars and trucks from CIFAR (Krizhevsky et al., 2009), with the digit at the top of the image and the car or truck at the bottom. We make two versions of this synthetic dataset, namely Dominoes 1.0 and Dominoes 2.0, inducing different degrees of bias. The task is to classify whether the image contains a car or a truck; hence the car or truck corresponds to the core features (foreground), while the digits are considered the spurious features (background). The four groups are, in order, images where the top half is a zero and the bottom half is a car, the top half is a one and the bottom half is a car, the top half is a zero and the bottom half is a truck, and the top half is a one and the bottom half is a truck.
4.1 Comparison between group-balanced and unbalanced datasets
We observe the relationship between the PID values and worst-group-accuracy for (i) an unbalanced dataset (which has spurious correlations) and (ii) a balanced dataset where the spurious correlation with the background is removed through sampling (balancing).
Problem Setup: We use group-balanced and unbalanced data for this part of the experiment. The balanced-unbalanced scenario arises from the four different groups that are present in the dataset. For the Waterbird dataset, the majority groups consist of the waterbirds with water backgrounds and landbirds with land backgrounds, and the other two combinations are the minority groups. Similarly, in the Dominoes dataset, cars with digit zero and trucks with digit one are the majority groups and the other two combinations are the minority groups. Worst-group-accuracy refers to the accuracy on the worst-performing group, which is generally a minority group for a model trained on a biased (i.e., unbalanced) dataset. The group-balanced dataset has an equal number of samples in each group, resulting in unbiased model training. We begin by using our autoencoder-based estimator, Spuriousness Disentangler, on both datasets and estimate the PID values separately for the background and foreground. For the Waterbird dataset, this separation is done using the segmentation mask of the foreground. Next, we fine-tune a pre-trained ResNet (He et al., 2016) model and calculate the worst-group-accuracy and the mean accuracy over all groups.
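Worst-group-accuracy can be computed by grouping test samples by their (label, background) pair and taking the minimum per-group accuracy. The sketch below uses hypothetical function names and a made-up set of predictions from a model that simply predicts the background, which is the failure mode discussed above.

```python
from collections import defaultdict


def group_accuracies(preds, labels, groups):
    """Accuracy per group, where each sample carries a hashable group id."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, label, group in zip(preds, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(preds, labels, groups):
    """Minimum accuracy over all groups."""
    return min(group_accuracies(preds, labels, groups).values())


# Illustrative data: label 0/1 is the bird type, background 0/1 is land/water.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
backgrounds = [0, 0, 0, 1, 1, 1, 1, 0]
preds = list(backgrounds)  # a model that relies only on the background
groups = list(zip(labels, backgrounds))

mean_acc = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
wga = worst_group_accuracy(preds, labels, groups)
print(mean_acc, wga)
```

In this toy case the mean accuracy is 0.75 while the worst-group-accuracy is 0.0, which shows how the average can hide a complete failure on the minority groups.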
![Refer to caption](extracted/5699032/Picture1.png)
![Refer to caption](extracted/5699032/mixed_data.png)
![Refer to caption](extracted/5699032/add_cat_V3.0.png)
Observations: Fig. 5 shows our findings regarding the PID values and the worst-group-accuracy for the three datasets. Firstly, we observe that the unique information in the background is generally much higher than the other PID values, namely the unique information in the foreground, redundancy, and synergy. Secondly, from the first two columns, there is a significant reduction of the unique information in the background, i.e., a reduction in spuriousness, when the dataset is balanced (having an equal number of samples in all groups reduces the bias in the dataset), and all PID values are then of comparable magnitude. Next, from the last column of Fig. 5, the worst-group-accuracies are lower for the unbalanced case and improve significantly when the datasets become balanced, which corresponds to low spuriousness in the dataset. Finally, Fig. 6 shows through Grad-CAM (Selvaraju et al., 2017) images that when the dataset is balanced, the model places more emphasis on the core features (the red regions), namely the waterbird or landbird for the Waterbird dataset and the car or truck for the Dominoes dataset, while for the unbalanced dataset the background is more highlighted, which results in poor worst-group-accuracy.
4.2 Tradeoffs for varying levels of background mixing
Next, we look into the datasets for varying levels of background mixing to observe the tradeoffs between the spuriousness and the worst-group-accuracy.
![Refer to caption](extracted/5699032/WG_Acc_V3.0.png)
Problem Setup: Starting with the dataset creation, we mix two backgrounds in two different ways: (i) half of a land background is concatenated with half of a water background (named concatenation); and (ii) the whole image of a land background is summed with a water background (named addition). Similar techniques are applied for background mixing in the Dominoes dataset (see Fig. 7). Then, the foreground is superimposed on the mixed background. Next, the PID values are calculated for the mixed background and the foreground using our estimator. We fine-tune the pre-trained ResNet (He et al., 2016) on the whole image (mixed background with foreground) and evaluate the model on the normal test dataset (without any modification). One motivation for mixing the backgrounds is to remove the group bias that is generated by the correlation between the background and the label in the dataset. This should help mitigate the spurious correlation, since the background is no longer different for different groups.
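The two mixing schemes can be sketched with array operations. The choices below are assumptions made for illustration and are not specified in the text: which half of each background is kept, the split along the width axis, clipping of the summed image to $[0, 1]$, and the use of a binary foreground mask for superimposition.

```python
import numpy as np


def concatenate_backgrounds(land, water):
    """Left half of the land image next to the right half of the water image (assumed split)."""
    half = land.shape[1] // 2
    return np.concatenate([land[:, :half], water[:, half:]], axis=1)


def add_backgrounds(land, water):
    """Pixel-wise sum of both backgrounds, clipped to the valid range [0, 1] (assumed)."""
    return np.clip(land + water, 0.0, 1.0)


def superimpose(foreground, mask, background):
    """Paste the foreground onto the background where the binary mask is 1."""
    m = mask[..., None]  # broadcast the H x W mask over the channel axis
    return m * foreground + (1.0 - m) * background


# Toy 4 x 4 RGB images with constant intensities.
land = np.full((4, 4, 3), 0.2)
water = np.full((4, 4, 3), 0.6)
foreground = np.ones((4, 4, 3))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0  # the "bird" occupies the centre

concat_bg = concatenate_backgrounds(land, water)
added_bg = add_backgrounds(land, water)
image = superimpose(foreground, mask, concat_bg)
print(concat_bg[0, :, 0], added_bg[0, 0, 0], image[1, 1, 0])
```

Either way, every image ends up with both background types present, so the background alone no longer separates the groups.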
Observations: Firstly, in Fig. 8 we can observe that unique information in the background is prominent in the unbalanced case and it decreases for both addition and concatenation scenarios which indicates spuriousness reduction while using addition and concatenation datasets. Next, we observe a trend in Fig. 9 between the unique information in background i.e., spuriousness and the worst-group-accuracy: with increasing unique information in background i.e., spuriousness, the worst-group-accuracy decreases. This trend is obtained for unbalanced, addition and concatenation datasets (lowest W.G. Acc. for unbalanced and highest W.G. Acc. for concatenation).
Table 1: Worst-group-accuracy (%) for all dataset variants.
Dataset | Unbalanced | Balanced | Addition | Concatenation |
---|---|---|---|---|
Waterbird | 29.75 | 86.60 | 89.88 | 92.99 |
Dominoes 1.0 | 90.73 | 91.42 | 94.665 | 96.45 |
Dominoes 2.0 | 79.79 | 86.94 | 85.51 | 87.35 |
Table 2: PID values for all dataset variants (Uniq-B: unique information in the background; Uniq-F: unique information in the foreground).
Dataset | Redundancy (Unbalanced) | Uniq-B (Unbalanced) | Uniq-F (Unbalanced) | Synergy (Unbalanced) | Redundancy (Balanced) | Uniq-B (Balanced) | Uniq-F (Balanced) | Synergy (Balanced) |
---|---|---|---|---|---|---|---|---|
Waterbird | 0.005677 | 0.166927 | 7.75E-07 | 0.018373 | 0.002635 | 0.000127 | 0.00909 | 0.02325 |
Dominoes 1.0 | 0.015414 | 0.172779 | 3.18E-09 | 0.006822 | 0.000296 | 0.000213 | 0.001261 | 0.01548 |
Dominoes 2.0 | 0.029422 | 0.56187 | 6.00E-06 | 0.006134 | 0.014792 | 0.246192 | 9.81E-07 | 0.022527 |

Dataset | Redundancy (Concatenation) | Uniq-B (Concatenation) | Uniq-F (Concatenation) | Synergy (Concatenation) | Redundancy (Addition) | Uniq-B (Addition) | Uniq-F (Addition) | Synergy (Addition) |
---|---|---|---|---|---|---|---|---|
Waterbird | 0.000162 | 0.000069 | 0.005487 | 0.014036 | 0.000375 | 0.000096 | 0.005302 | 0.015623 |
Dominoes 1.0 | 0.000949 | 0.000006 | 0.01443 | 0.006292 | 0.000933 | 0.000036 | 0.020326 | 0.007635 |
Dominoes 2.0 | 0.000103 | 0.000003 | 0.047737 | 0.009618 | 0.000141 | 0.000014 | 0.042616 | 0.007464 |
Table 2 reports the PID values, i.e., redundant information, unique information in the background (Uniq-B), unique information in the foreground (Uniq-F), and synergistic information, for all three datasets and all variants of the datasets: unbalanced, balanced, concatenation, and addition. Table 1 shows the worst-group-accuracy for all types of datasets. Observe that the worst-group-accuracy is lowest for the unbalanced dataset and highest for the concatenation dataset.
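The inverse ordering between spuriousness and worst-group-accuracy can be read directly off the two tables. The short sketch below copies the Uniq-B and worst-group-accuracy values for the unbalanced, addition, and concatenation variants and checks that ranking the variants by decreasing Uniq-B also ranks them by increasing worst-group-accuracy; the helper name `inversely_ordered` is ours.

```python
variants = ["Unbalanced", "Addition", "Concatenation"]

# Worst-group-accuracy values copied from Table 1, in the order of `variants`.
worst_group_acc = {
    "Waterbird": [29.75, 89.88, 92.99],
    "Dominoes 1.0": [90.73, 94.665, 96.45],
    "Dominoes 2.0": [79.79, 85.51, 87.35],
}

# Uniq-B values copied from Table 2, in the order of `variants`.
uniq_b = {
    "Waterbird": [0.166927, 0.000096, 0.000069],
    "Dominoes 1.0": [0.172779, 0.000036, 0.000006],
    "Dominoes 2.0": [0.56187, 0.000014, 0.000003],
}


def inversely_ordered(spuriousness, accuracy):
    """True if sorting by decreasing spuriousness sorts accuracy in increasing order."""
    order = sorted(range(len(spuriousness)), key=lambda i: spuriousness[i], reverse=True)
    ranked_acc = [accuracy[i] for i in order]
    return ranked_acc == sorted(ranked_acc)


trend = {name: inversely_ordered(uniq_b[name], worst_group_acc[name]) for name in uniq_b}
print(trend)
```

This is only a rank check over three points per dataset, so it illustrates the reported trend without measuring its strength.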
5 Conclusion
Quantifying and explaining the spuriousness of a dataset can provide an efficient way to assess dataset quality without training a model for hours. In this work, we theoretically quantify spuriousness in a dataset with unique information, leveraging the mathematical tool of Partial Information Decomposition (PID). We demonstrate (with empirical validation) that unique information in the background can measure spuriousness and relate it to the worst-group-accuracy for various spurious correlation mitigation techniques. We also propose a novel autoencoder-based estimator for high-dimensional continuous image data, showing its superiority over classical estimators. However, there are some limitations. Firstly, to estimate the unique information, one has to identify the spurious features and core features of a given dataset, which is not always straightforward. Moreover, the estimation is highly data-dependent: a small change in the dataset can greatly affect the PID values. Nonetheless, formally quantifying spuriousness can lead to more effective bias mitigation strategies.
References
- Banerjee et al. (2018) Banerjee, P. K., Rauh, J., and Montúfar, G. Computing the unique information. In IEEE International Symposium on Information Theory, pp. 141–145, 2018.
- Bellman (1966) Bellman, R. Dynamic programming. Science, 153(3731):34–37, 1966.
- Bertschinger et al. (2014) Bertschinger, N., Rauh, J., Olbrich, E., Jost, J., and Ay, N. Quantifying unique information. Entropy, 16(4):2161–2183, 2014.
- Blackwell (1953) Blackwell, D. Equivalent comparisons of experiments. The annals of mathematical statistics, pp. 265–272, 1953.
- Bradley et al. (2000) Bradley, P. S., Bennett, K. P., and Demiriz, A. Constrained k-means clustering. Microsoft Research, Redmond, 20(0):0, 2000.
- Cover & Thomas (2012) Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, 2012.
- Deng (2012) Deng, L. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
- Dutta & Hamman (2023) Dutta, S. and Hamman, F. A review of partial information decomposition in algorithmic fairness and explainability. Entropy, 25(5):795, 2023.
- Dutta et al. (2020) Dutta, S., Venkatesh, P., Mardziel, P., Datta, A., and Grover, P. An information-theoretic quantification of discrimination with exempt features. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3825–3833, 2020.
- Dutta et al. (2021) Dutta, S., Venkatesh, P., Mardziel, P., Datta, A., and Grover, P. Fairness under feature exemptions: Counterfactual and observational measures. IEEE Transactions on Information Theory, 67(10):6675–6710, 2021.
- Ehrlich et al. (2022) Ehrlich, D. A., Schneider, A. C., Wibral, M., Priesemann, V., and Makkeh, A. Partial information decomposition reveals the structure of neural representations. arXiv preprint arXiv:2209.10438, 2022.
- Ghouse et al. (2024) Ghouse, G., Rehman, A. U., and Bhatti, M. I. Understanding of causes of spurious associations: Problems and prospects. Journal of Statistical Theory and Applications, 23(1):44–66, 2024.
- Guo et al. (2017) Guo, X., Liu, X., Zhu, E., and Yin, J. Deep clustering with convolutional autoencoders. In Neural Information Processing: 24th International Conference, ICONIP 2017, Guangzhou, China, November 14-18, 2017, Proceedings, Part II 24, pp. 373–382. Springer, 2017.
- Haig (2003) Haig, B. D. What is a spurious correlation? Understanding Statistics: Statistical Issues in Psychology, Education, and the Social Sciences, 2(2):125–132, 2003.
- Hamman & Dutta (2024a) Hamman, F. and Dutta, S. Demystifying local and global fairness trade-offs in federated learning using information theory. In International Conference on Learning Representations (ICLR), 2024a.
- Hamman & Dutta (2024b) Hamman, F. and Dutta, S. A unified view of group fairness tradeoffs using partial information decomposition. arXiv preprint arXiv:2406.04562, 2024b.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi: 10.1109/CVPR.2016.90.
- Hotelling (1933) Hotelling, H. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.
- Izmailov et al. (2022) Izmailov, P., Kirichenko, P., Gruver, N., and Wilson, A. G. On feature learning in the presence of spurious correlations. Advances in Neural Information Processing Systems, 35:38516–38532, 2022.
- James et al. (2018) James, R. G., Ellison, C. J., and Crutchfield, J. P. dit: a Python package for discrete information theory. The Journal of Open Source Software, 3(25):738, 2018. doi: https://doi.org/10.21105/joss.00738.
- Kirichenko et al. (2022) Kirichenko, P., Izmailov, P., and Wilson, A. G. Last layer re-training is sufficient for robustness to spurious correlations. arXiv preprint arXiv:2204.02937, 2022.
- Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
- Law & Jain (2006) Law, M. H. and Jain, A. K. Incremental nonlinear dimensionality reduction by manifold learning. IEEE transactions on pattern analysis and machine intelligence, 28(3):377–391, 2006.
- Lee & Verleysen (2005) Lee, J. A. and Verleysen, M. Nonlinear dimensionality reduction of data manifolds with essential loops. Neurocomputing, 67:29–53, 2005.
- Liang et al. (2023) Liang, P. P., Cheng, Y., Fan, X., Ling, C. K., Nie, S., Chen, R., Deng, Z., Allen, N., Auerbach, R., Mahmood, F., et al. Quantifying & modeling multimodal interactions: An information decomposition framework. Advances in Neural Information Processing Systems, 36, 2023.
- Liang et al. (2024) Liang, P. P., Ling, C. K., Cheng, Y., Obolenskiy, A., Liu, Y., Pandey, R., Wilf, A., Morency, L.-P., and Salakhutdinov, R. Multimodal learning without labeled multimodal data: Guarantees and applications. International Conference on Learning Representations (ICLR), 2024.
- Liu et al. (2023) Liu, S., Zhang, X., Sekhar, N., Wu, Y., Singhal, P., and Fernandez-Granda, C. Avoiding spurious correlations via logit correction. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=5BaqCFVh5qL.
- Lynch et al. (2023) Lynch, A., Dovonon, G. J., Kaddour, J., and Silva, R. Spawrious: A benchmark for fine control of spurious correlation biases. arXiv preprint arXiv:2303.05470, 2023.
- MacQueen et al. (1967) MacQueen, J. et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pp. 281–297. Oakland, CA, USA, 1967.
- Moayeri et al. (2023) Moayeri, M., Wang, W., Singla, S., and Feizi, S. Spuriosity rankings: sorting data to measure and mitigate biases. Advances in Neural Information Processing Systems, 36:41572–41600, 2023.
- Mohamadi et al. (2023) Mohamadi, S., Doretto, G., and Adjeroh, D. A. More synergy, less redundancy: Exploiting joint mutual information for self-supervised learning. arXiv preprint arXiv:2307.00651, 2023.
- Pakman et al. (2021) Pakman, A., Nejatbakhsh, A., Gilboa, D., Makkeh, A., Mazzucato, L., Wibral, M., and Schneidman, E. Estimating the unique information of continuous variables. Advances in neural information processing systems, 34:20295–20307, 2021.
- Raginsky (2011) Raginsky, M. Shannon meets blackwell and le cam: Channels, codes, and statistical experiments. In 2011 IEEE International Symposium on Information Theory Proceedings, pp. 1220–1224. IEEE, 2011.
- Sadeghi & Armanfard (2023) Sadeghi, M. and Armanfard, N. Deep clustering with self-supervision using pairwise data similarities. Authorea Preprints, 2023.
- Sagawa et al. (2019) Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.
- Selvaraju et al. (2017) Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626, 2017.
- Shah et al. (2020) Shah, H., Tamuly, K., Raghunathan, A., Jain, P., and Netrapalli, P. The pitfalls of simplicity bias in neural networks. Advances in Neural Information Processing Systems, 33, 2020.
- Singla & Feizi (2021) Singla, S. and Feizi, S. Salient imagenet: How to discover spurious features in deep learning? arXiv preprint arXiv:2110.04301, 2021.
- Srivastava (2023) Srivastava, M. Addressing spurious correlations in machine learning models: A comprehensive review. OSF Prepr, 2023.
- Stromberg et al. (2024) Stromberg, N., Ayyagari, R., Welfert, M., Koyejo, S., and Sankar, L. Robustness to subpopulation shift with domain label noise via regularized annotation of domains. arXiv preprint arXiv:2402.11039, 2024.
- Tax et al. (2017) Tax, T., Mediano, P., and Shanahan, M. The partial information decomposition of generative neural network models. Entropy, 19(9):474, 2017.
- Van der Maaten & Hinton (2008) Van der Maaten, L. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
- Venkatesh & Schamberg (2022) Venkatesh, P. and Schamberg, G. Partial information decomposition via deficiency for multivariate gaussians. In 2022 IEEE International Symposium on Information Theory (ISIT), pp. 2892–2897. IEEE, 2022.
- Venkatesh et al. (2023) Venkatesh, P., Gurushankar, K., and Schamberg, G. Capturing and interpreting unique information. In 2023 IEEE International Symposium on Information Theory (ISIT), pp. 2631–2636. IEEE, 2023.
- Venkatesh et al. (2024) Venkatesh, P., Bennett, C., Gale, S., Ramirez, T., Heller, G., Durand, S., Olsen, S., and Mihalas, S. Gaussian partial information decomposition: Bias correction and application to high-dimensional data. Advances in Neural Information Processing Systems, 36, 2024.
- Wah et al. (2011) Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
- Wang et al. (2014) Wang, W., Huang, Y., Wang, Y., and Wang, L. Generalized autoencoder: A neural network framework for dimensionality reduction. In 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 496–503, 2014. doi: 10.1109/CVPRW.2014.79.
- Wang et al. (2015) Wang, Y., Yao, H., Zhao, S., and Zheng, Y. Dimensionality reduction strategy based on auto-encoder. In Proceedings of the 7th International Conference on Internet Multimedia Computing and Service, pp. 1–4, 2015.
- Williams & Beer (2010) Williams, P. L. and Beer, R. D. Nonnegative decomposition of multivariate information. arXiv preprint arXiv:1004.2515, 2010.
- Wollstadt et al. (2023) Wollstadt, P., Schmitt, S., and Wibral, M. A rigorous information-theoretic definition of redundancy and relevancy in feature selection based on (partial) information decomposition. J. Mach. Learn. Res., 24:131–1, 2023.
- Wu et al. (2023) Wu, S., Yuksekgonul, M., Zhang, L., and Zou, J. Discover and cure: concept-aware mitigation of spurious correlation. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
- Xie et al. (2016) Xie, J., Girshick, R., and Farhadi, A. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pp. 478–487. PMLR, 2016.
- Yang et al. (2023) Yang, Y.-Y., Chou, C.-N., and Chaudhuri, K. Understanding rare spurious correlations in neural networks, 2023. URL https://openreview.net/forum?id=lrzX-rNuRvw.
- Ye et al. (2023) Ye, H., Zou, J., and Zhang, L. Freeze then train: Towards provable representation learning under spurious correlations and feature noise. In 26th International Conference on Artificial Intelligence and Statistics (AISTATS 2023), Proceedings of Machine Learning Research, 206:8968–8990, 2023.
- Ye et al. (2024) Ye, W., Zheng, G., Cao, X., Ma, Y., Hu, X., and Zhang, A. Spurious correlations in machine learning: A survey. arXiv preprint arXiv:2402.12715, 2024.
Appendix A Proof of Theorem 1
As a proof sketch, we first derive the following lemma.
Lemma 1.
if and only if there exists a row-stochastic matrix such that: for all and .
Proof.
If , then we have: where : and . Thus, there exists a distribution such that and are independent given under the joint distribution . Then, we have
(3) | ||||
(4) | ||||
(5) | ||||
(6) | ||||
(7) | ||||
(8) |
Here, (a) holds because for all , (b) holds because under joint distribution , variables and are independent given , and (c) simply chooses which is a function of and will lead to a row-stochastic matrix since
Next, we prove the converse. Suppose, such a row-stochastic matrix exists such that:
Now, we can define a joint distribution such that:
(9) |
We can show that is a valid probability distribution since is row stochastic.
(10) |
Also, we can show that since:
(11) |
which holds since such a row-stochastic matrix exists. Also, we have:
(12) |
which holds since is row-stochastic.
Then,
∎
Appendix B Appendix to Experiments
This section includes additional results and figures for a more comprehensive understanding of our work.
B.1 Data
We consider the Waterbird (Wah et al., 2011) and Dominoes datasets. For a summary of the datasets, we refer the reader to Tables 3, 4 and 5.
Waterbird | Group00 | Group01 | Group10 | Group11 |
---|---|---|---|---|
Train | 3498 | 184 | 56 | 1057 |
Validation | 467 | 466 | 133 | 133 |
Test | 2255 | 2255 | 642 | 642 |
Total | 6220 | 2905 | 831 | 1832 |
Dominoes 1.0 | Group00 | Group01 | Group10 | Group11 |
---|---|---|---|---|
Train | 3750 | 1250 | 1250 | 3750 |
Test | 473 | 507 | 507 | 473 |
Total | 4223 | 1772 | 1757 | 4208 |
Dominoes 2.0 | Group00 | Group01 | Group10 | Group11 |
---|---|---|---|---|
Train | 3000 | 500 | 1250 | 3000 |
Test | 245 | 490 | 245 | 490 |
Total | 3245 | 990 | 1495 | 3490 |
B.2 Experimental Setup
Sl. No. | Layer | Filter No. | Kernel Size | Stride | Padding | Output Padding | Output Shape | Param No. |
---|---|---|---|---|---|---|---|---|
1 | Conv2d | 32 | 5 | 2 | 2 | - | (32, 16, 16) | 2432 |
2 | LeakyReLU | - | - | - | - | - | (32, 16, 16) | 0 |
3 | BatchNorm2d | - | - | - | - | - | (32, 16, 16) | 64 |
4 | Conv2d | 64 | 5 | 2 | 2 | - | (64, 8, 8) | 51264 |
5 | LeakyReLU | - | - | - | - | - | (64, 8, 8) | 0 |
6 | BatchNorm2d | - | - | - | - | - | (64, 8, 8) | 128 |
7 | Conv2d | 128 | 3 | 2 | 0 | - | (128, 3, 3) | 73856 |
8 | LeakyReLU | - | - | - | - | - | (128, 3, 3) | 0 |
9 | Flatten | - | - | - | - | - | 1152 | 0 |
10 | Linear (embedding) | - | - | - | - | - | 10 | 11530 |
11 | Clustering Layer | - | - | - | - | - | 10 | 100 |
12 | Linear (deembedding) | - | - | - | - | - | 1152 | 12672 |
13 | LeakyReLU | - | - | - | - | - | 1152 | 0 |
14 | ConvTranspose2d | 64 | 3 | 2 | 0 | 1 | (64, 8, 8) | 73792 |
15 | LeakyReLU | - | - | - | - | - | (64, 8, 8) | 0 |
16 | BatchNorm2d | - | - | - | - | - | (64, 8, 8) | 128 |
17 | ConvTranspose2d | 32 | 5 | 2 | 2 | 1 | (32, 16, 16) | 51232 |
18 | LeakyReLU | - | - | - | - | - | (32, 16, 16) | 0 |
19 | BatchNorm2d | - | - | - | - | - | (32, 16, 16) | 64 |
20 | ConvTranspose2d | 3 | 5 | 2 | 2 | 1 | (3, 32, 32) | 2403 |
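The output shapes and parameter counts in the table above can be cross-checked with the usual convolution arithmetic. The sketch below is a minimal illustration and not the authors' code: it assumes 3-channel 32×32 inputs (inferred from the first and last rows), square kernels, and bias terms in every convolutional and linear layer. All function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a strided convolution (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding, output_padding):
    """Spatial output size of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a (transposed) convolution with a square kernel."""
    return c_in * c_out * kernel * kernel + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Assumption: 3-channel 32x32 input, inferred from the table rows.
channels, size = 3, 32

# Encoder rows 1, 4, 7: (filters, kernel, stride, padding).
encoder_rows = []
for c_out, k, s, p in [(32, 5, 2, 2), (64, 5, 2, 2), (128, 3, 2, 0)]:
    params = conv_params(channels, c_out, k)
    channels, size = c_out, conv_out(size, k, s, p)
    encoder_rows.append(((channels, size, size), params))

flat = channels * size * size                 # row 9: Flatten
embedding_params = linear_params(flat, 10)    # row 10
deembedding_params = linear_params(10, flat)  # row 12

# Decoder rows 14, 17, 20: (filters, kernel, stride, padding, output padding).
decoder_rows = []
for c_out, k, s, p, op in [(64, 3, 2, 0, 1), (32, 5, 2, 2, 1), (3, 5, 2, 2, 1)]:
    params = conv_params(channels, c_out, k)
    channels, size = c_out, deconv_out(size, k, s, p, op)
    decoder_rows.append(((channels, size, size), params))

for shape, params in encoder_rows + decoder_rows:
    print(shape, params)
print(flat, embedding_params, deembedding_params)
```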
![Refer to caption](extracted/5699032/autoencoder2.png)
Calculating PIDs: Computing the PID terms (redundancy, unique information, and synergy) involves three main steps. First, the clusters for the given input images are estimated; this step requires the autoencoder. As shown in Fig. 4, a given image is separated into two images: one contains the core features (foreground) and the other contains the spurious features (background). For the Dominoes dataset, the core features are the images of cars or trucks and the spurious features are the images of zeros and ones. The clusters are computed separately for each set of features. The architecture details of the autoencoder for the Dominoes dataset are shown in Table 6; the output of the clustering layer gives the desired clusters. For the Waterbird dataset, the architecture details are given in Fig. 10. The autoencoder for Waterbird is more complex in order to handle the more challenging nature of this dataset compared to Dominoes. The architecture is inspired by Sadeghi & Armanfard (2023). To obtain the clusters, the model is first pretrained with only the mean squared error loss (MSEloss). The model is then trained again with a weighted loss function, which is a weighted sum of the MSEloss and a KL divergence loss. The weights of the clustering layer are initialized with the cluster centers obtained by k-means clustering. For the Dominoes dataset, hyperparameters are as follows: batch size , learning rate , CosineAnnealingLR scheduler, Adam optimizer with weight decay , pretraining epochs and later training is for epochs. The later training process is terminated if the change of label assignments between two consecutive updates for target distribution is less than . For the Waterbird dataset, hyperparameters are as follows: batch size , learning rate , CosineAnnealingLR scheduler, Adam optimizer with weight decay , pretraining epochs and later training is for epochs.
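The clustering stage described above (a KL divergence loss, a periodically updated target distribution, and k-means-initialized cluster centers) resembles deep embedded clustering (Xie et al., 2016; Guo et al., 2017). The following sketch assumes that formulation: a Student's t soft assignment with one degree of freedom and the sharpened target distribution from that line of work. The paper does not spell out its exact loss, so this is an illustration under that assumption; the toy embeddings and centers are hypothetical.

```python
import math


def soft_assign(embeddings, centers, alpha=1.0):
    """Student's t soft assignment q_ij of embedding i to cluster center j."""
    q = []
    for z in embeddings:
        row = []
        for c in centers:
            d2 = sum((a - b) ** 2 for a, b in zip(z, c))
            row.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q


def target_distribution(q):
    """Sharpened targets p_ij, proportional to q_ij^2 / (soft cluster frequency)."""
    freq = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        w = [row[j] ** 2 / freq[j] for j in range(len(row))]
        total = sum(w)
        p.append([v / total for v in w])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all samples and clusters."""
    return sum(
        pij * math.log(pij / qij)
        for prow, qrow in zip(p, q)
        for pij, qij in zip(prow, qrow)
        if pij > 0
    )


def label_change_fraction(old_labels, new_labels):
    """Fraction of samples whose hard cluster label changed between updates."""
    changed = sum(o != n for o, n in zip(old_labels, new_labels))
    return changed / len(new_labels)


# Hypothetical 2-D embeddings and centers (the latter would come from k-means).
embeddings = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
centers = [[0.0, 0.0], [3.0, 3.0]]

q = soft_assign(embeddings, centers)
p = target_distribution(q)
labels = [row.index(max(row)) for row in q]
clustering_loss = kl_divergence(p, q)
print(labels, clustering_loss)
```

In such a setup, training would stop once `label_change_fraction` between consecutive target updates falls below the chosen tolerance.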
Next, the clusters obtained for the foreground and the background, together with the binary labels, are used to estimate the joint distribution using histograms, followed by PID estimation with the dit package (James et al., 2018).
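The histogram step can be sketched as follows. This is a minimal illustration under stated assumptions: it counts co-occurrences of (label, background cluster, foreground cluster) triples to form an empirical joint distribution and computes only the mutual-information terms that a PID would decompose. The unique, redundant, and synergistic terms themselves require the optimization-based estimator from the dit package and are not computed here. The toy cluster labels are hypothetical, and the use of base-2 logarithms (bits) is an assumption.

```python
import math
from collections import Counter, defaultdict


def joint_histogram(labels, bg_clusters, fg_clusters):
    """Empirical joint distribution P(y, b, f) from co-occurrence counts."""
    counts = Counter(zip(labels, bg_clusters, fg_clusters))
    n = sum(counts.values())
    return {key: c / n for key, c in counts.items()}


def marginal(joint, axes):
    """Marginalize the joint distribution onto the given axes of the (y, b, f) key."""
    out = defaultdict(float)
    for key, prob in joint.items():
        out[tuple(key[a] for a in axes)] += prob
    return dict(out)


def mutual_information(joint, x_axes, y_axes):
    """I(X; Y) in bits, where X and Y are groups of axes of the joint key."""
    pxy = marginal(joint, x_axes + y_axes)
    px = marginal(joint, x_axes)
    py = marginal(joint, y_axes)
    mi = 0.0
    for key, prob in pxy.items():
        kx, ky = key[: len(x_axes)], key[len(x_axes):]
        mi += prob * math.log2(prob / (px[kx] * py[ky]))
    return mi


# Hypothetical toy data: the background cluster mostly tracks the label,
# while the foreground cluster is independent of it.
y = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 0, 1, 1, 1, 1, 0]
f = [0, 1, 0, 1, 0, 1, 0, 1]

joint = joint_histogram(y, b, f)
i_y_b = mutual_information(joint, [0], [1])      # I(Y; B)
i_y_f = mutual_information(joint, [0], [2])      # I(Y; F)
i_y_bf = mutual_information(joint, [0], [1, 2])  # I(Y; B, F)
print(i_y_b, i_y_f, i_y_bf)
```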
Calculating Accuracies: To calculate the worst-group-accuracy for the different variants of each dataset, we fine-tune a pre-trained ResNet- (He et al., 2016) model. The worst-group-accuracy is defined as the accuracy of the minority group having the lowest number of training samples (see Tables 3 and 5). For the Waterbird dataset, group has the minimum number of training samples, and for the Dominoes 2.0 dataset, group has the lowest number of minority-group samples. For the Dominoes 1.0 dataset, since group and group have the same number of training and test samples, the worst-group-accuracy is calculated by taking the average of the accuracies of these two groups. For the Dominoes dataset, hyperparameters are as follows: batch size , learning rate , CosineAnnealingLR scheduler, stochastic gradient descent (SGD) optimizer with weight decay , binary cross-entropy loss function and epochs . The training dataset is split into two subsets, i.e., for the training split and for the validation split. For the Waterbird dataset, the batch size is and the other parameters are the same as for Dominoes. For the addition and concatenation datasets, the numbers of sample images in the training and test sets follow the distributions in Tables 3, 4 and 5, and the datasets are created accordingly. For the balanced dataset, we use a weighted random sampler where the weights are selected as the proportion of the groups. All experiments are executed on an NVIDIA RTX A4500.
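The worst-group-accuracy definition above can be sketched as follows: per-group accuracies are computed first, the group or groups with the fewest training samples are selected, and their accuracies are averaged when there is a tie (as for Dominoes 1.0). The training counts are taken from Tables 3 and 4; the per-group accuracies and toy predictions are hypothetical placeholders, and the function names are illustrative.

```python
from collections import defaultdict


def group_accuracies(y_true, y_pred, groups):
    """Accuracy per group, where each sample carries a group identifier."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(acc, train_counts):
    """Accuracy of the group(s) with the fewest training samples (ties averaged)."""
    smallest = min(train_counts.values())
    minority = sorted(g for g, n in train_counts.items() if n == smallest)
    return sum(acc[g] for g in minority) / len(minority), minority


# Training counts per group, from Tables 3 and 4.
waterbird_train = {"00": 3498, "01": 184, "10": 56, "11": 1057}
dominoes1_train = {"00": 3750, "01": 1250, "10": 1250, "11": 3750}

# Hypothetical per-group test accuracies, for illustration only.
acc = {"00": 0.9, "01": 0.6, "10": 0.5, "11": 0.8}

wga_waterbird, minority_waterbird = worst_group_accuracy(acc, waterbird_train)
wga_dominoes1, minority_dominoes1 = worst_group_accuracy(acc, dominoes1_train)
print(minority_waterbird, wga_waterbird)
print(minority_dominoes1, wga_dominoes1)

# Per-group accuracies from toy predictions.
toy_acc = group_accuracies([0, 0, 1, 1], [0, 1, 1, 1], ["00", "01", "10", "11"])
print(toy_acc)
```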
B.3 Additional Results
Fig. 11 shows Grad-CAM (Selvaraju et al., 2017) visualizations for models trained with the unbalanced, balanced, addition, and concatenation datasets (from left, 'a': unbalanced; 'b, c': balanced; 'd': addition; 'e': concatenation). Observe that with the dataset-based mitigation techniques the model focuses on the foreground (red region), while in the unbalanced case the model emphasizes the background. There are also cases where the model does not give importance to any portion of the image (see Fig. 11b).
![Refer to caption](extracted/5699032/gradcam5.png)