Quantifying Spuriousness of Biased Datasets Using
Partial Information Decomposition

Barproda Halder    Faisal Hamman    Pasan Dissanayake    Qiuyi Zhang    Ilia Sucholutsky    Sanghamitra Dutta
Abstract

Spurious patterns refer to a mathematical association between two or more variables in a dataset that are not causally related. However, this notion of spuriousness, which is usually introduced due to sampling biases in the dataset, has classically lacked a formal definition. To address this gap, this work presents the first information-theoretic formalization of spuriousness in a dataset (given a split of spurious and core features) using a mathematical framework called Partial Information Decomposition (PID). Specifically, we disentangle the joint information content that the spurious and core features share about another target variable (e.g., the prediction label) into distinct components, namely unique, redundant, and synergistic information. We propose the use of unique information, with roots in Blackwell Sufficiency, as a novel metric to formally quantify dataset spuriousness and derive its desirable properties. We empirically demonstrate how higher unique information in the spurious features of a dataset can lead a model to choose the spurious features over the core features for inference, often resulting in low worst-group-accuracy. We also propose a novel autoencoder-based estimator for computing unique information that can handle high-dimensional image data. Finally, we show how this unique information in the spurious features is reduced by several dataset-based spurious-pattern-mitigation techniques such as data reweighting and varying levels of background mixing, demonstrating a novel tradeoff between unique information (spuriousness) and worst-group-accuracy.

Accepted at ICML 2024 Workshop on Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models.

Correspondence to: Barproda Halder <<<[email protected]>>>.

1 University of Maryland College Park   2 Google Research   3 Princeton University


1 Introduction

Spurious patterns (Haig, 2003) arise when two or more variables are correlated in a dataset even though they do not have any causal relationship. For example, in the Waterbirds dataset (Wah et al., 2011), most waterbirds have water backgrounds, and most landbirds have land backgrounds (see Fig. 1). This correlation in the dataset misleads a machine learning model into creating a spurious link between background and bird type, since the model often finds the background to be “more informative” than the foreground for predicting the bird type. Learning such spurious links from the data may yield high performance on the training and in-distribution datasets, but it reduces performance on out-of-distribution datasets and hurts worst-group-accuracy (Lynch et al., 2023; Sagawa et al., 2019), i.e., the accuracy on minority groups such as waterbirds with land backgrounds or landbirds with water backgrounds.

Several existing works (Kirichenko et al., 2022; Izmailov et al., 2022; Wu et al., 2023; Ye et al., 2023; Liu et al., 2023) focus on different dataset-based and model-training-based approaches to mitigate spurious patterns and evaluate the empirical performance over out-of-distribution datasets (or, to improve worst-group-accuracy). However, this notion of spuriousness in any given dataset lacks a formal definition. This work addresses this gap by asking the question: Given a split between core and spurious features, how do we formally quantify the spuriousness in any given dataset?

To answer this question, we present an information-theoretic formalization of spurious patterns by leveraging a body of work in information theory called Partial Information Decomposition (PID) (Bertschinger et al., 2014; Banerjee et al., 2018). We note that classical information-theoretic measures such as mutual information (Cover & Thomas, 2012) capture the entire statistical dependency between two random variables but fail to capture how this dependency is distributed among those variables, i.e., the structure of the multivariate information. PID addresses this issue by providing a formal way of disentangling the joint information content between the core and spurious features into non-negative terms, namely, unique, redundant, and synergistic information (see Eq. (1) in Section 2.1).

Figure 1: Spuriousness in the dataset due to sampling bias.

Our proposition is to use the unique information about the target variable $Y$ in the spurious features $B$ that is not in the core features $F$ as a measure of spuriousness in the dataset (denoted as $\mathrm{Uni}(Y{:}B|F)$). To justify our proposition, we discuss how unique information is connected to Blackwell Sufficiency (Blackwell, 1953), a notable concept in statistical decision theory. Blackwell Sufficiency provides a partial ordering on when one random variable can be more “informative” (less noisy) than another for inference. Unique information captures the departure from Blackwell Sufficiency: it goes to zero if and only if one random variable is Blackwell Sufficient over another for a prediction task (see Theorem 1). Thus, unique information intuitively quantifies when one variable can be more informative than another, which we leverage to explain when the spurious feature $B$ can be more informative than the core feature $F$ for the model prediction. Additionally, we show several desirable properties of unique information as a measure of spuriousness in the dataset in Theorem 2. Though PID has recently been applied to a few other areas in machine learning (Tax et al., 2017; Dutta et al., 2020, 2021; Hamman & Dutta, 2024a; Liang et al., 2023; Dutta & Hamman, 2023) (also see Related Works), we are pioneering its use to decompose information in spurious and core features and quantify spuriousness, supported by desirable properties and empirical validation. Our main contributions are as follows:

  • Novel information-theoretic formalization to explain spurious patterns: Though many works attempt to prevent a model from learning spurious patterns, there is a lack of theoretical understanding of the “amount” of spuriousness in a dataset, and of how to quantify and measure it given a split of spurious and core features. Novel to this work, we investigate spuriousness through the lens of partial information decomposition (PID) and provide a fundamental understanding of when a model finds the spurious features to be “more informative” than the core features. We leverage PID to disentangle the joint information content between the core and spurious features into unique, redundant, and synergistic information.

  • Demystifying unique information as a measure of spuriousness: Next, we propose unique information in the spurious features $\mathrm{Uni}(Y{:}B|F)$ as a measure of the spuriousness in a dataset. To justify our proposition, we first establish how unique information $\mathrm{Uni}(Y{:}B|F)$ quantifies the informativeness of a random variable $B$ compared to $F$ for predicting $Y$ (see Theorem 1 for a motivation from Blackwell Sufficiency). Depending on whether the unique information $\mathrm{Uni}(Y{:}B|F)$ increases or decreases, one can then anticipate the extent to which a model will leverage $B$ over $F$ for prediction. Additionally, we show several desirable properties of unique information $\mathrm{Uni}(Y{:}B|F)$ as a measure of spuriousness in Theorem 2. Our measure can identify which features are more likely to be predictive for a classification task, paving a pathway for dataset quality assessment and explaining feature-based informativeness.

  • Spuriousness Disentangler: An autoencoder-based estimator for computing unique information: We propose a novel autoencoder-based framework, which we call the Spuriousness Disentangler, to compute the PID values for high-dimensional image data. The estimator consists of three main parts: (i) First, an autoencoder reduces the dimension of the image data and gives a one-dimensional array of clusters, which serves as a lower-dimensional, discrete feature representation for the image data. Along the lines of Guo et al. (2017), the dimensionality reduction and clustering are performed jointly through minimization of a joint loss function; (ii) Next, the joint probability distribution of this lower-dimensional representation is computed; and (iii) Finally, the PID values are calculated by solving a convex optimization problem using the Discrete Information Theory (DIT) package (James et al., 2018).

  • Experimental Results and Novel Tradeoff: Our experimental results agree with our theoretical postulations, demonstrating an empirical tradeoff between our proposed measure of spuriousness, i.e., $\mathrm{Uni}(Y{:}B|F)$, and an empirical evaluation metric known to be affected by spurious patterns, i.e., worst-group-accuracy. We show that for real-world unbalanced datasets, e.g., the Waterbirds dataset (Wah et al., 2011), the unique information in the spurious feature $\mathrm{Uni}(Y{:}B|F)$ is the most prominent and is significantly higher than any information in the core features. This helps explain why a model trained on this dataset readily uses the spurious feature rather than the core feature for prediction. Additionally, when a dataset-based spurious-correlation-mitigation method such as data reweighting is applied, the unique information in the spurious features $\mathrm{Uni}(Y{:}B|F)$ reduces drastically (again explaining why a model might now be more likely to use the core feature $F$). We also observe a novel tradeoff between unique information $\mathrm{Uni}(Y{:}B|F)$ (the proposed measure of spuriousness) and worst-group-accuracy for varying degrees of background mixing (a form of noise), i.e., the worst-group-accuracy improves as the unique information in the spurious features decreases. We also study Grad-CAM (Selvaraju et al., 2017) visualizations (a technique to generate “visual explanations” for decisions made by Convolutional Neural Network (CNN)-based models) for many of the trained models to further confirm whether the core or spurious feature is actually being emphasized by the model in different experimental setups.
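Step (ii) of the estimator described above amounts to counting co-occurrences of discrete cluster indices. The following is a minimal sketch, assuming each image has already been reduced to a label, a foreground cluster index, and a background cluster index. The function name `empirical_joint` and the toy data are hypothetical and are not taken from the authors' implementation.

```python
from collections import Counter


def empirical_joint(labels, fg_clusters, bg_clusters):
    """Estimate P(Y, F, B) by normalized co-occurrence counts.

    Each argument is a sequence with one discrete value per image.
    Returns a dict mapping (y, f, b) -> estimated probability.
    """
    triples = list(zip(labels, fg_clusters, bg_clusters))
    if not triples:
        raise ValueError("need at least one sample")
    counts = Counter(triples)
    n = len(triples)
    return {triple: c / n for triple, c in counts.items()}


# Toy data (hypothetical): 8 images, binary label, 2 clusters per feature.
labels      = [0, 0, 0, 1, 1, 1, 0, 1]
fg_clusters = [0, 0, 1, 1, 1, 0, 0, 1]
bg_clusters = [0, 0, 0, 1, 1, 1, 1, 0]

joint = empirical_joint(labels, fg_clusters, bg_clusters)
for outcome, prob in sorted(joint.items()):
    print(outcome, prob)
```

A table of this form is the kind of discrete joint distribution that a PID solver takes as input. With few samples and many clusters the estimate becomes sparse, which is one reason the number of clusters matters in practice.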

Related Works: There are several perspectives on spurious correlation (see Haig (2003); Kirichenko et al. (2022); Izmailov et al. (2022); Wu et al. (2023); Ye et al. (2023); Liu et al. (2023); Stromberg et al. (2024); Singla & Feizi (2021); Moayeri et al. (2023) and the references therein; also see surveys (Ye et al., 2024; Srivastava, 2023; Ghouse et al., 2024)). Spuriousness mitigation techniques are broadly divided into two groups: (i) Dataset-based techniques (Kirichenko et al., 2022; Wu et al., 2023) and (ii) Learning-based techniques (Liu et al., 2023; Yang et al., 2023; Ye et al., 2023). Kirichenko et al. (2022) show that last-layer fine-tuning of a pre-trained model with a group-balanced subset of data is sufficient to mitigate spurious correlation. Wu et al. (2023) propose a concept-aware spurious correlation mitigation technique. Ye et al. (2023) introduce a Freeze and Train approach that learns salient features in an unsupervised way and freezes them before training the rest of the features via supervised learning. Yang et al. (2023) explore the effect of different regularization techniques on spurious correlation, and Liu et al. (2023) examine a logit correction loss. Our novelty lies in formalizing the spuriousness of datasets using the PID framework, and explaining how effective a dataset-based spurious-correlation mitigation will be for regular model training.

Partial information decomposition (PID) (Williams & Beer, 2010; Bertschinger et al., 2014; Dutta et al., 2021; Venkatesh & Schamberg, 2022) is an active area of research. PID measures are beginning to be used in different domains of neuroscience and machine learning (Tax et al., 2017; Dutta et al., 2020, 2021; Hamman & Dutta, 2024a; Ehrlich et al., 2022; Liang et al., 2024; Wollstadt et al., 2023; Mohamadi et al., 2023; Venkatesh et al., 2024; Hamman & Dutta, 2024b). However, examining spurious correlation through the lens of PID and observing novel empirical tradeoffs between the spurious pattern and worst-group-accuracy is unexplored. Additionally, there is limited work on calculating PID values for high-dimensional multivariate continuous data. Some existing works (Dutta et al., 2021; Venkatesh & Schamberg, 2022; Venkatesh et al., 2024) handle continuous data under Gaussian assumptions, while Pakman et al. (2021) consider the one-dimensional multivariate case. Hence, estimating PID for high-dimensional data by proper dimensionality reduction and discretization is unexplored.

For dimensionality reduction, different learning-based methods exist (Hotelling, 1933; Law & Jain, 2006; Lee & Verleysen, 2005; Wang et al., 2015, 2014). Similarly, for discretization, different clustering algorithms exist, e.g., k-means clustering (MacQueen et al., 1967; Bradley et al., 2000) and deep embedded clustering (Xie et al., 2016). Along the lines of the autoencoder-based clustering setup of Guo et al. (2017), our proposed Spuriousness Disentangler trains a network to jointly learn a good representation of the input image data in a self-supervised way, ensuring low representation error while simultaneously clustering, to deal with the challenge of high-dimensional and continuous image data.

2 Preliminaries and Background

Let $X=(X_1,X_2,\ldots,X_d)$ be the random variable denoting the input (e.g., an image), where each $X_i\in\mathcal{X}$ and $\mathcal{X}$ denotes a finite set of values. The core features (e.g., the foreground) will be denoted by $F\subseteq X$, and the spurious features (e.g., the background) will be denoted by $B=X\backslash F$. We typically use the notation $\mathcal{B}$ and $\mathcal{F}$ to denote the range of values for the spurious and core features. Let $Y$ denote the target random variable, e.g., the true labels, which lie in the set $\mathcal{Y}$, and let the model predictions be given by $\hat{Y}=m_{\theta}(X)$ (parameterized by $\theta$). Generally, we use the notation $P_A$ to denote the distribution of random variable $A$, and $P_{A|B}$ to denote the conditional distribution of random variable $A$ conditioned on $B$. Depending on the context, we also use more than one random variable as subscript, e.g., $P_{ABY}$ denotes the joint distribution of $(A,B,Y)$.
Whenever necessary, we also use the notation $Q_A$ to denote an alternate distribution on the random variable $A$ that is different from $P_A$. We also use the notation $P_{A|B}\circ P_{B|C}$ to denote a composition of two conditional distributions given by: $P_{A|B}\circ P_{B|C}(a|c)=\sum_{b\in\mathcal{B}}P_{A|B}(a|b)P_{B|C}(b|c)\ \ \forall a\in\mathcal{A},\ c\in\mathcal{C}$, where $\mathcal{A}$, $\mathcal{B}$ and $\mathcal{C}$ denote the range of values that can be taken by random variables $A$, $B$, and $C$.
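The composition of conditional distributions defined above is a sum over the intermediate variable, which for finite alphabets is a matrix product. A minimal sketch follows, assuming conditional distributions are stored as nested dictionaries with the conditioning value as the outer key. The function name `compose` and the numbers are illustrative assumptions.

```python
def compose(p_a_given_b, p_b_given_c):
    """Return P_{A|B} o P_{B|C}, i.e. (a|c) -> sum_b P(a|b) P(b|c).

    p_b_given_c[c][b] = P(b | c) and p_a_given_b[b][a] = P(a | b).
    The result is indexed as result[c][a].
    """
    result = {}
    for c, dist_b in p_b_given_c.items():
        row = {}
        for b, prob_b in dist_b.items():
            for a, prob_a in p_a_given_b[b].items():
                row[a] = row.get(a, 0.0) + prob_a * prob_b
        result[c] = row
    return result


# Illustrative channels (hypothetical numbers).
p_b_given_c = {"c0": {"b0": 0.9, "b1": 0.1},
               "c1": {"b0": 0.2, "b1": 0.8}}
p_a_given_b = {"b0": {"a0": 1.0, "a1": 0.0},
               "b1": {"a0": 0.5, "a1": 0.5}}

p_a_given_c = compose(p_a_given_b, p_b_given_c)
print(p_a_given_c)
```

Each row of the result is again a probability distribution over $\mathcal{A}$, since it is a convex combination of the rows of `p_a_given_b`.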

2.1 Background on Partial Information Decomposition

We provide a brief background on PID that is relevant for the rest of the paper. The classical information-theoretic quantification of the total information that two random variables $A$ and $B$ together hold about $Y$ is given by the mutual information $\mathrm{I}(Y;A,B)$ (see (Cover & Thomas, 2012) for a background on mutual information). Mutual information $\mathrm{I}(Y;A,B)$ is defined as the KL divergence (Cover & Thomas, 2012) between the joint distribution $P_{YAB}$ and the product of the marginal distributions $P_Y\otimes P_{AB}$, and goes to zero if and only if $(A,B)$ is independent of $Y$. Intuitively, this mutual information captures the total predictive power about $Y$ that is present jointly in $(A,B)$, i.e., how well one can learn $Y$ from $(A,B)$ together. However, $\mathrm{I}(Y;A,B)$ only captures the total information content about $Y$ jointly in $(A,B)$ and does not reveal anything about what is unique and what is shared between $A$ and $B$.
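The KL-divergence form of mutual information described above can be evaluated directly for small discrete distributions. A minimal sketch, assuming the joint distribution is given as a dictionary from $(y, x)$ to probability, where $x$ may itself be the pair $(a, b)$; the function name `mutual_information` and both toy distributions are illustrative assumptions.

```python
from math import log2


def mutual_information(joint):
    """I(Y; X) in bits, computed as KL(P_{YX} || P_Y (x) P_X).

    joint maps (y, x) -> P(y, x); x can be a tuple such as (a, b).
    """
    p_y, p_x = {}, {}
    for (y, x), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
        p_x[x] = p_x.get(x, 0.0) + p
    return sum(p * log2(p / (p_y[y] * p_x[x]))
               for (y, x), p in joint.items() if p > 0)


# Case 1: Y = A, with B an independent fair coin.
dependent = {(a, (a, b)): 0.25 for a in (0, 1) for b in (0, 1)}

# Case 2: Y is a fair coin independent of (A, B).
independent = {(y, (a, b)): 0.125
               for y in (0, 1) for a in (0, 1) for b in (0, 1)}

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
print(mi_dependent, mi_independent)
```

In the first case $(A,B)$ determines $Y$, so the mutual information equals the one bit of entropy in $Y$. In the second case it is zero, matching the independence condition stated above. Neither number says how the information is split between $A$ and $B$, which is the gap PID addresses.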

Figure 2: PID of $\mathrm{I}(Y;A,B)$: $\mathrm{I}(Y;A,B)$ is decomposed into four nonnegative terms, namely, unique information in $A$ ($\mathrm{Uni}(Y{:}A|B)$), unique information in $B$ ($\mathrm{Uni}(Y{:}B|A)$), redundant information in both ($\mathrm{Red}(Y{:}A,B)$), and synergistic information in both ($\mathrm{Syn}(Y{:}A,B)$).

PID (Bertschinger et al., 2014; Banerjee et al., 2018) provides a mathematical framework that decomposes the total information content $\mathrm{I}(Y;A,B)$ into four nonnegative terms (also see Fig. 2):

$$\mathrm{I}(Y;A,B)=\mathrm{Uni}(Y{:}B|A)+\mathrm{Uni}(Y{:}A|B)+\mathrm{Red}(Y{:}A,B)+\mathrm{Syn}(Y{:}A,B). \qquad (1)$$

Here, $\mathrm{Uni}(Y{:}B|A)$ denotes the unique information about $Y$ that is only in $B$ but not in $A$. Next, $\mathrm{Red}(Y{:}A,B)$ denotes the redundant information (common knowledge) about $Y$ in both $A$ and $B$. Lastly, $\mathrm{Syn}(Y{:}A,B)$ denotes the synergistic information that is present only jointly in $(A,B)$ but not in either of them individually, e.g., a public and private key can jointly reveal information not in either of them alone.

Motivational Example. Let $Z=(Z_1,Z_2,Z_3)$ with each $Z_i\sim$ i.i.d. Bern(1/2). Let $A=(Z_1,Z_2,Z_3\oplus N)$, $B=(Z_2,N)$, and $N\sim$ Bern(1/2), which is independent of $Z$. Here, $\mathrm{I}(Z;A,B)=3$ bits. The unique information about $Z$ that is contained only in $A$ and not in $B$ is effectively in $Z_1$, and is given by $\mathrm{Uni}(Z{:}A|B)=\mathrm{I}(Z;Z_1)=1$ bit. The redundant information about $Z$ that is contained in both $A$ and $B$ is effectively in $Z_2$ and is given by $\mathrm{Red}(Z{:}A,B)=\mathrm{I}(Z;Z_2)=1$ bit.
Lastly, the synergistic information about $Z$ that is not contained in either $A$ or $B$ alone, but is contained in both of them together, is effectively in the tuple $(Z_3\oplus N,N)$, and is given by $\mathrm{Syn}(Z{:}A,B)=\mathrm{I}(Z;(Z_3\oplus N,N))=1$ bit. This accounts for the $3$ bits in $\mathrm{I}(Z;A,B)$.

We also note that defining any one of the PID terms suffices for obtaining the others. This is because of another relationship among the PID terms (Bertschinger et al., 2014): $\mathrm{I}(Y;A)=\mathrm{Uni}(Y{:}A|B)+\mathrm{Red}(Y{:}A,B)$. Essentially, $\mathrm{Red}(Y{:}A,B)$ is viewed as the sub-volume shared between $\mathrm{I}(Y;A)$ and $\mathrm{I}(Y;B)$ (see Fig. 2). Hence, $\mathrm{Red}(Y{:}A,B)=\mathrm{I}(Y;A)-\mathrm{Uni}(Y{:}A|B)$. Lastly, $\mathrm{Syn}(Y{:}A,B)=\mathrm{I}(Y;A,B)-\mathrm{Uni}(Y{:}A|B)-\mathrm{Uni}(Y{:}B|A)-\mathrm{Red}(Y{:}A,B)$ (obtained from Eq. (1) once both the unique and redundant information have been defined). Here, we include a popular definition of $\mathrm{Uni}(Y{:}A|B)$ from (Bertschinger et al., 2014), which is computable using convex optimization.
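To make these relationships concrete, the sketch below enumerates the motivational example, computes the three mutual informations from entropies, and then derives the remaining PID terms from $\mathrm{Uni}(Z{:}A|B)=1$ bit. That value is taken from the text rather than computed here, since obtaining it in general requires the optimization in Definition 1. The symmetric relation $\mathrm{I}(Y;B)=\mathrm{Uni}(Y{:}B|A)+\mathrm{Red}(Y{:}A,B)$ is used for the second unique term. Helper names are illustrative.

```python
from itertools import product
from math import log2


def entropy(samples):
    """Shannon entropy in bits; samples lists equiprobable outcomes."""
    n = len(samples)
    counts = {}
    for s in samples:
        counts[s] = counts.get(s, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mi(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) over aligned outcome lists."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# The 16 equiprobable outcomes of (Z1, Z2, Z3, N).
outcomes = list(product((0, 1), repeat=4))
Z = [(z1, z2, z3) for z1, z2, z3, n in outcomes]
A = [(z1, z2, z3 ^ n) for z1, z2, z3, n in outcomes]
B = [(z2, n) for z1, z2, z3, n in outcomes]
AB = list(zip(A, B))

i_za, i_zb, i_zab = mi(Z, A), mi(Z, B), mi(Z, AB)

uni_a = 1.0                               # Uni(Z:A|B), taken from the text
red = i_za - uni_a                        # Red = I(Z;A) - Uni(Z:A|B)
uni_b = i_zb - red                        # Uni(Z:B|A) = I(Z;B) - Red
syn = i_zab - uni_a - uni_b - red         # remaining term from Eq. (1)

print(i_za, i_zb, i_zab)
print(uni_a, uni_b, red, syn)
```

The derived values agree with the example: $B$ holds no unique information about $Z$, and the four terms sum to $\mathrm{I}(Z;A,B)$.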

Definition 1 (Unique Information (Bertschinger et al., 2014)).

Let $\Delta$ be the set of all joint distributions on $(Y,A,B)$ and $\Delta_P$ be the set of joint distributions with the same marginals on $(Y,A)$ and $(Y,B)$ as the true distribution $P_{YAB}$, i.e., $\Delta_P=\{Q_{YAB}\in\Delta:\ Q_{YA}=P_{YA}$ and $Q_{YB}=P_{YB}\}$. Then,

$$\mathrm{Uni}(Y{:}A|B)=\min_{Q\in\Delta_P}\mathrm{I}_Q(Y;A|B).$$

Here $\mathrm{I}_Q(Y;A|B)$ denotes the conditional mutual information when $(Y,A,B)$ have joint distribution $Q_{YAB}$ instead of $P_{YAB}$.
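The role of the minimization in Definition 1 can be seen on a standard XOR example, which is not taken from the paper: $A,B$ are i.i.d. fair bits and $Y=A\oplus B$. The sketch below is not a general solver. It only exhibits one feasible $Q\in\Delta_P$ and evaluates $\mathrm{I}_Q(Y;A|B)$, which upper-bounds the minimum. All helper names are illustrative.

```python
from itertools import product
from math import log2


def marginal(joint, keep):
    """Marginalize a joint over (y, a, b) onto the index positions in keep."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def cond_mi(joint):
    """I(Y; A | B) in bits for a joint distribution over (y, a, b)."""
    p_b = marginal(joint, (2,))
    p_yb = marginal(joint, (0, 2))
    p_ab = marginal(joint, (1, 2))
    return sum(p * log2(p * p_b[(b,)] / (p_yb[(y, b)] * p_ab[(a, b)]))
               for (y, a, b), p in joint.items() if p > 0)


def in_delta_p(p, q, tol=1e-12):
    """Check that q has the same (Y,A) and (Y,B) marginals as p."""
    for keep in ((0, 1), (0, 2)):
        mp, mq = marginal(p, keep), marginal(q, keep)
        for key in set(mp) | set(mq):
            if abs(mp.get(key, 0.0) - mq.get(key, 0.0)) > tol:
                return False
    return True


# True distribution P: A, B i.i.d. fair bits, Y = A xor B.
P = {(a ^ b, a, b): 0.25 for a, b in product((0, 1), repeat=2)}

# Candidate Q: Y and A independent fair bits, and B is a copy of A.
Q = {(y, a, a): 0.25 for y, a in product((0, 1), repeat=2)}

feasible = in_delta_p(P, Q)
cmi_p = cond_mi(P)
cmi_q = cond_mi(Q)
print(feasible, cmi_p, cmi_q)
```

Under the true distribution the conditional mutual information is one bit, yet the feasible $Q$ attains zero. Since conditional mutual information is nonnegative, the minimum is zero, so $\mathrm{Uni}(Y{:}A|B)=0$ for XOR: all of the information is synergistic, and plain conditional mutual information would have overstated the unique part.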

3 Main Results

In this work, we first present an information-theoretic formalization of spurious patterns using the mathematical framework of Partial Information Decomposition (PID).

Proposition 1 (Unique Information as a Measure of Spuriousness).

For a given data distribution, the unique information $\mathrm{Uni}(Y{:}B|F)$ is a measure of spuriousness given a split of the spurious features $B$ and core features $F$.

Figure 3: Blackwell Sufficiency

To justify our proposition, we first establish that unique information is a measure of informativeness of the spurious feature $B$ over the core feature $F$. We draw upon a concept in statistical decision theory called Blackwell Sufficiency (Blackwell, 1953), which investigates when a random variable is “more informative” (or “less noisy”) than another for inference (it also relates to stochastic degradation of channels (Venkatesh et al., 2023; Raginsky, 2011)). Let us first discuss this notion intuitively when trying to infer $Y$ using two random variables $F$ and $B$. Suppose there exists a transformation on $F$ that gives a new random variable $B'$ which is always equivalent to $B$ for predicting $Y$. We note that $B'$ and $B$ do not necessarily have to be the same since we only care about inferring $Y$. In fact, $B$ and $B'$ can have additional irrelevant information that does not pertain to $Y$, but solely for the purpose of inferring $Y$, they need to be equivalent. Then, the feature set $F$ is regarded as “sufficient” with respect to $B$ for predicting $Y$, since $F$ can itself provide all the information that $B$ has about $Y$ (see Fig. 3). This intuition is formalized as:

Definition 2 (Blackwell Sufficiency (Blackwell, 1953)).

A conditional distribution $P_{F|Y}$ is Blackwell sufficient with respect to another conditional distribution $P_{B|Y}$ if and only if there exists a stochastic transformation (equivalently, another conditional distribution $P_{B'|F}$ with both $B$ and $B'$ taking values in $\mathcal{B}$) such that $P_{B'|F}\circ P_{F|Y}=P_{B|Y}$.
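For binary variables, Definition 2 can be checked by direct linear algebra. The sketch below assumes $2\times 2$ row-stochastic channel matrices indexed as `k[y][f]`, with $P_{F|Y}$ invertible, so that the only candidate transformation is $T=P_{F|Y}^{-1}P_{B|Y}$, and $F$ is sufficient exactly when $T$ is a valid stochastic matrix. The binary symmetric channels and all function names are illustrative assumptions; larger alphabets or singular matrices would need a linear feasibility solver instead.

```python
def matmul2(m1, m2):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(m1[i][k] * m2[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inverse2(m):
    """Inverse of a 2x2 matrix; raises if it is (numerically) singular."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular; this shortcut does not apply")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def is_stochastic(m, tol=1e-9):
    """True if every row is a probability vector (up to tolerance)."""
    return all(abs(sum(row) - 1.0) <= tol and
               all(-tol <= x <= 1.0 + tol for x in row) for row in m)


def blackwell_sufficient_binary(k_f, k_b):
    """Check whether some channel T satisfies k_f @ T == k_b.

    k_f[y][f] = P(f | y), k_b[y][b] = P(b | y); k_f must be invertible.
    Returns (is_sufficient, T).
    """
    t = matmul2(inverse2(k_f), k_b)
    return is_stochastic(t), t


def bsc(eps):
    """Binary symmetric channel with flip probability eps."""
    return [[1.0 - eps, eps], [eps, 1.0 - eps]]


k_f = bsc(0.1)                  # F is Y with 10% flips
k_b = matmul2(k_f, bsc(0.2))    # B is F with a further 20% flips

forward_ok, t_forward = blackwell_sufficient_binary(k_f, k_b)
reverse_ok, t_reverse = blackwell_sufficient_binary(k_b, k_f)
print(forward_ok, t_forward)
print(reverse_ok, t_reverse)
```

In this toy setup $F$ is sufficient for $B$, with $T$ recovered as the 20% flip channel, while the reverse check fails because the required $T$ has negative entries. This asymmetry is the sense in which Blackwell Sufficiency gives only a partial ordering.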

Figure 4: Spuriousness Disentangler: We propose an autoencoder-based estimator for PID to handle high-dimensional continuous image data. The left side shows the clustering part, where the soft label $q$ (clusters) is optimized by training a deep neural network consisting of an encoder and a decoder with the objective of minimizing the loss $L$. The right side shows the segmentation of an image into background (spurious features) and foreground (core features), followed by clustering. The joint distribution is then estimated and used to compute the final PID values.

Now we demonstrate how our proposed unique information is closely tied to Blackwell Sufficiency, thus justifying Proposition 1. In fact, the unique information $\mathrm{Uni}(Y{:}B|F)$ is $0$ if and only if $P_{F|Y}$ is Blackwell sufficient with respect to $P_{B|Y}$ (see Theorem 1).

Theorem 1 (Spuriousness and Blackwell Sufficiency).

The unique information $\mathrm{Uni}(Y{:}B|F)=0$ if and only if the conditional distribution $P_{F|Y}$ is Blackwell sufficient with respect to $P_{B|Y}$.

Since spuriousness (unique information) $\mathrm{Uni}(Y{:}B|F)=0$ if and only if $P_{F|Y}$ is Blackwell sufficient with respect to $P_{B|Y}$, we note that $\mathrm{Uni}(Y{:}B|F)>0$ captures the “departure” from Blackwell Sufficiency, and thus quantifies relative informativeness. Intuitively, when $\mathrm{Uni}(Y{:}B|F)>0$, no transformation of the core feature $F$ is, for the given data distribution, equivalent to the spurious feature $B$ for the purpose of predicting $Y$. This essentially makes the spurious feature $B$ indispensable to the model for predicting $Y$, forcing the model to use or emphasize it in decision-making.

Next, we discuss some desirable properties of the unique information $\mathrm{Uni}(Y{:}B|F)$.

Theorem 2.

The measure $\mathrm{Uni}(Y{:}B|F)$ satisfies the following desirable properties:

  • $\mathrm{Uni}(Y{:}B|F) \leq \mathrm{I}(Y;B)$, and it is $0$ if $\mathrm{I}(Y;B)=0$ (the spurious feature $B$ has no information about $Y$).

  • $\mathrm{Uni}(Y{:}B|F)$ is non-decreasing if more features are added to $B$, i.e., if the set of spurious features grows, its unique information over the core features does not decrease.

  • $\mathrm{Uni}(Y{:}B|F)$ is non-increasing if more features are added to $F$, i.e., if the set of core features grows, the unique information in the spurious features does not increase.
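The first property can be read off the standard PID identities. The short derivation below is a sketch that assumes the usual non-negative decomposition (as in Bertschinger et al., 2014); the symbol $\mathrm{Red}(Y{:}B,F)$ for the redundant information is notation assumed here for illustration.

```latex
\begin{align*}
\mathrm{I}(Y;B) &= \mathrm{Uni}(Y{:}B|F) + \mathrm{Red}(Y{:}B,F), \\
\mathrm{Red}(Y{:}B,F) \geq 0 \;&\Longrightarrow\; \mathrm{Uni}(Y{:}B|F) \leq \mathrm{I}(Y;B), \\
\mathrm{I}(Y;B) = 0 \;&\Longrightarrow\; 0 \leq \mathrm{Uni}(Y{:}B|F) \leq 0
\;\Longrightarrow\; \mathrm{Uni}(Y{:}B|F) = 0 .
\end{align*}
```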

Spuriousness Disentangler (Autoencoder-based estimator): Next, we propose an autoencoder-based estimation framework, which we call Spuriousness Disentangler, to calculate the PID values. The motivation is that, because the model learns features that reconstruct the input image, the encoding should lose little information and hence be a good low-dimensional representation of the input image. The framework consists of three steps: clustering, estimation of the joint distribution, and estimation of the PID values.

Since we are dealing with high-dimensional data, dimensionality reduction is a necessary first step (Bellman, 1966). Traditionally, the clustering step is done by PCA followed by k-means clustering. However, in our setting, we can do these two steps together using an autoencoder, which is a deep neural network consisting of an encoder and a decoder, as shown in Fig. 4. The output of the encoder is the embedding of the input image, i.e., a low-dimensional representation of it. The weights of this output layer, defined as the clustering layer, are used as the cluster centers and are initialized by the k-means clustering algorithm. The clustering layer is optimized using the weighted sum of the representation loss $L_r$ and the clustering loss $L_c$. The overall loss function is defined as $L = L_r + \gamma L_c$, where $\gamma$ is a non-negative constant. The clustering loss $L_c$ is the KL divergence, which measures the dissimilarity between two distributions (Xie et al., 2016; Guo et al., 2017). For cluster centers $\{\mu_j\}_{1}^{K}$ and embedded point $z_i$ (output of the encoder), $q_{ij}$ is defined as follows (Van der Maaten & Hinton, 2008):

$$q_{ij} = \frac{(1+\|z_i-\mu_j\|^{2})^{-1}}{\sum_{j}(1+\|z_i-\mu_j\|^{2})^{-1}} \tag{2}$$

where $q_{ij}$ is the $j$th entry of the soft label $q_i$, denoting the probability of $z_i$ belonging to cluster $\mu_j$. The loss is

$$L_c = \mathrm{KL}(P\|Q) = \sum_{i}\sum_{j} p_{ij}\log\frac{p_{ij}}{q_{ij}} \quad\text{and}\quad p_{ij} = \frac{q_{ij}^{2}/\sum_{i} q_{ij}}{\sum_{j}\left(q_{ij}^{2}/\sum_{i} q_{ij}\right)},$$

where $P$ is the target distribution.
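The soft labels of Eq. (2), the target distribution, and the clustering loss can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names and the toy embeddings and centers are hypothetical, and a real pipeline would compute these quantities on encoder outputs inside a deep-learning framework.

```python
import math


def soft_assignments(z, centers):
    """Soft labels q_ij of Eq. (2) for embeddings z and cluster centers mu."""
    q = []
    for z_i in z:
        kernel = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(z_i, mu))) for mu in centers]
        total = sum(kernel)
        q.append([k / total for k in kernel])
    return q


def target_distribution(q):
    """Targets p_ij: square q_ij, divide by the cluster frequency, renormalise per point."""
    n_clusters = len(q[0])
    freq = [sum(q_i[j] for q_i in q) for j in range(n_clusters)]
    p = []
    for q_i in q:
        w = [q_i[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(w)
        p.append([w_j / total for w_j in w])
    return p


def clustering_loss(p, q):
    """L_c = KL(P || Q), summed over points i and clusters j."""
    return sum(
        p_ij * math.log(p_ij / q_ij)
        for p_i, q_i in zip(p, q)
        for p_ij, q_ij in zip(p_i, q_i)
        if p_ij > 0
    )


# Hypothetical 2-D embeddings and two cluster centers.
Z = [[0.0, 0.1], [0.2, 0.0], [2.0, 2.1], [1.9, 2.0]]
MU = [[0.0, 0.0], [2.0, 2.0]]
Q = soft_assignments(Z, MU)
P = target_distribution(Q)
L_C = clustering_loss(P, Q)
```

In this toy example the target distribution is sharper than the soft labels (each point's dominant cluster receives more mass under `P` than under `Q`), which is what drives the cluster assignments to become more confident as $L_c$ is minimized.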

Figure 5: The first two columns show a significant drop in spuriousness, i.e., unique information in the background (Uniq-B), when the dataset is changed from unbalanced to balanced (balancing removes the spurious correlation between background and label). The last column shows the improvement in the worst-group (W.G.) and mean accuracy (%) when the dataset is balanced. Teal bars are for the unbalanced dataset (WG-Un, Mean-Un) and coral bars are for the balanced dataset (WG-Ba, Mean-Ba).

The representation loss is the mean squared error between the input of the encoder $x$ and the output of the decoder $x'$, defined as $L_r = \|x - x'\|_2^2$.

The next step is to estimate the PID values. For this, the joint distribution of three random variables (e.g., the clusters of the foreground, the clusters of the background, and the binary label) is calculated using histograms, and then the PID values are obtained from the dit package (James et al., 2018).
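The histogram step can be sketched as follows: count how often each (label, background cluster, foreground cluster) triple occurs and normalise. The function names and the eight toy cluster assignments are assumptions made for illustration. The snippet stops at the joint distribution and two mutual-information sanity checks; the PID itself would then be computed from this joint distribution with a PID routine such as the one in dit.

```python
import math
from collections import Counter


def joint_distribution(labels, bg_clusters, fg_clusters):
    """Empirical joint P(Y, B, F) from per-image (label, background cluster, foreground cluster)."""
    counts = Counter(zip(labels, bg_clusters, fg_clusters))
    n = sum(counts.values())
    return {outcome: c / n for outcome, c in counts.items()}


def mutual_information(joint, i, j):
    """I(X_i; X_j) in bits between coordinates i and j of the joint outcomes."""
    p_ij, p_i, p_j = Counter(), Counter(), Counter()
    for outcome, p in joint.items():
        p_ij[(outcome[i], outcome[j])] += p
        p_i[outcome[i]] += p
        p_j[outcome[j]] += p
    return sum(p * math.log2(p / (p_i[a] * p_j[b])) for (a, b), p in p_ij.items() if p > 0)


# Hypothetical cluster assignments for eight images.
Y = [0, 0, 0, 0, 1, 1, 1, 1]
B = [0, 0, 0, 1, 1, 1, 1, 0]   # background cluster agrees with the label on 6 of 8 images
F = [0, 1, 0, 1, 0, 1, 0, 1]   # foreground cluster is independent of the label here
JOINT = joint_distribution(Y, B, F)
I_YB = mutual_information(JOINT, 0, 1)
I_YF = mutual_information(JOINT, 0, 2)
```

Because $\mathrm{Uni}(Y{:}B|F) \leq \mathrm{I}(Y;B)$, the value `I_YB` gives a quick upper bound on the spuriousness measure before any PID optimization is run.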

4 Experiments

We demonstrate experimental results to provide evidence in support of Proposition 1 for different experimental setups, i.e., unbalanced, balanced, and mixed background datasets. We illustrate how unique information in the spurious features has a tradeoff with the worst-group-accuracy, thus justifying its use as a measure of the spuriousness of a dataset. We also show a comparative analysis for PCA-based and autoencoder-based PID measurements.

Datasets: We conduct experiments on two datasets: Waterbird (Wah et al., 2011) and Dominoes (Shah et al., 2020), both framed as binary classification tasks.

The Waterbird dataset (Wah et al., 2011) is a popular spurious correlation benchmark. The task is to classify the type of bird (waterbird $=1$, landbird $=0$). However, there exists a spurious correlation between the backgrounds (water $=1$, land $=0$) and the labels (bird type). Group00, Group01, Group10, and Group11 denote the groups of images where landbirds are on land backgrounds, landbirds are on water backgrounds, waterbirds are on land backgrounds, and waterbirds are on water backgrounds, respectively. We refer to the bird as the foreground of the image.

Dominoes is a synthetic dataset created by combining handwritten digits (zero and one) from MNIST (Deng, 2012) and images of cars and trucks from CIFAR-10 (Krizhevsky et al., 2009) (digit $0$ or $1$ at the top, car ($=0$) or truck ($=1$) at the bottom of an image). We make two versions of this synthetic dataset, namely Dominoes 1.0 and Dominoes 2.0, inducing different degrees of bias. The task is to classify whether the image contains a car or a truck; hence the car or truck corresponds to the core features (foreground). On the other hand, the digits are considered the spurious features (background). Group00, Group01, Group10, and Group11 denote the groups of images where the top half is a zero and the bottom half is a car, the top half is a one and the bottom half is a car, the top half is a zero and the bottom half is a truck, and the top half is a one and the bottom half is a truck, respectively.

4.1 Comparison between group-balanced and unbalanced datasets

We observe the relationship between the PID values and worst-group-accuracy for (i) an unbalanced dataset (which has spurious correlations) and (ii) a balanced dataset where the spurious correlation with the background is removed through sampling (balancing).

Problem Setup: We use group-balanced and unbalanced data for this part of the experiment. The balanced-unbalanced scenario arises from the four groups present in the dataset. For the Waterbird dataset, the majority groups consist of waterbirds with water backgrounds and landbirds with land backgrounds, and the other two combinations are the minority groups. Similarly, in the Dominoes dataset, cars with digit 0 and trucks with digit 1 are the majority groups, and the other two combinations are the minority groups. Worst-group-accuracy refers to the lowest accuracy across the groups; it is generally attained on a minority group when the model is trained on a biased (unbalanced) dataset. The group-balanced dataset has an equal number of samples in each group, resulting in unbiased model training. We begin by applying our autoencoder-based estimator, Spuriousness Disentangler, to both datasets and estimate the PID values separately for the background and foreground. This separation is done using the segmentation mask of the foreground for the Waterbird dataset. Next, we fine-tune the pre-trained ResNet-50 (He et al., 2016) model and calculate the worst-group-accuracy and the mean accuracy over all groups.
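The two evaluation metrics can be sketched as below. Groups are keyed by (label, background), matching Group00 through Group11. The function names and toy predictions are hypothetical, and treating the mean accuracy as the unweighted average of the per-group accuracies is an assumption, since the text does not say whether groups are weighted by size.

```python
from collections import defaultdict


def group_accuracies(labels, backgrounds, predictions):
    """Accuracy per group, keyed by (label, background) as in Group00 ... Group11."""
    correct, total = defaultdict(int), defaultdict(int)
    for y, b, y_hat in zip(labels, backgrounds, predictions):
        total[(y, b)] += 1
        correct[(y, b)] += int(y_hat == y)
    return {g: correct[g] / total[g] for g in total}


def worst_group_and_mean(labels, backgrounds, predictions):
    """Return (worst-group accuracy, unweighted mean of the per-group accuracies)."""
    acc = group_accuracies(labels, backgrounds, predictions)
    return min(acc.values()), sum(acc.values()) / len(acc)


# Toy test set: a model that mostly follows the background instead of the bird.
LABELS      = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]
BACKGROUNDS = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
PREDICTIONS = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
GROUP_ACC = group_accuracies(LABELS, BACKGROUNDS, PREDICTIONS)
WORST, MEAN = worst_group_and_mean(LABELS, BACKGROUNDS, PREDICTIONS)
```

In this toy case the majority groups are classified perfectly while the minority group (label 1, background 0) is always misclassified, so the worst-group accuracy collapses even though the mean stays moderate.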

Figure 6: Examples of Grad-CAM images: (a), (c) for the model trained with the group-unbalanced dataset and (b), (d) for the model trained with the group-balanced dataset. Observe that for the unbalanced dataset, the model places more emphasis (red regions) on the background, while in the balanced case, the foreground is more emphasized.
Figure 7: Samples of the background-mixed dataset (the first row is for addition and the second row corresponds to concatenation).
Figure 8: Bar plot of the redundant information (R), unique information in the background (Uniq-B) and foreground (Uniq-F), and synergistic information (Syn) for the unbalanced, concatenation, and addition datasets. Observe that Uniq-B decreases for both the addition and concatenation datasets compared to the unbalanced dataset (note that the scales differ so that small values remain visible).

Observations: Fig. 5 shows our findings regarding the PID values and the worst-group-accuracy for three datasets. Firstly, we observe that in the unbalanced datasets the unique information in the background is generally much higher than the other PID values, namely the unique information in the foreground, redundancy, and synergy. Secondly, the first two columns show a significant reduction of the unique information in the background, i.e., a reduction in spuriousness, when the dataset is balanced (an equal number of samples in all groups reduces the bias in the dataset), and all the PID values are then of the same order of magnitude. Next, from the last column of Fig. 5, we find that the worst-group-accuracies are lower for the unbalanced case and improve significantly when the datasets are balanced, which is consistent with low spuriousness in the dataset. Finally, Fig. 6 shows through Grad-CAM (Selvaraju et al., 2017) images that when the dataset is balanced, the model places more emphasis (the red regions) on the core features, namely the waterbird or landbird for the Waterbird dataset and the car or truck for the Dominoes dataset, while for the unbalanced dataset the background is more highlighted, which results in poor worst-group-accuracy.

4.2 Tradeoffs for varying levels of background mixing

Next, we look into the datasets for varying levels of background mixing to observe the tradeoffs between the spuriousness and the worst-group-accuracy.

Figure 9: Tradeoff between spuriousness, i.e., unique information in the background (Uniq-B), and worst-group-accuracy (%) for varying levels of background mixing. The worst-group-accuracy decreases as Uniq-B increases.

Problem Setup: Starting with the dataset creation, we mix two backgrounds at different levels. We consider two cases: (i) half of a land background is concatenated with half of a water background (named concatenation); and (ii) the whole image of a land background is summed with a water background (named addition). Similar techniques are applied for background mixing in the Dominoes dataset (see Fig. 7). Then, the foreground is superimposed on the background. Next, the PID values are calculated for the mixed background and the foreground using our estimator. We fine-tune the pre-trained ResNet-50 (He et al., 2016) on the whole images (mixed background with foreground) and evaluate the model on the normal test dataset (without any modification). One motivation for mixing the backgrounds is to remove the group bias that arises from the correlation between the background and the label in the dataset. This should help mitigate the spurious correlation, since the background no longer differs across groups.
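The two mixing schemes and the superimposition step can be sketched on small grayscale images stored as nested lists. Several details below are assumptions, since the text does not specify them: which half of each background is kept, clipping the sum to the 0 to 255 intensity range, and the function names themselves.

```python
def concatenate_backgrounds(land, water):
    """Left half of the land background next to the right half of the water background."""
    half = len(land[0]) // 2
    return [l_row[:half] + w_row[half:] for l_row, w_row in zip(land, water)]


def add_backgrounds(land, water, max_value=255):
    """Pixel-wise sum of the two backgrounds, clipped to the valid intensity range."""
    return [
        [min(l + w, max_value) for l, w in zip(l_row, w_row)]
        for l_row, w_row in zip(land, water)
    ]


def superimpose(foreground, mask, background):
    """Paste foreground pixels onto the background wherever the mask is 1."""
    return [
        [f if m else b for f, m, b in zip(f_row, m_row, b_row)]
        for f_row, m_row, b_row in zip(foreground, mask, background)
    ]


# Hypothetical 2x4 grayscale images.
LAND = [[10, 20, 30, 40], [50, 60, 70, 80]]
WATER = [[200, 210, 220, 230], [240, 250, 5, 15]]
BIRD = [[99, 99, 99, 99], [99, 99, 99, 99]]
MASK = [[0, 1, 1, 0], [0, 0, 0, 0]]   # segmentation mask of the foreground

CONCAT_BG = concatenate_backgrounds(LAND, WATER)
ADDED_BG = add_backgrounds(LAND, WATER)
CONCAT_IMAGE = superimpose(BIRD, MASK, CONCAT_BG)
```

Either mixed background contains both land and water content regardless of the label, which is why the background is expected to carry less label information after mixing.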

Observations: Firstly, in Fig. 8 we observe that the unique information in the background is prominent in the unbalanced case and decreases for both the addition and concatenation scenarios, which indicates a reduction in spuriousness when using the addition and concatenation datasets. Next, we observe a trend in Fig. 9 between the unique information in the background, i.e., spuriousness, and the worst-group-accuracy: as the unique information in the background increases, the worst-group-accuracy decreases. This trend is obtained for the unbalanced, addition, and concatenation datasets (lowest W.G. Acc. for unbalanced and highest W.G. Acc. for concatenation).

Table 1: Worst-group-accuracy (%) for different datasets

Dataset         Unbalanced   Balanced   Addition   Concatenation
Waterbird       29.75        86.60      89.88      92.99
Dominoes 1.0    90.73        91.42      94.665     96.45
Dominoes 2.0    79.79        86.94      85.51      87.35

Table 2: PID values for different variations of datasets

                Unbalanced                                   Balanced
Dataset         Redundancy   Uniq-B     Uniq-F     Synergy    Redundancy   Uniq-B     Uniq-F     Synergy
Waterbird       0.005677     0.166927   7.75E-07   0.018373   0.002635     0.000127   0.00909    0.02325
Dominoes 1.0    0.015414     0.172779   3.18E-09   0.006822   0.000296     0.000213   0.001261   0.01548
Dominoes 2.0    0.029422     0.56187    6.00E-06   0.006134   0.014792     0.246192   9.81E-07   0.022527

                Concatenation                                Addition
Dataset         Redundancy   Uniq-B     Uniq-F     Synergy    Redundancy   Uniq-B     Uniq-F     Synergy
Waterbird       0.000162     0.000069   0.005487   0.014036   0.000375     0.000096   0.005302   0.015623
Dominoes 1.0    0.000949     0.000006   0.01443    0.006292   0.000933     0.000036   0.020326   0.007635
Dominoes 2.0    0.000103     0.000003   0.047737   0.009618   0.000141     0.000014   0.042616   0.007464

Table 2 reports the PID values, i.e., redundant information, unique information in the background (Uniq-B), unique information in the foreground (Uniq-F), and synergistic information, for all three datasets and all variants of the datasets: unbalanced, balanced, concatenation, and addition. Table 1 shows the worst-group-accuracy for all types of datasets. Observe that the worst-group-accuracy is lowest for the unbalanced dataset and highest for the concatenation dataset.
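The trend between Uniq-B and worst-group-accuracy can be checked directly from the reported numbers. The sketch below copies the Uniq-B values from Table 2 and the accuracies from Table 1 for the three variants named in the observation above (unbalanced, addition, concatenation); the helper name `is_inverse_trend` is hypothetical.

```python
# Uniq-B values copied from Table 2.
UNIQ_B = {
    "Waterbird":    {"Unbalanced": 0.166927, "Addition": 0.000096, "Concatenation": 0.000069},
    "Dominoes 1.0": {"Unbalanced": 0.172779, "Addition": 0.000036, "Concatenation": 0.000006},
    "Dominoes 2.0": {"Unbalanced": 0.56187,  "Addition": 0.000014, "Concatenation": 0.000003},
}
# Worst-group-accuracy (%) copied from Table 1.
WG_ACC = {
    "Waterbird":    {"Unbalanced": 29.75, "Addition": 89.88,  "Concatenation": 92.99},
    "Dominoes 1.0": {"Unbalanced": 90.73, "Addition": 94.665, "Concatenation": 96.45},
    "Dominoes 2.0": {"Unbalanced": 79.79, "Addition": 85.51,  "Concatenation": 87.35},
}


def is_inverse_trend(uniq_b, wg_acc):
    """True if sorting variants by decreasing Uniq-B gives strictly increasing accuracy."""
    order = sorted(uniq_b, key=uniq_b.get, reverse=True)
    accs = [wg_acc[variant] for variant in order]
    return all(a < b for a, b in zip(accs, accs[1:]))


TREND = {name: is_inverse_trend(UNIQ_B[name], WG_ACC[name]) for name in UNIQ_B}
```

For these three variants the check holds on every dataset: lower Uniq-B coincides with higher worst-group-accuracy.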

5 Conclusion

Quantifying and explaining the spuriousness of a dataset can provide an efficient way to assess dataset quality without training a model for hours. In this work, we theoretically quantify spuriousness in a dataset with unique information, leveraging the mathematical tool of Partial Information Decomposition (PID). We demonstrate (with empirical validation) that unique information in the background can measure spuriousness, and we relate it to the worst-group-accuracy for various spurious correlation mitigation techniques. We also propose a novel autoencoder-based estimator for high-dimensional continuous image data, showing its superiority over classical estimators. However, there are some limitations. First, to estimate the unique information, one has to identify the spurious and core features of a given dataset, which is not always straightforward. Moreover, the estimation is highly data-dependent: a small change in the dataset can greatly affect the PID values. Nonetheless, formally quantifying spuriousness can lead to more effective bias mitigation strategies.

References

  • Banerjee et al. (2018) Banerjee, P. K., Rauh, J., and Montúfar, G. Computing the unique information. In IEEE International Symposium on Information Theory, pp.  141–145, 2018.
  • Bellman (1966) Bellman, R. Dynamic programming. Science, 153(3731):34–37, 1966.
  • Bertschinger et al. (2014) Bertschinger, N., Rauh, J., Olbrich, E., Jost, J., and Ay, N. Quantifying unique information. Entropy, 16(4):2161–2183, 2014.
  • Blackwell (1953) Blackwell, D. Equivalent comparisons of experiments. The annals of mathematical statistics, pp.  265–272, 1953.
  • Bradley et al. (2000) Bradley, P. S., Bennett, K. P., and Demiriz, A. Constrained k-means clustering. Microsoft Research, Redmond, 20(0):0, 2000.
  • Cover & Thomas (2012) Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, 2012.
  • Deng (2012) Deng, L. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • Dutta & Hamman (2023) Dutta, S. and Hamman, F. A review of partial information decomposition in algorithmic fairness and explainability. Entropy, 25(5):795, 2023.
  • Dutta et al. (2020) Dutta, S., Venkatesh, P., Mardziel, P., Datta, A., and Grover, P. An information-theoretic quantification of discrimination with exempt features. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  3825–3833, 2020.
  • Dutta et al. (2021) Dutta, S., Venkatesh, P., Mardziel, P., Datta, A., and Grover, P. Fairness under feature exemptions: Counterfactual and observational measures. IEEE Transactions on Information Theory, 67(10):6675–6710, 2021.
  • Ehrlich et al. (2022) Ehrlich, D. A., Schneider, A. C., Wibral, M., Priesemann, V., and Makkeh, A. Partial information decomposition reveals the structure of neural representations. arXiv preprint arXiv:2209.10438, 2022.
  • Ghouse et al. (2024) Ghouse, G., Rehman, A. U., and Bhatti, M. I. Understanding of causes of spurious associations: Problems and prospects. Journal of Statistical Theory and Applications, 23(1):44–66, 2024.
  • Guo et al. (2017) Guo, X., Liu, X., Zhu, E., and Yin, J. Deep clustering with convolutional autoencoders. In Neural Information Processing: 24th International Conference, ICONIP 2017, Guangzhou, China, November 14-18, 2017, Proceedings, Part II 24, pp.  373–382. Springer, 2017.
  • Haig (2003) Haig, B. D. What is a spurious correlation? Understanding Statistics: Statistical Issues in Psychology, Education, and the Social Sciences, 2(2):125–132, 2003.
  • Hamman & Dutta (2024a) Hamman, F. and Dutta, S. Demystifying local and global fairness trade-offs in federated learning using information theory. In International Conference on Learning Representations (ICLR), 2024a.
  • Hamman & Dutta (2024b) Hamman, F. and Dutta, S. A unified view of group fairness tradeoffs using partial information decomposition. arXiv preprint arXiv:2406.04562, 2024b.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.  770–778, 2016. doi: 10.1109/CVPR.2016.90.
  • Hotelling (1933) Hotelling, H. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.
  • Izmailov et al. (2022) Izmailov, P., Kirichenko, P., Gruver, N., and Wilson, A. G. On feature learning in the presence of spurious correlations. Advances in Neural Information Processing Systems, 35:38516–38532, 2022.
  • James et al. (2018) James, R. G., Ellison, C. J., and Crutchfield, J. P. dit: a Python package for discrete information theory. The Journal of Open Source Software, 3(25):738, 2018. doi: https://doi.org/10.21105/joss.00738.
  • Kirichenko et al. (2022) Kirichenko, P., Izmailov, P., and Wilson, A. G. Last layer re-training is sufficient for robustness to spurious correlations. arXiv preprint arXiv:2204.02937, 2022.
  • Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
  • Law & Jain (2006) Law, M. H. and Jain, A. K. Incremental nonlinear dimensionality reduction by manifold learning. IEEE transactions on pattern analysis and machine intelligence, 28(3):377–391, 2006.
  • Lee & Verleysen (2005) Lee, J. A. and Verleysen, M. Nonlinear dimensionality reduction of data manifolds with essential loops. Neurocomputing, 67:29–53, 2005.
  • Liang et al. (2023) Liang, P. P., Cheng, Y., Fan, X., Ling, C. K., Nie, S., Chen, R., Deng, Z., Allen, N., Auerbach, R., Mahmood, F., et al. Quantifying & modeling multimodal interactions: An information decomposition framework. Advances in Neural Information Processing Systems, 36, 2023.
  • Liang et al. (2024) Liang, P. P., Ling, C. K., Cheng, Y., Obolenskiy, A., Liu, Y., Pandey, R., Wilf, A., Morency, L.-P., and Salakhutdinov, R. Multimodal learning without labeled multimodal data: Guarantees and applications. International Conference on Learning Representations (ICLR), 2024.
  • Liu et al. (2023) Liu, S., Zhang, X., Sekhar, N., Wu, Y., Singhal, P., and Fernandez-Granda, C. Avoiding spurious correlations via logit correction. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=5BaqCFVh5qL.
  • Lynch et al. (2023) Lynch, A., Dovonon, G. J., Kaddour, J., and Silva, R. Spawrious: A benchmark for fine control of spurious correlation biases. arXiv preprint arXiv:2303.05470, 2023.
  • MacQueen et al. (1967) MacQueen, J. et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pp.  281–297. Oakland, CA, USA, 1967.
  • Moayeri et al. (2023) Moayeri, M., Wang, W., Singla, S., and Feizi, S. Spuriosity rankings: sorting data to measure and mitigate biases. Advances in Neural Information Processing Systems, 36:41572–41600, 2023.
  • Mohamadi et al. (2023) Mohamadi, S., Doretto, G., and Adjeroh, D. A. More synergy, less redundancy: Exploiting joint mutual information for self-supervised learning. arXiv preprint arXiv:2307.00651, 2023.
  • Pakman et al. (2021) Pakman, A., Nejatbakhsh, A., Gilboa, D., Makkeh, A., Mazzucato, L., Wibral, M., and Schneidman, E. Estimating the unique information of continuous variables. Advances in neural information processing systems, 34:20295–20307, 2021.
  • Raginsky (2011) Raginsky, M. Shannon meets blackwell and le cam: Channels, codes, and statistical experiments. In 2011 IEEE International Symposium on Information Theory Proceedings, pp.  1220–1224. IEEE, 2011.
  • Sadeghi & Armanfard (2023) Sadeghi, M. and Armanfard, N. Deep clustering with self-supervision using pairwise data similarities. Authorea Preprints, 2023.
  • Sagawa et al. (2019) Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.
  • Selvaraju et al. (2017) Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp.  618–626, 2017.
  • Shah et al. (2020) Shah, H., Tamuly, K., Raghunathan, A., Jain, P., and Netrapalli, P. The pitfalls of simplicity bias in neural networks. Advances in Neural Information Processing Systems, 33, 2020.
  • Singla & Feizi (2021) Singla, S. and Feizi, S. Salient imagenet: How to discover spurious features in deep learning? arXiv preprint arXiv:2110.04301, 2021.
  • Srivastava (2023) Srivastava, M. Addressing spurious correlations in machine learning models: A comprehensive review. OSF Prepr, 2023.
  • Stromberg et al. (2024) Stromberg, N., Ayyagari, R., Welfert, M., Koyejo, S., and Sankar, L. Robustness to subpopulation shift with domain label noise via regularized annotation of domains. arXiv preprint arXiv:2402.11039, 2024.
  • Tax et al. (2017) Tax, T., Mediano, P., and Shanahan, M. The partial information decomposition of generative neural network models. Entropy, 19(9):474, 2017.
  • Van der Maaten & Hinton (2008) Van der Maaten, L. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • Venkatesh & Schamberg (2022) Venkatesh, P. and Schamberg, G. Partial information decomposition via deficiency for multivariate gaussians. In 2022 IEEE International Symposium on Information Theory (ISIT), pp.  2892–2897. IEEE, 2022.
  • Venkatesh et al. (2023) Venkatesh, P., Gurushankar, K., and Schamberg, G. Capturing and interpreting unique information. In 2023 IEEE International Symposium on Information Theory (ISIT), pp.  2631–2636. IEEE, 2023.
  • Venkatesh et al. (2024) Venkatesh, P., Bennett, C., Gale, S., Ramirez, T., Heller, G., Durand, S., Olsen, S., and Mihalas, S. Gaussian partial information decomposition: Bias correction and application to high-dimensional data. Advances in Neural Information Processing Systems, 36, 2024.
  • Wah et al. (2011) Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • Wang et al. (2014) Wang, W., Huang, Y., Wang, Y., and Wang, L. Generalized autoencoder: A neural network framework for dimensionality reduction. In 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp.  496–503, 2014. doi: 10.1109/CVPRW.2014.79.
  • Wang et al. (2015) Wang, Y., Yao, H., Zhao, S., and Zheng, Y. Dimensionality reduction strategy based on auto-encoder. In Proceedings of the 7th International Conference on Internet Multimedia Computing and Service, pp.  1–4, 2015.
  • Williams & Beer (2010) Williams, P. L. and Beer, R. D. Nonnegative decomposition of multivariate information. arXiv preprint arXiv:1004.2515, 2010.
  • Wollstadt et al. (2023) Wollstadt, P., Schmitt, S., and Wibral, M. A rigorous information-theoretic definition of redundancy and relevancy in feature selection based on (partial) information decomposition. J. Mach. Learn. Res., 24:131–1, 2023.
  • Wu et al. (2023) Wu, S., Yuksekgonul, M., Zhang, L., and Zou, J. Discover and cure: concept-aware mitigation of spurious correlation. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  • Xie et al. (2016) Xie, J., Girshick, R., and Farhadi, A. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pp.  478–487. PMLR, 2016.
  • Yang et al. (2023) Yang, Y.-Y., Chou, C.-N., and Chaudhuri, K. Understanding rare spurious correlations in neural networks, 2023. URL https://openreview.net/forum?id=lrzX-rNuRvw.
  • Ye et al. (2023) Ye, H., Zou, J., and Zhang, L. Freeze then train: Towards provable representation learning under spurious correlations and feature noise. In 26th International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research, 206:8968–8990, 2023.
  • Ye et al. (2024) Ye, W., Zheng, G., Cao, X., Ma, Y., Hu, X., and Zhang, A. Spurious correlations in machine learning: A survey. arXiv preprint arXiv:2402.12715, 2024.

Appendix A Proof of Theorem 1

As a proof sketch, we first derive the following lemma.

Lemma 1.

$\mathrm{Uni}(Y{:}B|F)=0$ if and only if there exists a row-stochastic matrix $T \in [0,1]^{|\mathcal{F}|\times|\mathcal{B}|}$ such that $P_{YB}(Y=y,B=b)=\sum_{f\in\mathcal{F}}P_{YF}(Y=y,F=f)\,T(f,b)$ for all $y\in\mathcal{Y}$ and $b\in\mathcal{B}$.

Proof.

If $\mathrm{Uni}(Y{:}B|F)=0$, then we have $\min_{Q\in\Delta_P}\mathrm{I}_Q(Y;B|F)=0$, where $\Delta_P=\{Q\in\Delta : Q_{YF}(Y=y,F=f)=P_{YF}(Y=y,F=f) \text{ and } Q_{YB}(Y=y,B=b)=P_{YB}(Y=y,B=b)\}$. Thus, there exists a distribution $Q\in\Delta_P$ such that $Y$ and $B$ are independent given $F$ under the joint distribution $Q$. Then, we have

PYB(Y=y,B=b)subscript𝑃𝑌𝐵formulae-sequence𝑌𝑦𝐵𝑏\displaystyle P_{YB}(Y=y,B=b)italic_P start_POSTSUBSCRIPT italic_Y italic_B end_POSTSUBSCRIPT ( italic_Y = italic_y , italic_B = italic_b ) =QYB(Y=y,B=b)absentsubscript𝑄𝑌𝐵formulae-sequence𝑌𝑦𝐵𝑏\displaystyle=Q_{YB}(Y=y,B=b)= italic_Q start_POSTSUBSCRIPT italic_Y italic_B end_POSTSUBSCRIPT ( italic_Y = italic_y , italic_B = italic_b ) (3)
=fQYFB(Y=y,F=f,B=b)absentsubscript𝑓subscript𝑄𝑌𝐹𝐵formulae-sequence𝑌𝑦formulae-sequence𝐹𝑓𝐵𝑏\displaystyle=\sum_{f\in\mathcal{F}}Q_{YFB}(Y=y,F=f,B=b)= ∑ start_POSTSUBSCRIPT italic_f ∈ caligraphic_F end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_Y italic_F italic_B end_POSTSUBSCRIPT ( italic_Y = italic_y , italic_F = italic_f , italic_B = italic_b ) (4)
=fQB|YF(B=b|Y=y,F=f)QYF(Y=y,F=f)\displaystyle=\sum_{f\in\mathcal{F}}Q_{B|YF}(B=b|Y=y,F=f)Q_{YF}(Y=y,F=f)= ∑ start_POSTSUBSCRIPT italic_f ∈ caligraphic_F end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_B | italic_Y italic_F end_POSTSUBSCRIPT ( italic_B = italic_b | italic_Y = italic_y , italic_F = italic_f ) italic_Q start_POSTSUBSCRIPT italic_Y italic_F end_POSTSUBSCRIPT ( italic_Y = italic_y , italic_F = italic_f ) (5)
=(a)fQB|YF(B=b|Y=y,F=f)PYF(Y=y,F=f)\displaystyle\overset{(a)}{=}\sum_{f\in\mathcal{F}}Q_{B|YF}(B=b|Y=y,F=f)P_{YF}% (Y=y,F=f)start_OVERACCENT ( italic_a ) end_OVERACCENT start_ARG = end_ARG ∑ start_POSTSUBSCRIPT italic_f ∈ caligraphic_F end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_B | italic_Y italic_F end_POSTSUBSCRIPT ( italic_B = italic_b | italic_Y = italic_y , italic_F = italic_f ) italic_P start_POSTSUBSCRIPT italic_Y italic_F end_POSTSUBSCRIPT ( italic_Y = italic_y , italic_F = italic_f ) (6)
=(b)fQB|F(B=b|F=f)PYF(Y=y,F=f)𝑏subscript𝑓subscript𝑄conditional𝐵𝐹𝐵conditional𝑏𝐹𝑓subscript𝑃𝑌𝐹formulae-sequence𝑌𝑦𝐹𝑓\displaystyle\overset{(b)}{=}\sum_{f\in\mathcal{F}}Q_{B|F}(B=b|F=f)P_{YF}(Y=y,% F=f)start_OVERACCENT ( italic_b ) end_OVERACCENT start_ARG = end_ARG ∑ start_POSTSUBSCRIPT italic_f ∈ caligraphic_F end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_B | italic_F end_POSTSUBSCRIPT ( italic_B = italic_b | italic_F = italic_f ) italic_P start_POSTSUBSCRIPT italic_Y italic_F end_POSTSUBSCRIPT ( italic_Y = italic_y , italic_F = italic_f ) (7)
=(c)fT(f,b)PYF(Y=y,F=f).𝑐subscript𝑓𝑇𝑓𝑏subscript𝑃𝑌𝐹formulae-sequence𝑌𝑦𝐹𝑓\displaystyle\overset{(c)}{=}\sum_{f\in\mathcal{F}}T(f,b)P_{YF}(Y=y,F=f).start_OVERACCENT ( italic_c ) end_OVERACCENT start_ARG = end_ARG ∑ start_POSTSUBSCRIPT italic_f ∈ caligraphic_F end_POSTSUBSCRIPT italic_T ( italic_f , italic_b ) italic_P start_POSTSUBSCRIPT italic_Y italic_F end_POSTSUBSCRIPT ( italic_Y = italic_y , italic_F = italic_f ) . (8)

Here, (a) holds because PYF=QYFsubscript𝑃𝑌𝐹subscript𝑄𝑌𝐹P_{YF}=Q_{YF}italic_P start_POSTSUBSCRIPT italic_Y italic_F end_POSTSUBSCRIPT = italic_Q start_POSTSUBSCRIPT italic_Y italic_F end_POSTSUBSCRIPT for all QΔP𝑄subscriptΔ𝑃Q\in\Delta_{P}italic_Q ∈ roman_Δ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT, (b) holds because under joint distribution Q𝑄Qitalic_Q, variables Y𝑌Yitalic_Y and B𝐵Bitalic_B are independent given F𝐹Fitalic_F, and (c) simply chooses T(f,b)=QB|F(B=b|F=f)𝑇𝑓𝑏subscript𝑄conditional𝐵𝐹𝐵conditional𝑏𝐹𝑓T(f,b)=Q_{B|F}(B=b|F=f)italic_T ( italic_f , italic_b ) = italic_Q start_POSTSUBSCRIPT italic_B | italic_F end_POSTSUBSCRIPT ( italic_B = italic_b | italic_F = italic_f ) which is a function of (f,b)𝑓𝑏(f,b)( italic_f , italic_b ) and will lead to a row-stochastic matrix T𝑇Titalic_T since bT(f,b)=bQB|F(B=b|F=f)=1.subscript𝑏𝑇𝑓𝑏subscript𝑏subscript𝑄conditional𝐵𝐹𝐵conditional𝑏𝐹𝑓1\sum_{b\in\mathcal{B}}T(f,b)=\sum_{b\in\mathcal{B}}Q_{B|F}(B=b|F=f)=1.∑ start_POSTSUBSCRIPT italic_b ∈ caligraphic_B end_POSTSUBSCRIPT italic_T ( italic_f , italic_b ) = ∑ start_POSTSUBSCRIPT italic_b ∈ caligraphic_B end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_B | italic_F end_POSTSUBSCRIPT ( italic_B = italic_b | italic_F = italic_f ) = 1 .

Next, we prove the converse. Suppose a row-stochastic matrix $T$ exists such that:

$$P_{YB}(Y=y,B=b)=\sum_{f\in\mathcal{F}}T(f,b)\,P_{YF}(Y=y,F=f).$$

Now, we can define a joint distribution $Q^{*}$ such that:

$$Q^{*}(Y=y,F=f,B=b)=P_{YF}(Y=y,F=f)\,T(f,b). \tag{9}$$

We can show that $Q^{*}$ is a valid probability distribution since $T$ is row-stochastic:

\begin{align}
\sum_{y\in\mathcal{Y}}\sum_{b\in\mathcal{B}}\sum_{f\in\mathcal{F}}Q^{*}(Y=y,F=f,B=b) &= \sum_{y\in\mathcal{Y}}\sum_{b\in\mathcal{B}}\sum_{f\in\mathcal{F}}P_{YF}(Y=y,F=f)\,T(f,b) \notag\\
&= \sum_{y\in\mathcal{Y}}\sum_{f\in\mathcal{F}}P_{YF}(Y=y,F=f)\left(\sum_{b\in\mathcal{B}}T(f,b)\right) \notag\\
&= \sum_{y\in\mathcal{Y}}\sum_{f\in\mathcal{F}}P_{YF}(Y=y,F=f)=1. \tag{10}
\end{align}

Also, we can show that $Q^{*}\in\Delta_P$ since:

$$Q^{*}_{YB}(Y=y,B=b)=\sum_{f\in\mathcal{F}}P_{YF}(Y=y,F=f)\,T(f,b)=P_{YB}(Y=y,B=b), \tag{11}$$

which holds since such a row-stochastic matrix $T$ exists. Also, we have:

$$Q^{*}_{YF}(Y=y,F=f)=\sum_{b\in\mathcal{B}}P_{YF}(Y=y,F=f)\,T(f,b)=P_{YF}(Y=y,F=f), \tag{12}$$

which holds since $T$ is row-stochastic.

Moreover, under $Q^{*}$ the conditional distribution $Q^{*}_{B|YF}(B=b|Y=y,F=f)=T(f,b)$ does not depend on $y$, so $Y$ and $B$ are independent given $F$ and $\mathrm{I}_{Q^{*}}(Y;B|F)=0$. Then, $\mathrm{Uni}(Y{:}B|F)=\min_{Q\in\Delta_P}\mathrm{I}_Q(Y;B|F)\leq\mathrm{I}_{Q^{*}}(Y;B|F)=0$. Since conditional mutual information is non-negative, it follows that $\mathrm{Uni}(Y{:}B|F)=0$.

Next, it can be shown that the existence of such a row-stochastic matrix is equivalent to Blackwell Sufficiency as per Definition 2 from (Blackwell, 1953).
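As a numerical sanity check of the converse direction, the following minimal sketch builds $Q^{*}$ from Eq. (9) for a toy example and confirms that it matches both marginals and has zero conditional mutual information. The joint distribution `P_YF` and the matrix `T` are hypothetical values chosen only for illustration, and `P_YB` is generated from them so that the premise of the converse holds by construction.

```python
from itertools import product
from math import log2

Y_VALS, F_VALS, B_VALS = (0, 1), (0, 1), (0, 1)

# Hypothetical joint distribution of (Y, F); illustrative values only.
P_YF = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Hypothetical row-stochastic matrix T(f, b): each row sums to 1.
T = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}

# P_YB generated by passing F through T, so the premise holds by construction.
P_YB = {(y, b): sum(P_YF[(y, f)] * T[f][b] for f in F_VALS)
        for y, b in product(Y_VALS, B_VALS)}

# Eq. (9): Q*(y, f, b) = P_YF(y, f) * T(f, b).
Q_star = {(y, f, b): P_YF[(y, f)] * T[f][b]
          for y, f, b in product(Y_VALS, F_VALS, B_VALS)}


def marginal(joint, keep):
    """Sum out every coordinate whose index is not listed in `keep`."""
    out = {}
    for key, p in joint.items():
        reduced = tuple(key[i] for i in keep)
        out[reduced] = out.get(reduced, 0.0) + p
    return out


def conditional_mi_y_b_given_f(joint):
    """I(Y;B|F) in bits for a joint distribution indexed by (y, f, b)."""
    q_f = marginal(joint, (1,))
    q_yf = marginal(joint, (0, 1))
    q_fb = marginal(joint, (1, 2))
    total = 0.0
    for (y, f, b), p in joint.items():
        if p > 0:
            total += p * log2(p * q_f[(f,)] / (q_yf[(y, f)] * q_fb[(f, b)]))
    return total


print("sum of Q*:", sum(Q_star.values()))
print("I_Q*(Y;B|F):", conditional_mi_y_b_given_f(Q_star))
```

The check only covers one toy instance; it illustrates the construction in the proof and is not a substitute for it.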

Appendix B Appendix to Experiments

This section includes additional results and figures for a more comprehensive understanding of our work.

B.1 Data

We consider the Waterbird (Wah et al., 2011) and Dominoes datasets. For a summary of the datasets, we refer the reader to Tables 3, 4, and 5.

Table 3: Summary of the Waterbird Dataset
Waterbird Group00 Group01 Group10 Group11
Train 3498 184 56 1057
Validation 467 466 133 133
Test 2255 2255 642 642
Total 6220 2905 831 1832
Table 4: Summary of the Dominoes 1.0 Dataset
Dominoes 1.0 Group00 Group01 Group10 Group11
Train 3750 1250 1250 3750
Test 473 507 507 473
Total 4223 1772 1757 4208
Table 5: Summary of the Dominoes 2.0 Dataset
Dominoes 2.0 Group00 Group01 Group10 Group11
Train 3000 500 1250 3000
Test 245 490 245 490
Total 3245 990 1495 3490

B.2 Experimental Setup

Table 6: Architecture details of autoencoder for Dominoes Dataset
Sl. No. Layer Filter No. Kernel Size Stride Padding Output Padding Output Shape Param No.
1 Conv2d 32 5 2 2 - (32,16,16) 2432
2 LeakyReLU - - - - - (32,16,16) 0
3 BatchNorm2d - - - - - (32,16,16) 64
4 Conv2d 64 5 2 2 - (64,8,8) 51264
5 LeakyReLU - - - - - (64,8,8) 0
6 BatchNorm2d - - - - - (64,8,8) 128
7 Conv2d 128 3 2 0 - (128,3,3) 73856
8 LeakyReLU - - - - - (128,3,3) 0
9 Flatten - - - - - 1152 0
10 Linear (embedding) - - - - - 10 11530
11 Clustering Layer - - - - - 10 100
12 Linear (deembedding) - - - - - 1152 12672
13 LeakyReLU - - - - - 1152 0
14 ConvTranspose2d 64 3 2 0 1 (64,8,8) 73792
15 LeakyReLU - - - - - (64,8,8) 0
16 BatchNorm2d - - - - - (64,8,8) 128
17 ConvTranspose2d 32 5 2 2 1 (32,16,16) 51232
18 LeakyReLU - - - - - (32,16,16) 0
19 BatchNorm2d - - - - - (32,16,16) 64
20 ConvTranspose2d 3 5 2 2 1 (3,32,32) 2403
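The output shapes and parameter counts in Table 6 can be cross-checked with the standard convolution and transposed-convolution size formulas (dilation 1). The sketch below is a plain arithmetic check, not the authors' code. It assumes a 3-channel 32×32 input, which is inferred from the final output shape (3,32,32) and the 2432 parameters of the first layer, and it assumes every convolutional and linear layer has a bias term.

```python
def conv2d_out(size, kernel, stride, padding):
    """Spatial output size of a 2D convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose2d_out(size, kernel, stride, padding, output_padding):
    """Spatial output size of a 2D transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def conv_params(c_in, c_out, kernel):
    """Weights plus one bias per output channel (same count for Conv2d and ConvTranspose2d)."""
    return c_in * c_out * kernel * kernel + c_out


def linear_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


# Rows 1, 4, 7 of Table 6: (in_channels, out_channels, kernel, stride, padding).
ENCODER = [(3, 32, 5, 2, 2), (32, 64, 5, 2, 2), (64, 128, 3, 2, 0)]
# Rows 14, 17, 20: (in_channels, out_channels, kernel, stride, padding, output_padding).
DECODER = [(128, 64, 3, 2, 0, 1), (64, 32, 5, 2, 2, 1), (32, 3, 5, 2, 2, 1)]

size = 32  # assumed input resolution (not stated explicitly in the table)
encoder_sizes, encoder_params = [], []
for c_in, c_out, k, s, p in ENCODER:
    size = conv2d_out(size, k, s, p)
    encoder_sizes.append(size)
    encoder_params.append(conv_params(c_in, c_out, k))

flat = ENCODER[-1][1] * size * size          # Flatten (row 9)
embedding_params = linear_params(flat, 10)   # Linear embedding (row 10)
deembedding_params = linear_params(10, flat) # Linear deembedding (row 12)

decoder_sizes, decoder_params = [], []
for c_in, c_out, k, s, p, op in DECODER:
    size = conv_transpose2d_out(size, k, s, p, op)
    decoder_sizes.append(size)
    decoder_params.append(conv_params(c_in, c_out, k))

print(encoder_sizes, encoder_params)
print(flat, embedding_params, deembedding_params)
print(decoder_sizes, decoder_params)
```

Under these assumptions the computed values agree with the corresponding rows of Table 6.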
Figure 10: Architecture of the proposed autoencoder for the Waterbird dataset. Here, BN stands for batch normalization.

Calculating PIDs: Calculating the PID terms (redundancy, unique information, and synergy) involves three main steps. First, the clusters for the given input images are estimated, which requires the autoencoder. As shown in Fig. 4, a given image is separated into two images: one contains the core features (foreground) and the other contains the spurious features (background). For the Dominoes dataset, the core features are the images of cars or trucks and the spurious features are the images of zeros and ones. The clusters are computed for each set of features, and the output of the clustering layer gives the desired clusters. The architecture details of the autoencoder are shown in Table 6 for the Dominoes dataset and in Fig. 10 for the Waterbird dataset. The Waterbird autoencoder is more complex in order to handle the more challenging nature of this dataset compared to Dominoes. The architecture is inspired by (Sadeghi & Armanfard, 2023). To obtain the clusters, the model is first pretrained with only the mean squared error (MSE) loss. The model is then trained again with a weighted sum of the MSE loss and a KL divergence loss. The weights of the clustering layer are initialized with the cluster centers obtained by k-means clustering. For the Dominoes dataset, the hyperparameters are as follows: batch size 8, learning rate 0.001, CosineAnnealingLR scheduler, Adam optimizer with weight decay 0.0001, 100 pretraining epochs, and 50 epochs of subsequent training. The subsequent training is terminated early if the change in label assignments between two consecutive updates of the target distribution is less than 0.001. For the Waterbird dataset, the hyperparameters are as follows: batch size 64, learning rate 0.001, CosineAnnealingLR scheduler, Adam optimizer with weight decay 0.0001, 150 pretraining epochs, and 50 epochs of subsequent training. Second, the clusters obtained for the foreground and the background, together with the binary labels, are used to estimate the joint distribution using histograms. Third, the PID terms are estimated with the DIT (James et al., 2018) package.
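The histogram step can be sketched as follows: count the co-occurrences of the label, the foreground cluster index, and the background cluster index, then normalize. This is a minimal illustration with hypothetical toy cluster assignments, not the authors' pipeline. It also computes only the plain mutual information terms; the unique information itself requires the constrained minimization over $\Delta_P$, which the paper delegates to the DIT package and which is not reproduced here.

```python
from collections import Counter
from math import log2


def joint_histogram(*columns):
    """Plug-in estimate of a joint distribution from aligned discrete columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return {outcome: c / n for outcome, c in counts.items()}


def mutual_information(joint_xy):
    """I(X;Y) in bits for a joint distribution indexed by (x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint_xy.items() if p > 0)


# Hypothetical toy data: binary labels and cluster indices for eight images.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
foreground_clusters = [0, 0, 0, 1, 1, 1, 1, 0]   # clusters of core features F
background_clusters = [0, 0, 1, 1, 1, 1, 0, 0]   # clusters of spurious features B

# Joint distribution over (Y, F, B), as would be handed to a PID estimator.
P_YFB = joint_histogram(labels, foreground_clusters, background_clusters)

# Pairwise information that each feature set carries about the label.
I_YF = mutual_information(joint_histogram(labels, foreground_clusters))
I_YB = mutual_information(joint_histogram(labels, background_clusters))

print("I(Y;F) =", I_YF, "bits; I(Y;B) =", I_YB, "bits")
```

With few samples per histogram bin, plug-in estimates of this kind are biased, which is one reason the number of clusters is kept small relative to the dataset size.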

Calculating Accuracies: To calculate the worst-group accuracy for the different variations of the datasets, we fine-tune a pre-trained ResNet-50 (He et al., 2016) model. The worst-group-accuracy is defined as the accuracy of the minority group, i.e., the group with the lowest number of training samples (see Tables 3 and 5). For the Waterbird dataset, group10 has the fewest training samples, and for the Dominoes 2.0 dataset, group01 has the fewest. For the Dominoes 1.0 dataset, since group01 and group10 have the same number of training and test samples, the worst-group-accuracy is calculated as the average of the accuracies of these two groups. For the Dominoes dataset, the hyperparameters are as follows: batch size 8, learning rate 0.0001, CosineAnnealingLR scheduler, stochastic gradient descent (SGD) optimizer with weight decay 0.0001, binary cross-entropy loss, and 100 epochs. The training set is split into two subsets: 70% for training and 30% for validation. For the Waterbird dataset, the batch size is 64 and the other hyperparameters are the same as for Dominoes. For the addition and concatenation datasets, which are created accordingly, the numbers of sample images in the train and test sets are distributed as in Tables 3, 4, and 5. For the balanced dataset, we use a weighted random sampler whose weights are selected according to the proportion of the groups. All experiments are executed on an NVIDIA RTX A4500.
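The definition of worst-group-accuracy used here (accuracy of the group with the fewest training samples, averaged over ties) can be written as a short helper. The training counts below are taken from Tables 3 and 4; the per-group accuracies are hypothetical placeholders, not results from the paper, and the function name is chosen only for this sketch.

```python
def worst_group_accuracy(train_counts, group_accuracy):
    """Mean test accuracy over the group(s) with the fewest training samples."""
    fewest = min(train_counts.values())
    minority = [g for g, n in train_counts.items() if n == fewest]
    value = sum(group_accuracy[g] for g in minority) / len(minority)
    return value, minority


# Training-set group sizes from Table 3 (Waterbird) and Table 4 (Dominoes 1.0).
waterbird_train = {"00": 3498, "01": 184, "10": 56, "11": 1057}
dominoes1_train = {"00": 3750, "01": 1250, "10": 1250, "11": 3750}

# Hypothetical per-group test accuracies, for illustration only.
example_accuracy = {"00": 0.98, "01": 0.70, "10": 0.60, "11": 0.95}

wb_value, wb_groups = worst_group_accuracy(waterbird_train, example_accuracy)
d1_value, d1_groups = worst_group_accuracy(dominoes1_train, example_accuracy)

print("Waterbird minority group(s):", wb_groups, "accuracy:", wb_value)
print("Dominoes 1.0 minority group(s):", d1_groups, "accuracy:", d1_value)
```

For Waterbird the helper selects group10 alone, while for Dominoes 1.0 it averages over group01 and group10, matching the tie-handling rule described above.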

B.3 Additional Results

Fig. 11 shows Grad-CAM (Selvaraju et al., 2017) visualizations for models trained with the unbalanced, balanced, addition, and concatenation datasets (from left, 'a': unbalanced, 'b, c': balanced, 'd': addition, and 'e': concatenation). Observe that with the dataset-based mitigation techniques, the model focuses on the foreground (red region), whereas in the unbalanced case the model emphasizes the background. There are also cases where the model does not give importance to any portion of the image (see Fig. 11b).

Figure 11: Grad-CAM visualization for different variations of models trained with different datasets.