Kang Du (University of Utah) and Yu Xiang (University of Utah)

Low-Rank Approximation of Structural Redundancy for Self-Supervised Learning

Abstract

We study the data-generating mechanism for reconstructive SSL to shed light on its effectiveness. With an infinite amount of labeled samples, we provide a sufficient and necessary condition for perfect linear approximation. The condition reveals a full-rank component that preserves the label classes of $Y$ , along with a redundant component. Motivated by the condition, we propose to approximate the redundant component by a low-rank factorization and measure the approximation quality by introducing a new quantity $\varepsilon_{s}$ , parameterized by the rank of factorization $s$ . We incorporate $\varepsilon_{s}$ into the excess risk analysis under both linear regression and ridge regression settings, where the latter regularization approach is to handle scenarios when the dimension of the learned features is much larger than the number of labeled samples $n$ for downstream tasks. We design three stylized experiments to compare SSL with supervised learning under different settings to support our theoretical findings.

Keywords: Self-supervised learning, redundancy, low-rank approximation, ridge regression.

1 Introduction

Reconstructive self-supervised learning (SSL) has been highly successful in various fields (Pathak et al., 2016; Vincent et al., 2010; Radford et al., 2018; Devlin et al., 2018), where the theme is to extract representations from unlabeled data that are potentially useful for downstream tasks. One of the major advantages of SSL is its significantly reduced dependency on labeled data. Despite abundant empirical evidence, the theoretical understanding of the performance of SSL under limited labeled data is still insufficient.

In reconstructive SSL, a pretext task is designed to predict a target $X_{2}$ with input features $X_{1}$, which yields the learned representation $\psi(X_{1})$. Then, the downstream task is to predict the target $Y$ using $\psi(X_{1})$. Whether the learned representation is useful for the downstream task relies on the connections between the pretext and downstream tasks. To bridge the pretext and downstream tasks, the conditional independence (CI) assumption, namely $X_{1}\perp\!\!\!\perp X_{2}\,|\,Y$, has been studied in the seminal work (Lee et al., 2021). For the classification setting, they show that CI is a sufficient condition for a linear predictor to be optimal for the downstream task, that is, $\psi(X_{1})$ can linearly predict $Y$ perfectly with an infinite number of samples available for the task. Motivated by this key observation, they provide theoretical guarantees showing the superior sample complexity of SSL under general approximate conditional independence settings. However, a fundamental theoretical question for understanding reconstructive SSL still remains:

What is the sufficient and necessary condition on $(X_{1},X_{2},Y)$ , in the classification setting, for a linear predictor to be optimal for the downstream task?

To address this question, it is helpful to express $X_{2}$ as $X_{2}=h(X_{1},Y)+N$, where $(X_{1},Y)$ is for an arbitrary supervised learning task and $N:=X_{2}-\operatorname{\sf E}[X_{2}|X_{1},Y]$. With this expression, roughly speaking, the target $Y$ can be decoded from $X_{2}$ if $h$ is invertible in some sense. We formalize this notion of invertibility by focusing on classification problems. Our formulation allows for general dependency between $X_{2}$ and $X_{1}$ when conditioning on $Y$. Thus there are features in the learned representation $\psi(X_{1})$ that are redundant for the prediction of $Y$. For instance, for image classification problems, the redundant features may come from the background in the image; if the object of interest is surrounded by other objects in the background, a pretext task of predicting blocked patches of the image may mistakenly extract too many features from the background (Pathak et al., 2016). Without any constraints, a large percentage of redundant features can potentially make SSL fail. To this end, we introduce a quantity $\varepsilon_{s}$, indexed by rank $s$, for a low-rank approximation of the redundancy in the learned representation. We show how $\varepsilon_{s}$ affects the performance of SSL through both our theoretical analysis and our experiments.

Our main contributions are summarized below.

1. Under the classification setting, we characterize in Section 3 a sufficient and necessary condition for a linear predictor to be optimal in the downstream task.

2. In Section 4, we introduce a low-rank approximation quantity to characterize the redundancy in the learned representation.

3. Based on the low-rank approximation quantity, we derive finite sample bounds on the excess risk and the corresponding sample complexity for both ordinary least squares and ridge regression estimators in Section 5.

4. In Section 6.1, we design a simulation setting to demonstrate the effectiveness of the low-rank approximation. Our sufficient and necessary condition is partially verified through two computer vision tasks in Section 6.2.

1.1 Related work

Reconstructive SSL is focused on recovering deliberately concealed information in the data. In computer vision, examples include the prediction of blocked patches (Pathak et al., 2016), recovering the color (Zhang et al., 2016), denoising (Vincent et al., 2010), and identifying the rotated angle (Gidaris et al., 2018), while the simple scheme of next word prediction is widely adopted in NLP (Radford et al., 2018; Devlin et al., 2018). From the theoretical perspective, (Saunshi et al., 2020) and (Wei et al., 2021) study how pre-trained language models yield useful representation for downstream tasks. For computer vision tasks, (Pathak et al., 2016) provides a theoretical understanding of features learned by auto-encoders under a multi-view data assumption. Under a general formulation of reconstructive SSL, (Lee et al., 2021) shows that CI is sufficient for a linear predictor to be optimal in the downstream task and provides finite sample analysis. Since CI often fails to hold in practical settings, (Teng et al., 2022) proposes to modify the unlabeled data to make CI hold. Their theoretical analysis suggests that the modification is hurtful rather than helpful for the performance of SSL. The other popular type of SSL is called contrastive SSL, where the goal is to learn representations that make different views of the same data point closer. The CI assumption has been adopted in (Arora et al., 2019; Tosh et al., 2021) to provide theoretical guarantees for contrastive learning. In the context of contrastive SSL, CI is a natural assumption since, ideally, two views are expected to share less information given the label. The literature on SSL is vast and we refer the readers to (Gui et al., 2023; Ozbulak et al., 2023) for detailed reviews.

1.2 Notation

Throughout the paper, $\lVert\cdot\rVert$ denotes the $l_{2}$ norm for vectors or Frobenius norm for matrices. We use $\bm{0}$ and $\bm{1}$ to denote vectors or matrices of zeros and ones, respectively. For a full (column) rank matrix $A\in\mathbb{R}^{m\times n}$ with $n<m$ , we use $A^{\dagger}$ to denote its (left) pseudoinverse. Let $\mathop{\rm Cov}\nolimits(X)$ denote the covariance matrix of a random vector $X$ and $\mathop{\rm Cov}\nolimits(X_{1},X_{2})$ denote that of two random vectors $X_{1}$ and $X_{2}$ . We use $\widetilde{\mathcal{O}}$ to hide log factors and $\lesssim$ to hide constants in inequalities. We use ${\bf I}_{d}$ to denote the identity matrix of size $d\times d$ . A random vector $X\in\mathbb{R}^{d}$ is said to be $\sigma^{2}$ -sub-Gaussian if $\operatorname{\sf E}[X]=\bm{0}$ and $\operatorname{\sf E}[e^{tu^{\top}X}]\leq e^{\frac{t^{2}\sigma^{2}}{2}}$ for any $t\in\mathbb{R}$ and $u\in\mathbb{R}^{d}$ such that $\lVert u\rVert=1$ .

2 Problem Formulation: Reconstructive SSL

Consider $(X_{1},X_{2},Y)\in\mathcal{X}_{1}\times\mathcal{X}_{2}\times\mathcal{Y}$ where $X_{2}\in\mathbb{R}^{d_{2}}$ and $Y$ are the target variables for the pretext and downstream tasks, respectively, and $X_{1}\in\mathbb{R}^{d_{1}}$ is a vector of features shared by the two prediction tasks. We focus on the classification setting for the downstream task, i.e., $Y$ is categorical. For regression problems, one can consider the continuous target variable being discretized to a set of values. We use $\bar{Y}\in\bar{\mathcal{Y}}=\{y_{1},\ldots,y_{p+1}\}$ to denote the original label variable and $Y=(\mathbbm{1}_{\bar{Y}=y_{1}},\ldots,\mathbbm{1}_{\bar{Y}=y_{p}})^{\top}$ to denote its one-hot encoding with one class excluded to avoid multicollinearity, since $\sum_{i=1}^{p+1}\mathbbm{1}_{\bar{Y}=y_{i}}=1$; we will simply refer to $Y$ as the one-hot encoding of $\bar{Y}$ throughout this work. We assume $p<d_{2}$ throughout the work. For simplicity, we assume that the optimal predictors for different classes of $Y$ are not linearly dependent, i.e., $\mathop{\rm Cov}\nolimits(\operatorname{\sf E}[Y|X_{1}])$ has full rank; otherwise, certain classes of $Y$ can be hidden to make it hold.

Concretely, we consider the following reconstructive SSL procedure.

1. Pretext task: Given unlabeled data, predict $X_{2}$ using $X_{1}$ under some function class $\Psi$, i.e., estimate $\psi^{*}:=\arg\min_{\psi\in\Psi}\operatorname{\sf E}[\lVert X_{2}-\psi(X_{1})\rVert^{2}]$.

2. Downstream task: Given $n$ labeled data, regress $Y$ on the learned representation $\psi^{*}(X_{1})$ using simple regression functions such as linear or ridge regression.

Since there is often a large amount of unlabeled data and one can adopt deep neural networks to achieve universal approximation, we fix $\psi^{*}(x):=\operatorname{\sf E}[X_{2}|X_{1}=x]$ and focus on analyzing the downstream task. Due to the nature of the small (labeled) sample size of SSL, the function class for the downstream task is often assumed to have lower complexity compared to $\Psi$ (e.g., smaller parameter space). For theoretical analysis, we consider the class of all linear functions for the downstream task similarly as in (Lee et al., 2021). In practice, the advantage of SSL over supervised learning (SL) is more significant when the labeled sample size $n$ is relatively small, in which case the dimension of $\psi^{*}$ can be larger than $n$ . To avoid the downstream task being ill-posed, we adopt the ridge estimator. To measure the gap between the SSL prediction and the optimal predictor $\operatorname{\sf E}[Y|X_{1}]$ in infinite and finite samples, respectively, we define the approximation error and excess risk.

Definition 2.1.

Define the approximation error of SSL as $\mathsf{error}_{\text{apx}}^{*}:=\min_{\beta}\mathsf{error}_{\text{apx}}(\beta)$ , where

\mathsf{error}_{\text{apx}}(\beta):=\operatorname{\sf E}\left[\left\lVert\operatorname{\sf E}[Y|X_{1}]-\beta\psi^{*}(X_{1})\right\rVert^{2}\right]

(1)

with $\psi^{*}(x)=\operatorname{\sf E}[X_{2}|X_{1}=x]$ , the optimal predictor of $X_{2}$ given $X_{1}$ .

Note that $\psi^{*}(x)=\operatorname{\sf E}[X_{2}|X_{1}=x]$ can be ensured by a function class with universal approximation power such as deep neural networks.

Definition 2.2.

We say there is an exact matching between $Y$ and $X_{2}$ given $X_{1}$ if $\mathsf{error}_{\text{apx}}^{*}=0$ .

For simplicity, we will omit the intercept $b(\beta):=\operatorname{\sf E}[Y]-\beta\operatorname{\sf E}[X_{2}]$ throughout the work. The performance of the downstream task is usually quantified through the so-called excess risk defined with respect to the finite sample analysis. Denote $X:=\psi^{*}(X_{1})=\operatorname{\sf E}[X_{2}|X_{1}]$ . Let $\bm{X}_{1}\in\mathbb{R}^{n\times d_{1}}$ and $\bm{Y}\in\mathcal{Y}^{n\times p}$ be the labeled data, and $\bm{X}:=\psi^{*}(\bm{X}_{1})\in\mathbb{R}^{n\times d_{2}}$ denote the learned representation from pretraining. For the downstream task and $\lambda\geq 0$ , let

\hat{\beta}_{\lambda}:=\arg\min_{\beta}\frac{1}{n}\lVert\bm{Y}-\bm{X}\beta^{\top}\rVert^{2}+\lambda\lVert\beta\rVert^{2}=\bm{Y}^{\top}\bm{X}(\bm{X}^{\top}\bm{X}+\lambda n{\bf I}_{d_{2}})^{-1}.
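As an illustration of the closed form for $\hat{\beta}_{\lambda}$, the following NumPy sketch computes the estimator on synthetic data and evaluates the gradient of the penalized objective at the solution. The function name `ridge_beta`, the sizes, and the random data are assumptions made for this example only; they are not part of the paper's experiments.

```python
import numpy as np

def ridge_beta(X, Y, lam):
    """Closed-form ridge estimator beta_hat in R^{p x d2}.

    X: (n, d2) learned representations, Y: (n, p) one-hot labels, lam >= 0.
    """
    n, d2 = X.shape
    gram = X.T @ X + lam * n * np.eye(d2)
    # Solve gram @ Z = X^T Y for Z = beta^T rather than forming an explicit inverse.
    return np.linalg.solve(gram, X.T @ Y).T

rng = np.random.default_rng(0)
n, d2, p, lam = 50, 8, 3, 0.1            # hypothetical sizes and penalty
X = rng.normal(size=(n, d2))
# One-hot encoding of p + 1 classes with the last class dropped.
Y = np.eye(p + 1)[rng.integers(0, p + 1, size=n)][:, :p]

beta_hat = ridge_beta(X, Y, lam)
# Gradient of (1/n)||Y - X beta^T||^2 + lam ||beta||^2 with respect to beta.
grad = -(2.0 / n) * (Y - X @ beta_hat.T).T @ X + 2.0 * lam * beta_hat
```

The gradient `grad` should vanish up to floating-point error, which is the first-order condition behind the closed form.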

Definition 2.3.

The excess risk induced by the estimator $\hat{\beta}_{\lambda}$ is defined as $\mathcal{R}(\hat{\beta}_{\lambda}):=\mathsf{error}_{\text{apx}}(\hat{\beta}_{\lambda})$.

The term “matching” can be viewed in the following sense: (1) the form of nonlinearity in $\operatorname{\sf E}[Y|X_{1}]$ should be captured by $\operatorname{\sf E}[X_{2}|X_{1}]$ ; (2) the “redundant” nonlinearity in $\operatorname{\sf E}[X_{2}|X_{1}]$ should be linearly dependent so that they can be removed through a linear transform. As a toy example, consider $\operatorname{\sf E}[Y|X_{1}]=X_{1}^{2}$ and $\operatorname{\sf E}[X_{2}|X_{1}]=(-X_{1}^{2}+\sin(X_{1}),\,0.5\sin(X_{1}))^{\top}$ and note they share the same quadratic term $X_{1}^{2}$ , while the sine functions in $\operatorname{\sf E}[X_{2}|X_{1}]$ are redundant for predicting $Y$ . Observe that SSL with $\beta=(-1,\,2)^{\top}$ extracts the quadratic term while eliminating the sine functions. In contrast, $\operatorname{\sf E}[X_{2}|X_{1}]=(X_{1},\,0.5\cos(X_{1}))^{\top}$ will not lead to an exact matching.
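The cancellation in the toy example can be checked directly. The sketch below evaluates $\beta\operatorname{\sf E}[X_{2}|X_{1}]$ with $\beta=(-1,\,2)$ at a few arbitrarily chosen points and compares it with $X_{1}^{2}$; the helper names `psi` and `predict` are introduced here for illustration.

```python
import math

def psi(x1):
    """E[X2 | X1 = x1] from the toy example: (-x1^2 + sin(x1), 0.5 sin(x1))."""
    return (-x1**2 + math.sin(x1), 0.5 * math.sin(x1))

beta = (-1.0, 2.0)

def predict(x1):
    """Linear read-out beta * psi(x1)."""
    z = psi(x1)
    return beta[0] * z[0] + beta[1] * z[1]

# Largest deviation from E[Y | X1 = x1] = x1^2 over a few test points.
max_gap = max(abs(predict(x) - x**2) for x in [-2.0, -0.5, 0.0, 1.0, 3.0])
```

Here `max_gap` is zero up to rounding, since the two sine terms cancel exactly under this choice of $\beta$.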

Remark 2.4.

The notion that predicting a subset of $X$ can be helpful for predicting $Y$ is not limited to reconstructive SSL. For instance, in a series of recent papers Du and Xiang (2022, 2023a, 2023b), the authors have explored a similar direction from an invariance perspective for multi-environment domain adaptation, which has partially motivated this study.

3 Necessary and Sufficient Condition for Exact Matching

In an attempt to demystify the matching between the pretext and downstream tasks, we propose to identify the conditions on the generating mechanism of $(X_{1},X_{2},Y)$ that enable an exact matching. The generating mechanism of $(X_{1},Y)$ in a supervised learning task is often complicated, and thus we make no assumptions on how $(X_{1},Y)$ is generated and focus on the interactions between $(X_{1},Y)$ and $X_{2}$ . Without loss of generality, we can write $X_{2}$ in the following form

X_{2}=h(X_{1},Y)+N,

(2)

where $h(X_{1},Y):=\operatorname{\sf E}[X_{2}|X_{1},Y]$ is the regression function of $X_{2}$ on $(X_{1},Y)$ and therefore the residual variable $N:=X_{2}-h(X_{1},Y)$ satisfies $\operatorname{\sf E}[N|X_{1},Y]=0$ . The function $h$ captures how the label $Y$ and feature $X_{1}$ are encoded into $X_{2}$ .

Equation (2) can be viewed from a causal perspective through a general structural causal model (SCM) (Pearl, 2009), $X_{2}=f(X_{1},Y,\varepsilon)$, where $\varepsilon$ is a vector of exogenous variables independent of $(X_{1},Y)$. Since this general SCM suffers from identifiability issues, we focus on (2), observing that $h(X_{1},Y):=\operatorname{\sf E}[X_{2}|X_{1},Y]=\operatorname{\sf E}[f(X_{1},Y,\varepsilon)|X_{1},Y]$. It is important to note that (2) is valid even when there is no underlying causal graph over $(X_{1},X_{2},Y)$.

Recall that $Y$ is the one-hot encoding of $\bar{Y}$ . Observe that an arbitrary function $h:(\mathcal{X},\mathcal{Y})\to\mathbb{R}^{d}$ can be equivalently written as

h(X,Y)=\sum_{j=1}^{p}h(X,e_{j})\mathbbm{1}_{\bar{Y}=y_{j}}=\sum_{j=1}^{p}h(X,e_{j})e_{j}^{\top}Y:=C^{h}(X)Y,

(3)

where we use the fact that $\mathbbm{1}_{\bar{Y}=y_{j}}=e_{j}^{\top}Y$ . This simple derivation implies a one-to-one correspondence between $h$ and $C^{h}$ , meaning that the general function model (2) can be expressed as

X_{2}=h(X_{1},Y)+N=C^{h}(X_{1})Y+N,

(4)

with $C^{h}:\mathcal{X}_{1}\to\mathbb{R}^{d_{2}\times p}$ . The role of the latent random matrix $C^{h}(X_{1})$ is to encode the label variable $Y$ into $X_{2}$ , thus we call $C^{h}$ the encoding function. The expression in (4) implies the identity $\operatorname{\sf E}[X_{2}|X_{1}]=C^{h}(X_{1})\operatorname{\sf E}[Y|X_{1}]$ , which is equivalent to

\operatorname{\sf E}[X_{2}|X_{1}]=(C^{h}(X_{1})+O(X_{1}))\operatorname{\sf E}[Y|X_{1}]:=\bar{C}^{h}(X_{1})\operatorname{\sf E}[Y|X_{1}],

(5)

for any $O:\mathcal{X}_{1}\to\mathbb{R}^{d_{2}\times p}$ such that $O(X_{1})\operatorname{\sf E}[Y|X_{1}]=\bm{0}$. In words, the rows of $O(x)$ are orthogonal to $\operatorname{\sf E}[Y|X_{1}=x]$ for all $x\in\mathcal{X}_{1}$. We call such $O(x)$ an orthogonal term. For instance, the orthogonality holds when $\operatorname{\sf E}[Y|X_{1}]=(X_{1},X_{1}^{2})^{\top}$ and each row of $O(X_{1})$ is $(-X_{1},1)$, since $-X_{1}\cdot X_{1}+1\cdot X_{1}^{2}=0$. Equation (5) defines an equivalence class of encoding functions $\mathcal{C}=\{\bar{C}^{h}\}$ that result in the same pretext representation $\psi^{*}(X_{1})=\operatorname{\sf E}[X_{2}|X_{1}]$. This shows that such orthogonal terms do not affect the analysis of SSL, and thus we use $\doteq$ to hide the orthogonal term (added to $C^{h}$) in equations throughout the paper.

Proposition 3.1.

The exact matching in Definition 2.2 holds if and only if $\beta C^{h}(x)\doteq{\bf I}_{p}$ , for $\forall x\in\mathcal{X}_{1}$ and some $\beta\in\mathbb{R}^{p\times d_{2}}$ .

Therefore, in this formulation, finding an exact matching is equivalent to inverting the encoding function $C^{h}$ . Proposition 3.1 implies that the full rank of $C^{h}(x)$ for every $x\in\mathcal{X}_{1}$ is a necessary condition for the exact matching. In the following lemma, we provide a sufficient and necessary condition for the exact matching through a full characterization of the invertibility of $C^{h}$ .

Lemma 3.2 (sufficient and necessary condition for exact matching).

There is an exact matching between $Y$ and $X_{2}$ given $X_{1}$ if and only if

C^{h}(x)\doteq A\begin{bmatrix}{\bf I}_{p}\\ R(x)\end{bmatrix}\quad\text{ for }\forall x\in\mathcal{X}_{1},

(6)

for some invertible matrix $A\in\mathbb{R}^{d_{2}\times d_{2}}$ and an arbitrary matrix function $R:\mathcal{X}_{1}\to\mathbb{R}^{(d_{2}-p)\times p}$.

The identity mapping ${\bf I}_{p}$ fully preserves each class of $Y$, and $R(x)$ represents the redundancy encoded into $X_{2}$. It is worth noting that redundancy refers to the features extracted from $X_{1}$ that are predictive for $X_{2}$, but redundant for the prediction of $Y$ (given the optimal predictor $\operatorname{\sf E}[Y|X_{1}]$). In our stylized MNIST experiment in Section 6.2.2, the dash pattern in the background is redundancy since it is useful for predicting the image orientation (i.e., $X_{2}$), but it contains no information about the label. In contrast, the dot pattern is not redundant, since it is independent of both the image orientation and the label. The lemma above reveals that the label $Y$ should be encoded into $X_{2}$ through an invertible linear mixture (i.e., $A$) of the full label information and some redundancy. When $A$ is an identity matrix, the first $p$ rows of $\psi^{*}(X_{1})=\operatorname{\sf E}[X_{2}|X_{1}]$ capture the full label information, thus the downstream task has a sparse solution $\beta^{*}=[{\bf I}_{p},\,\bm{0}]$. However, the solutions to the downstream task may not be sparse in general, and we handle this challenge in Section 4. Below are two examples with explicit forms of $C^{h}$.

Example 3.3.

An important special case of model (4) is $X_{2}=\widetilde{C}Y+N$ , where $C^{h}\equiv\widetilde{C}$ is a constant function. In this case, the necessary and sufficient condition simplifies to the condition that $\widetilde{C}$ has full rank. Observe that $\operatorname{\sf E}[X_{2}|X_{1}]=\widetilde{C}\operatorname{\sf E}[Y|X_{1}]$ implies $\mathsf{error}_{\text{apx}}(\widetilde{C}^{\dagger})=0$ .

In Appendix A, we show that model $X_{2}=\widetilde{C}Y+N$ is equivalent to $\operatorname{\sf E}[X_{2}|X_{1},Y]=\operatorname{\sf E}[X_{2}|Y]$, which we call conditional mean independence; this condition is weaker than CI, i.e., $X_{1}\perp\!\!\!\perp X_{2}\,|\,Y$. The setting in Example 3.3 has been studied in (Lee et al., 2021) under CI. Despite the simplicity of conditional mean independence, it can be unrealistic in practical settings, since it requires that the label $Y$ is fully encoded into $X_{2}$ with no redundant information (depending on $X_{1}$), as if the pretext and downstream tasks were two equivalent prediction tasks. Even though approximate conditional independence has been studied in (Lee et al., 2021), it is unclear if the approximation provides sufficient insights into explaining why and when SSL works (or fails), since conditional independence (or constant $C^{h}$) is not a necessary condition for exact matching.

Example 3.4 (partially linear model).

Define an invertible matrix $A\in\mathbb{R}^{d_{2}\times d_{2}}$ with a column partition as $A=[A_{1}|A_{2}|A_{3}]$ , where $A_{1}$ has $p$ columns, $A_{2}$ has $k$ columns such that $1\leq k\leq d_{2}-p$ , and $A_{3}$ has the rest of the columns. Let $X_{2}=A_{1}Y+A_{2}a(X_{1})+N$ , where $a:\mathcal{X}_{1}\to\mathbb{R}^{k}$ satisfies $\operatorname{\sf E}[a(X_{1})]=\bm{0}$ . Its encoding function is $C^{h}(X_{1})=A_{1}+A_{2}a(X_{1})\bm{1}^{\top}$ as derived below. The sufficient and necessary condition is immediately satisfied with $A$ and $R(x)=[\bm{1}a^{\top}(x),\bm{0}]^{\top}$ , where $R(x)$ has $d_{2}-p-k$ all zero rows.

h(X_{1},Y)=\sum_{j=1}^{p}(A_{2}a(X_{1})+A_{1}e_{j})e_{j}^{\top}Y=\Big(A_{2}a(X_{1})\sum_{j}e_{j}^{\top}+A_{1}\sum_{j}e_{j}e_{j}^{\top}\Big)Y:=C^{h}(X_{1})Y.

Even though conditional independence fails to hold since $C^{h}$ is not constant, according to Lemma 3.2, there is still an exact matching.
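The exact matching in the partially linear model can be illustrated numerically. The sketch below draws a random $A$, builds $C^{h}(x)=A_{1}+A_{2}a(x)\bm{1}^{\top}$, and checks that $\beta=[{\bf I}_{p},\,\bm{0}]A^{-1}$ satisfies $\beta C^{h}(x)={\bf I}_{p}$ at several points. The sizes, the random seed, and the particular map `a` are assumptions chosen for illustration; the check does not rely on the centering $\operatorname{\sf E}[a(X_{1})]=\bm{0}$.

```python
import numpy as np

rng = np.random.default_rng(1)
d2, p, k = 6, 2, 3                       # hypothetical sizes with k <= d2 - p
A = rng.normal(size=(d2, d2))            # Gaussian matrix, invertible with probability one
A1, A2 = A[:, :p], A[:, p:p + k]

def a(x1):
    """Hypothetical redundancy map a: R -> R^k."""
    return np.array([np.sin(x1), np.cos(2 * x1), x1**3])

def encoding(x1):
    """C^h(x1) = A1 + A2 a(x1) 1^T, a d2 x p matrix."""
    return A1 + np.outer(A2 @ a(x1), np.ones(p))

# beta = [I_p, 0] A^{-1} consists of the first p rows of A^{-1}.
beta = np.linalg.inv(A)[:p, :]
gaps = [np.abs(beta @ encoding(x) - np.eye(p)).max() for x in (-1.0, 0.3, 2.0)]
```

Each entry of `gaps` is zero up to numerical error even though `encoding` varies with `x1`, which is consistent with Lemma 3.2: a non-constant encoding function does not rule out an exact matching.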

4 Structural Redundancy

The pretext representation $\psi^{*}(X_{1})$ is typically high-dimensional, designed to capture abundant information for various downstream tasks. Given limited labeled samples ($n\ll d_{2}$ in our notation), the downstream task is a high-dimensional linear regression problem. Without any assumptions such as sparsity of the true coefficients (Tibshirani, 1996; Candes and Tao, 2007) or low effective dimension of the features (Zhang, 2005; Hsu et al., 2012), SSL may not perform well even if the exact matching holds. Since the true coefficients are not necessarily sparse as discussed in Section 3, we explore how low-rank structures in the redundancy (i.e., $R(X_{1})$) naturally lead to a low effective dimension. In particular, we adopt the definition of effective dimension from (Hsu et al., 2012) in the context of ridge regression (see details below); a closely related notion is called effective degrees of freedom (Efron, 1986; Hastie et al., 2009). Roughly speaking, the effective dimension measures the number of features that are not linearly correlated; when it is low, a small number of labeled samples can be sufficient for reliable estimation.

To simplify the notation, denote $X:=\psi^{*}(X_{1})$ with covariance matrix $\Sigma:=\mathop{\rm Cov}\nolimits(X)$ . Let $\{\lambda_{j}\}_{j=1}^{d_{2}}$ denote the eigenvalues of $\Sigma$ . The population ridge estimator is given by

\beta_{\lambda}=\arg\min_{\beta}\operatorname{\sf E}\left[\lVert Y-X\beta\rVert^{2}\right]+\lambda\lVert\beta\rVert^{2}=(\Sigma+\lambda{\bf I}_{d_{2}})^{-1}\operatorname{\sf E}[XY^{\top}].

Ridge regression implicitly performs dimension reduction when an appropriate shrinkage parameter $\lambda$ is chosen. The reduced dimension for a chosen $\lambda$ can be quantified by the effective dimension, defined as $d_{\lambda}=\sum_{j=1}^{d}\frac{\lambda_{j}}{\lambda_{j}+\lambda}$ for $X\in\mathbb{R}^{d}$. Note that the bias of the ridge estimator increases monotonically as $\lambda$ increases. When $\Sigma$ has exactly $s$ nonzero eigenvalues, $d_{\lambda}$ is upper bounded by $s$ for any $\lambda\geq 0$. Besides this special case, a low effective dimension can be achieved under a weak penalty (i.e., small $\lambda$) when a large fraction of the eigenvalues are small.
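To make the definition of $d_{\lambda}$ concrete, the following sketch evaluates it on two spectra: one with exactly three nonzero eigenvalues at $\lambda=0$, and one with three dominant eigenvalues plus several small ones under a weak penalty. The spectra and the function name `effective_dimension` are hypothetical and chosen only for illustration.

```python
def effective_dimension(eigenvalues, lam):
    """d_lambda = sum_j lambda_j / (lambda_j + lam) for eigenvalues of Sigma."""
    # Zero eigenvalues contribute nothing; skipping them also avoids 0/0 at lam = 0.
    return sum(ev / (ev + lam) for ev in eigenvalues if ev > 0)

# Exactly three nonzero eigenvalues: d_lambda equals 3 at lam = 0.
d_zero = effective_dimension([5.0, 3.0, 1.0] + [0.0] * 7, lam=0.0)

# Three dominant eigenvalues and seven small ones under a weak penalty.
d_small = effective_dimension([5.0, 3.0, 1.0] + [1e-4] * 7, lam=0.01)
```

In the second case the ten-dimensional feature vector has an effective dimension only slightly above three, because each small eigenvalue contributes roughly $10^{-4}/10^{-2}$ to the sum.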

In the next subsection, we demonstrate that a low effective dimension is naturally attained when the redundancy $R(X_{1})$ can be approximated by a low-rank decomposition. Our finite sample analysis of the high-dimensional setting, presented in Section 5, relies on an upper bound on the effective dimension that uses a measure of the low-rank approximation introduced in the following subsection (Lemma H.1). When $d_{2}<n$, $d_{\lambda}$ with $\lambda=0$ offers an interpretation of our upper bound on the excess risk and sample complexity (see details in Theorem 5.2 and Remark 5.3).

4.1 Low-rank Approximation of Redundancy

Recall, redundancy refers to information in $X_{1}$ useful for predicting $X_{2}$ but not for the label $Y$ . For instance, in computer vision tasks, consider a label $Y$ determined by the object of interest within a surrounding background. Redundancy arises when the pretext task captures background information unrelated to the label. If the background features simple patterns, such as sky, grassland, or beach, this redundancy can be considered low-rank. Consequently, it is relatively easy to eliminate such redundancy in downstream tasks (recall the cancellation of sine functions below Definition 2.3). In the following, we present the technical details of the low-rank approximation of redundancy.

Denote $\widetilde{C}:=\operatorname{\sf E}[C^{h}(X_{1})]$ and recall that $C^{h}(X_{1})$ reduces to $\widetilde{C}$ under conditional mean independence. Assume that the necessary and sufficient condition in Lemma 3.2 is satisfied,

\begin{aligned}X=C^{h}(X_{1})\operatorname{\sf E}[Y|X_{1}]&=\left(\widetilde{C}+A\begin{bmatrix}\bm{0}&\left(R(X_{1})-\widetilde{R}\right)^{\top}\end{bmatrix}^{\top}\right)\operatorname{\sf E}[Y|X_{1}]\\ &=\left(\widetilde{C}+A_{p+1:d_{2}}(R(X_{1})-\widetilde{R})\right)\operatorname{\sf E}[Y|X_{1}],\end{aligned}

where $\widetilde{R}:=\operatorname{\sf E}[R(X_{1})]$ and $A_{p+1:d_{2}}\in\mathbb{R}^{d_{2}\times(d_{2}-p)}$ denotes the last $d_{2}-p$ columns of $A$ . If the (centered) redundancy $R(X_{1})-\widetilde{R}$ admits a low-rank decomposition, i.e., $R(X_{1})-\widetilde{R}=Bg(X_{1})$ for some $B\in\mathbb{R}^{(d_{2}-p)\times s}$ and $g:\mathcal{X}_{1}\to\mathbb{R}^{s\times p}$ , where $s\ll d_{2}$ , we get

X=(\widetilde{C}+A_{p+1:d_{2}}Bg(X_{1}))\operatorname{\sf E}[Y|X_{1}],

(7)

which has at most $p+s$ linearly independent components, as $\text{rank}(\widetilde{C})\leq p$ and $\text{rank}(A_{p+1:d_{2}}B)\leq s$ . This shows that the effective dimension of $X$ is bounded by $p+s$ for any $\lambda\geq 0$ .
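The rank bound $p+s$ can be observed on simulated representations. The sketch below samples $X_{1}$, forms $X=(\widetilde{C}+A_{p+1:d_{2}}Bg(X_{1}))\operatorname{\sf E}[Y|X_{1}]$ with randomly drawn matrices, and counts the non-negligible singular values of the resulting sample matrix. The functions `g` and `cond_mean_y`, the sizes, and the tolerance are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d2, p, s, n = 20, 2, 3, 500              # hypothetical sizes
C_tilde = rng.normal(size=(d2, p))
A_tail = rng.normal(size=(d2, d2 - p))   # stands in for the last d2 - p columns of A
B = rng.normal(size=(d2 - p, s))

def g(x1):
    """Hypothetical g: R -> R^{s x p}."""
    return np.array([[np.sin(x1), x1], [np.cos(x1), x1**2], [x1, 1.0]])

def cond_mean_y(x1):
    """Hypothetical E[Y | X1 = x1] in R^p (softmax over p + 1 classes, last dropped)."""
    e = np.exp([x1, -x1, 0.0])
    return (e / e.sum())[:p]

x1_samples = rng.normal(size=n)
X = np.stack([(C_tilde + A_tail @ B @ g(x)) @ cond_mean_y(x) for x in x1_samples])

# Numerical rank: singular values above a relative tolerance.
sv = np.linalg.svd(X, compute_uv=False)
rank = int((sv > 1e-8 * sv[0]).sum())
```

Although `X` has $d_{2}=20$ columns, `rank` does not exceed $p+s=5$, since every row lies in the span of the columns of $\widetilde{C}$ and $A_{p+1:d_{2}}B$.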

Since the low-rank decomposition may not hold exactly for a chosen $s$, we identify $\hat{X}$ of the form $(\widetilde{C}+Bg(X_{1}))\operatorname{\sf E}[Y|X_{1}]$ that best approximates $X$. Specifically, for any fixed $s$ such that $1\leq s\leq d_{2}-p$, we consider $B\in\mathbb{R}^{d_{2}\times s},\,g:\mathcal{X}_{1}\to\mathbb{R}^{s\times p}$, and define

\varepsilon_{s}:=\min_{\hat{X}}\frac{1}{d_{2}}\operatorname{\sf E}\left[\left\lVert X-\hat{X}\right\rVert^{2}\right]=\min_{B,g}\frac{1}{d_{2}}\operatorname{\sf E}\left[\left\lVert\left(C^{h}(X_{1})-\widetilde{C}-Bg(X_{1})\right)\operatorname{\sf E}[Y|X_{1}]\right\rVert^{2}\right],

(8)

with minimizers $\{(B^{*},g^{*})\}$, and we use the shorthand $\widetilde{X}:=(\widetilde{C}+B^{*}g^{*}(X_{1}))\operatorname{\sf E}[Y|X_{1}]$ for any pair of minimizers $(B^{*},g^{*})$. Without loss of generality, we normalize $g(X_{1})$ and assume $\operatorname{\sf E}[\lVert g(X_{1})\rVert]=1$. The low-rank approximation error is averaged over the $d_{2}$ coordinates so that $\varepsilon_{s}$ does not grow with $d_{2}$. A challenge for analyzing $\varepsilon_{s}$ is that the minimizers do not have closed-form expressions. When $(X_{1},X_{2},Y)$ follows a Gaussian distribution, we show in Appendix E that (8) reduces to a weighted low-rank approximation problem (with no randomness). However, these weighted problems do not have closed-form solutions in general (Srebro and Jaakkola, 2003; Dutta and Li, 2017). Therefore, we leave further investigation of $\varepsilon_{s}$ for future work. For $s=0$, we simply define

\varepsilon_{0}:=\frac{1}{d_{2}}\operatorname{\sf E}\left[\left\lVert\left(C^{h}(X_{1})-\widetilde{C}\right)\operatorname{\sf E}[Y|X_{1}]\right\rVert^{2}\right],

which measures how closely conditional mean independence (i.e., $C^{h}=\widetilde{C}$) holds. There is a tradeoff between the effective dimension of $\widetilde{X}$ (which is no greater than $p+s$) and the approximation quality, as the approximation level $\varepsilon_{s}$ is non-increasing in $s$. An important special case of the low-rank approximation is when the encoding functions are smooth.

Example 4.1 (smooth encoding function).

Consider a binary classification problem with a scalar predictor $X_{1}$, i.e., $p=d_{1}=1$. Assume that the encoding function $C^{h}:\mathbb{R}\to\mathbb{R}^{d_{2}}$ is sufficiently smooth. Then its second-order Taylor expansion at $a\in\mathbb{R}$ is given by

C^{h}(x)=C^{h}(a)+\begin{bmatrix}\frac{\mathrm{d}C^{h}}{\mathrm{d}x}\big|_{x=a}&\frac{\mathrm{d}^{2}C^{h}}{\mathrm{d}x^{2}}\big|_{x=a}\end{bmatrix}\begin{bmatrix}x-a&\tfrac{1}{2}(x-a)^{2}\end{bmatrix}^{\top}+\mathcal{O}((x-a)^{3}),

where we can choose $a$ so that $C^{h}(a)=\widetilde{C}$ . This provides a rank-two approximation for $C^{h}(x)-\widetilde{C}$ such that $\varepsilon_{s}=\mathcal{O}((x-a)^{3})$ , where $s=2$ . This example can be generalized to high-order, multi-class, and multivariate cases, and we provide the details in Appendix D.

To understand the impact of the size of $\varepsilon_{s}$ on how approximately the matching holds (or how small the approximation error is), we derive the following upper bound via a ridge-type estimator. Unlike ridge-type estimators used in practice, the parameter $\varepsilon_{s}$ that restricts the size of the coefficients is determined by the generating mechanism of $(X_{1},X_{2},Y)$ .

Lemma 4.2.

For $B^{*}$ and $g^{*}$ corresponding to $\varepsilon_{s}$, we have

\min_{\beta}\mathsf{error}_{\text{apx}}(\beta)\leq 2(p+\operatorname{\sf E}[\lVert N\rVert^{2}])\min_{\beta}\left(\lVert{\bf I}_{p}-\beta\widetilde{C}\rVert^{2}+\lVert\beta B^{*}\rVert^{2}+\varepsilon_{s}\lVert\beta\rVert^{2}\right),

where the minimum of the RHS is attained at $\beta_{s}:=\widetilde{C}^{\top}(\widetilde{C}\widetilde{C}^{\top}+B^{*}(B^{*})^{\top}+\varepsilon_{s}{\bf I}_{d_{2}})^{-1}$. The equality holds with the RHS being zero when $\varepsilon_{s}=0$.
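The ridge-type objective on the RHS can be examined numerically. With $\beta\in\mathbb{R}^{p\times d_{2}}$, $\widetilde{C}\in\mathbb{R}^{d_{2}\times p}$, and $B^{*}\in\mathbb{R}^{d_{2}\times s}$, setting the gradient to zero gives $\beta(\widetilde{C}\widetilde{C}^{\top}+B^{*}(B^{*})^{\top}+\varepsilon_{s}{\bf I}_{d_{2}})=\widetilde{C}^{\top}$. The sketch below solves this system for randomly drawn stand-ins and checks that random perturbations do not decrease the objective; the matrices, sizes, and value of $\varepsilon_{s}$ are assumptions for illustration, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
d2, p, s, eps = 10, 2, 3, 0.05           # hypothetical sizes and epsilon_s
C = rng.normal(size=(d2, p))             # stands in for C-tilde
B = rng.normal(size=(d2, s))             # stands in for B*

def objective(beta):
    """||I_p - beta C||^2 + ||beta B||^2 + eps ||beta||^2 (Frobenius norms)."""
    return (np.linalg.norm(np.eye(p) - beta @ C) ** 2
            + np.linalg.norm(beta @ B) ** 2
            + eps * np.linalg.norm(beta) ** 2)

# Stationary point: beta (C C^T + B B^T + eps I) = C^T.
M = C @ C.T + B @ B.T + eps * np.eye(d2)
beta_s = np.linalg.solve(M, C).T         # M is symmetric, so this equals C^T M^{-1}

# The objective is strictly convex for eps > 0, so perturbations cannot improve it.
base = objective(beta_s)
perturbed = [objective(beta_s + 1e-3 * rng.normal(size=(p, d2))) for _ in range(100)]
is_min = all(v >= base for v in perturbed)
```

Because the penalty $\varepsilon_{s}\lVert\beta\rVert^{2}$ makes the objective strictly convex whenever $\varepsilon_{s}>0$, the stationary point is the unique minimizer, and `is_min` reflects this on the sampled perturbations.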

5 Finite Sample Analysis

Let $\beta^{*}\in\arg\min_{\beta}\mathsf{error}_{\text{apx}}(\beta)$ be a fixed true parameter for the downstream task. Recall that the excess risk is defined as $\mathcal{R}(\hat{\beta}_{\lambda})=\operatorname{\sf E}[\lVert\operatorname{\sf E}[Y|X_{1}]-\hat{\beta}_{\lambda}X\rVert^{2}]$. Under conditional mean independence, observe that $X=\operatorname{\sf E}[X_{2}|X_{1}]=\widetilde{C}\operatorname{\sf E}[Y|X_{1}]$ is a feature vector with at most $p$ (out of $d_{2}$) features that are linearly independent. Since the number of classes $p$ is often much smaller than the dimension of the learned representation $d_{2}$, the design matrix $\bm{X}$ for the downstream task is of low rank. This enables a finite-sample bound on the excess risk of order $\widetilde{\mathcal{O}}(\frac{p}{n}\sigma^{2})$ with sample complexity $\widetilde{\mathcal{O}}(p)$ (Lee et al., 2021), where the bound is independent of the dimension $d_{2}$. In the following, we provide a finite-sample analysis of SSL in the general setting when conditional independence can be violated, based on the low-rank approximation defined in (8).

First, we introduce a few technical assumptions. Let $\Sigma$ , $\widetilde{\Sigma}$ , and $\bar{\Sigma}$ denote the covariance matrices of $X$ , $\widetilde{X}$ , and $\bar{X}:=\widetilde{X}-X$ , respectively.

Assumption 1

We assume $N:=Y-\operatorname{\sf E}[Y|X_{1}]$ is $\sigma^{2}$ -sub-Gaussian, and the whitened feature vectors $\widetilde{\Sigma}^{-1/2}\widetilde{X}$ and $\bar{\Sigma}^{-1/2}\bar{X}$ are $\rho^{2}$ -sub-Gaussian. (Footnote 1: When $\widetilde{\Sigma}$ or $\bar{\Sigma}$ is not invertible, the whitened feature vector is defined through the generalized inverse.)

Assumption 2

There exist $\tilde{b},\bar{b}\geq 0$ such that the following hold almost surely:

$\bullet$

$\lVert\widetilde{\Sigma}^{-1/2}\widetilde{X}(\operatorname{\sf E}[Y|X_{1}]-\beta^{*}X)^{\top}\rVert\leq\tilde{b}\sqrt{p+s}$ ;
$\bullet$

$\lVert\bar{\Sigma}^{-1/2}\bar{X}(\operatorname{\sf E}[Y|X_{1}]-\beta^{*}X)^{\top}\rVert\leq\bar{b}\sqrt{d_{2}}$ .

Remark 5.1.

A similar assumption has been made in (Lee et al., 2021, Assumption 3.3), which is motivated by (Hsu et al., 2012, Condition 4).

Let $\lambda_{\text{max}}(A)$ denote the largest eigenvalue of a symmetric real matrix $A\neq\bm{0}$ , $\lambda_{\text{min}\neq 0}(A)$ denote its smallest nonzero eigenvalue, and $\{\lambda_{i}(A)\}$ denote the set of all its eigenvalues. When $\widetilde{X}$ is a good approximation of $X$ , we expect $\widetilde{\Sigma}-\Sigma$ and $\bar{\Sigma}$ to be close to zero matrices. Therefore, we consider restricting the largest eigenvalues of the two matrices, respectively. A generic bound for a $d\times d$ matrix is provided in (Wolkowicz and Styan, 1980), that is, $\lambda_{\text{max}}(A)\leq\frac{\text{tr}(A)}{d}+\sqrt{d-1}\cdot s(A)$ , where $s(A):=\sqrt{\frac{\text{tr}(A^{2})}{d}-\frac{\text{tr}^{2}(A)}{d^{2}}}$ is the standard deviation of $\{\lambda_{i}(A)\}$ . The equality holds when the $d-1$ smallest eigenvalues are equal. However, this bound can be quite loose when $d$ is large. Instead, we make the following assumption.

Assumption 3

For some universal constants $c_{1}\geq 0$ and $c_{2}\geq 0$ ,

$\bullet$

$\lambda_{\text{max}}(\widetilde{\Sigma}-\Sigma)\leq c_{1}\frac{1}{d_{2}}\cdot\lvert\text{tr}(\widetilde{\Sigma}-\Sigma)\rvert$ ;
$\bullet$

$\lambda_{\text{max}}(\bar{\Sigma})\leq c_{2}\frac{1}{d_{2}}\cdot\text{tr}(\bar{\Sigma})$ .
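To get a feel for the quantities involved, the sketch below compares the largest eigenvalue of a symmetric matrix with the trace-based bound of Wolkowicz and Styan (1980), written with the standard deviation of the eigenvalues, and with the average eigenvalue that appears in Assumption 3. The test matrix and the helper name `wolkowicz_styan_upper` are hypothetical choices made only for illustration.

```python
import numpy as np

def wolkowicz_styan_upper(A):
    """Trace-based upper bound on the largest eigenvalue of a symmetric matrix A."""
    d = A.shape[0]
    mean = np.trace(A) / d                      # average eigenvalue
    var = np.trace(A @ A) / d - mean ** 2       # variance of the eigenvalues
    std = np.sqrt(max(var, 0.0))                # guard against tiny negative round-off
    return mean + np.sqrt(d - 1) * std

rng = np.random.default_rng(0)
G = rng.normal(size=(50, 50))
A = G @ G.T / 50                                # hypothetical positive semidefinite matrix

lam_max = np.linalg.eigvalsh(A)[-1]             # eigvalsh returns eigenvalues in ascending order
bound = wolkowicz_styan_upper(A)
ratio_to_mean = lam_max / (np.trace(A) / A.shape[0])   # smallest admissible constant in Assumption 3 for A
```

Comparing `bound` with `lam_max` shows how much slack the generic bound leaves, while `ratio_to_mean` is the smallest constant for which an inequality of the form in Assumption 3 would hold for this particular matrix.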

Both inequalities require that the average eigenvalue be comparable to the largest eigenvalue. The assumption can be unrealistic when $\widetilde{\Sigma}-\Sigma$ or $\bar{\Sigma}$ has mostly zero eigenvalues but a few large positive eigenvalues. We explain why such a case will not happen when $s\ll d_{2}$ . Case I: When $\text{rank}(\Sigma)\ll d_{2}$ , there exists $s=\text{rank}(\Sigma)$ such that $\widetilde{X}=X$ and $\bar{X}=\bm{0}$ , in which case the inequalities are satisfied with $c_{1}=c_{2}=0$ . Case II: In settings where $\text{rank}(\Sigma)$ is comparable to $d_{2}$ (i.e., $\text{rank}(\Sigma)$ is a fraction of $d_{2}$ ), $\widetilde{X}$ satisfies $\text{rank}(\widetilde{\Sigma})\leq p+s\ll d_{2}$ , and thus the rank of $\widetilde{\Sigma}-\Sigma$ should be at least $d_{2}-p-s$ , meaning that most of its eigenvalues are nonzero. Similarly, $\bar{X}=\widetilde{X}-X$ should have at least $d_{2}-p-s$ linearly independent components, i.e., $\text{rank}(\bar{\Sigma})\geq d_{2}-p-s$ . Given that $\widetilde{X}$ serves as an approximation for $X$ with a lower effective dimension, we make the following technical assumption on the rank.

Assumption 4

$\text{rank}(\widetilde{\Sigma})\leq\text{rank}(\Sigma)$ and $\text{rank}(\bar{\Sigma})\leq\text{rank}(\Sigma)$ .

Since $X$ has $\text{rank}(\Sigma)$ linearly independent components while $\widetilde{X}$ is introduced to approximate $\text{rank}(\widetilde{\Sigma})$ independent components out of them, $\bar{X}=\widetilde{X}-X$ is expected to have fewer independent components than $X$ .

Theorem 5.2.

Under Assumptions 1–4, for any $\delta\in(0,1)$ , if $n\gg\rho^{4}(d_{2}+\log\frac{1}{\delta})$ , the excess risk of the downstream task induced by $\hat{\beta}_{0}$ is upper bounded, with probability at least $1-\delta$ , by

\mathcal{R}(\hat{\beta}_{0})\leq\mathsf{error}_{\text{apx}}^{*}+\widetilde{\mathcal{O}}\left((1+\varepsilon_{s})\frac{p+s}{n}\sigma^{2}+\varepsilon_{s}\frac{d_{2}}{n}\sigma^{2}\right).

Remark 5.3.

When $\varepsilon_{s}=0$ , if $n\gg\rho^{4}(p+s+\log\frac{1}{\delta})$ , we have $\mathcal{R}(\hat{\beta}_{0})\leq\widetilde{\mathcal{O}}\left(\frac{p+s}{n}\sigma^{2}\right)$ .

The proof of Theorem 5.2 follows an idea similar to that of (Lee et al., 2021, Theorem $3.5$ ); a subtle yet important difference is that we consider approximation errors due to violations of the exact matching, while they consider approximation errors due to the choice of the function class $\Psi$ . When $\varepsilon_{s}=0$ , the dominating rate of $\mathcal{R}(\hat{\beta}_{0})$ is $\frac{p+s}{n}\sigma^{2}$ , which shows that SSL enjoys a sample complexity similar to that shown in (Lee et al., 2021) even when conditional independence is violated. We have demonstrated in Lemma 4.2 how the approximation error $\mathsf{error}_{\text{apx}}(\beta^{*})$ depends on the approximation level $\varepsilon_{s}$ . We also provide the bound with respect to $\hat{\beta}_{\lambda}$ , stated below. The proof largely follows (Hsu et al., 2012, Theorem 16), and we only outline the main steps in Appendix H.

Corollary 5.4 (Informal).

Under suitable assumptions, the excess risk of the downstream task induced by $\hat{\beta}_{\lambda}$ can be upper bounded by

\mathcal{R}(\hat{\beta}_{\lambda})\leq\mathsf{error}_{\text{apx}}^{*}+\operatorname{\sf E}[\lVert(\beta_{\lambda}-\beta^{*})X\rVert^{2}]+\mathcal{O}\left(\frac{p+s}{n}\left(1+\frac{\varepsilon_{s}}{\lambda}\right)\tilde{\sigma}^{2}\right),

with high probability, where $\tilde{\sigma}^{2}:=\mathsf{error}_{\text{apx}}(\beta_{\lambda})+\operatorname{\sf E}[\lVert(\beta_{\lambda}-\beta^{*})X\rVert^{2}]+\sigma^{2}$ .

For simplicity, the parameters that depend on the choice of $\lambda$ are omitted. The bound requires $p+s\ll n$ even though $n<d_{2}$ , thus an approximation (8) with lower rank is expected in this more challenging setting. The term $\operatorname{\sf E}[\lVert(\beta_{\lambda}-\beta^{*})X\rVert^{2}]$ relies on the difference between $\beta_{\lambda}$ and $\beta^{*}$ , as well as the choice of $\lambda$ . When $\beta_{\lambda}=\beta^{*}$ and $\varepsilon_{s}=0$ , the dominating rate $\frac{p+s}{n}\sigma^{2}$ is the same as that in Remark 5.3. This shows that low-rank structures enable SSL to share a similar excess risk upper bound and sample complexity in low- and high-dimensional settings.

6 Experiments

We propose a synthetic dataset and two computer vision tasks to examine the importance of the full rank condition on $C^{h}(x)$ and the low-rank approximation quality. Recall that a necessary condition for the exact matching is that $C^{h}(x)$ has full rank for every $x\in\mathcal{X}_{1}$ . For the synthetic dataset, we ensure that $C^{h}(x)$ is of full rank and focus on the low-rank approximation. For image data, since $C^{h}$ is a latent matrix function, it is not straightforward to test whether $C^{h}(x)$ has full rank in general. To this end, we design images of simple geometric shapes, which can be seen as abstractions of real images, and show how some geometric properties make the rank condition hold or fail. To further understand the importance of the low-rank approximation, we add background patterns to the MNIST dataset and show that certain patterns can lead to poor low-rank approximation. More details of the experiments are given in Appendices I, J, and K. SSL approaches have achieved superior performance on large benchmark datasets, but the function class for downstream tasks there is often much larger than linear models (e.g., MLPs), which is beyond the scope of our theoretical analysis. The implementation of our experiments is provided at \url{https://github.com/dukang4655/reconstructive_ssl}.

6.1 Synthetic Data

Figure 1: Setting I (top row): $n=300$ and $s$ varies. SSL (red), $\text{SL}_{1}$ (blue), and $\text{SL}_{2}$ (green). $50$ repeated experiments. Solid lines: mean; shaded region: standard deviation. Setting II (bottom row): $s=5$ and $n$ varies.

We use a synthetic dataset to verify our theoretical results when $n>d_{2}$ . First, we generate $(X_{1},X_{2},Y)$ with $d_{1}=10$ , $d_{2}=20$ , and $p=2$ , according to $X_{2}=(A+Bg(X_{1}))Y+N$ with $B\in\mathbb{R}^{d_{2}\times s}$ . See details of the model parameters in Appendix I. We compare SSL with two supervised learning (SL) procedures in two settings: I. Fix $n=300$ and vary the low-rank approximation by considering $s\in\{1,2,\ldots,5,10,15,20\}$ ; II. Fix $s=5$ and vary the labeled sample size $n\in\{100,200,400,800,1600\}$ . The two supervised learning procedures are $\text{SL}_{1}$ : predicting $Y$ by $X_{1}$ , and $\text{SL}_{2}$ : predicting $Y$ by $(X_{1},X_{2})$ . We use MLPs for the pretext task and the two supervised learning procedures.

In Fig. 1, the performance of $\text{SL}_{1}$ is roughly invariant with respect to $s$ since it does not use $X_{2}$ for the prediction, making it more robust than $\text{SL}_{2}$ . This is consistent with the view that predictions using the parents of $Y$ as predictors (which we call causal predictions) are more robust than non-causal predictions under a small sample size. The superior performance of SSL degrades as $s$ increases. When $s<d_{2}-p=18$ , we have $\varepsilon_{s}=0$ according to the factorization in (7). In Fig. 1, when $s=20$ , a low-rank approximation (8) could lead to a large approximation error $\varepsilon_{\bar{s}}$ for $\bar{s}\leq 18$ . As a consequence, the advantage of SSL over $\text{SL}_{1}$ is much smaller compared with the case with $s=0$ . This suggests that a good low-rank approximation is not only sufficient but also necessary for SSL.

In the other setting, where $s$ is fixed to $5$ and the sample size $n$ varies, as shown in Fig. 1, the performance of SSL improves slowly with the increasing sample size, since it already achieves performance that is close to the optimal (i.e., the performance of $\text{SL}_{1}$ with a large $n$ ) under a small $n$ . $\text{SL}_{1}$ starts to catch up with SSL when $n\geq 800$ , while the accuracy of $\text{SL}_{2}$ does not consistently improve as $n$ increases.
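The generating mechanism $X_{2}=(A+Bg(X_{1}))Y+N$ can be sketched as follows. This is only an illustration under stated assumptions: the actual model parameters are given in Appendix I and in the authors' repository, whereas here the choice of $g$ (a `tanh` of a random linear map), the softmax label model, the noise level, and the function name `generate_synthetic` are all hypothetical.

```python
import numpy as np

def generate_synthetic(n, d1=10, d2=20, p=2, s=5, noise_std=0.1, seed=0):
    """Draw (X1, X2, Y) with X2 = (A + B g(X1)) Y + N and one-hot Y (hypothetical parameters)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(d2, p))                      # full-rank component
    B = rng.normal(size=(d2, s))                      # rank-s factor of the redundant component
    W_g = rng.normal(size=(d1, s * p)) / np.sqrt(d1)  # hypothetical weights defining g
    W_y = rng.normal(size=(d1, p))                    # hypothetical weights defining P(Y | X1)

    X1 = rng.normal(size=(n, d1))
    # Sample labels from a softmax model in X1 via the Gumbel-max trick, so X1 is a parent of Y.
    labels = np.argmax(X1 @ W_y + rng.gumbel(size=(n, p)), axis=1)
    Y = np.eye(p)[labels]                             # one-hot labels, n x p

    G = np.tanh(X1 @ W_g).reshape(n, s, p)            # g(X1) as an s x p matrix per sample
    C = A[None, :, :] + np.einsum("ds,nsp->ndp", B, G)   # encoding matrix A + B g(X1) per sample
    X2 = np.einsum("ndp,np->nd", C, Y) + noise_std * rng.normal(size=(n, d2))
    return X1, X2, Y

X1, X2, Y = generate_synthetic(n=300, s=5)
```

Varying `s` at fixed `n`, or `n` at fixed `s`, mimics the structure of the two settings compared in Fig. 1.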

6.2 Computer Vision Tasks

6.2.1 Geometric Shapes (On the Rank Condition)

In computer vision applications, it is common that the dimension of the learned representation is much larger than the labeled sample size (i.e., $d_{2}\gg n$ ) for SSL. We design a simple task to help understand how the patterns in an image make SSL work or fail; the task is inspired by (Gidaris et al., 2018), where the pretext task is to predict the rotation angles of images. We consider $X_{1}$ as a random image of some objects, where the objects have random sizes and random locations. The goal is to classify the shape of the object $\bar{Y}$ . We create $X_{2}$ by randomly rotating $X_{1}$ by $0$ or $90$ degrees. Observe that the location and the size of an object are redundant features for predicting its shape and orientation. This stylized setting, even though much simpler than real-world images, is deliberately designed so that the redundancy variable admits low-rank approximations. According to (4), the $j^{th}$ column of $C^{h}(x)$ can be viewed as a feature vector for the rotation angle corresponding to the $j^{th}$ class of objects. Since we consider the classification of two classes (i.e., $p=2$ ), the condition requires that the two feature vectors should not be similar. To verify this necessary condition, we consider two pairs of objects:

Triangle vs. Tangent Circles (Fig. 2(a)). In this case, the identification of orientation is based on characteristics specific to the two shapes. For triangles, it is natural to focus on the edges and vertices, while those characteristics are not even defined for circles. Thus, we expect the rank condition to be approximately satisfied. We examine this observation using Grad-CAM (Selvaraju et al., 2017), which visualizes the contributing features that the model extracts from the image (see Fig. 4 in Appendix J). The same visualization is used for the pair of objects below.

Triangle vs. Pentagon (Fig. 2(b)). In this case, the orientation of the two objects can be identified in similar ways, mainly based on the edges and vertices, so the columns of $C^{h}(x)$ are close to linearly dependent for different $x\in\mathcal{X}_{1}$ . As a consequence, the necessary condition for the exact matching is violated.

We compare SSL with SL under different labeled sample sizes $n\in\{10,20,40,60,80,100\}$ . For Triangle vs. Tangent Circles, as shown in Fig. 2, SSL consistently outperforms SL for small sample sizes (i.e., $n\leq 60$ ). The performance of SSL improves more slowly than that of SL for sufficiently large $n$ , since the prediction error of SSL will be dominated by the population error instead of the estimation error. Recall that our finite-sample bound on the excess risk in Corollary 5.4 converges to the sum of the approximation error and an error term depending on $\lambda$ as $n\to\infty$ . In contrast, SSL behaves very differently for Triangle vs. Pentagon. From Fig. 2(b), the accuracy of SSL improves more slowly as $n$ increases, potentially due to the large approximation error. Overall, SSL has almost no advantage over SL. This experiment shows that the rank condition is crucial for SSL.

6.2.2 Stylized MNIST (On the Low-Rank Approximation)

To examine how the redundancy affects the performance of SSL, we consider the same rotation prediction task for a stylized MNIST dataset illustrated in Fig. 3, where the density of the background pattern varies randomly. A key observation is that the dot pattern does not help identify the image orientation, so it is not encoded into the orientation variable as redundancy. On the contrary, one can tell the orientation of the image simply by the orientation of the dash pattern, thus the pretext task will extract features from the dash pattern as redundant information. Again, we use Grad-CAM to visualize our observation in Fig. 7 from Appendix K. Consequently, a dense dash pattern can lead to poor low-rank approximation. In Fig. 6 from Appendix K, the performance of SSL is almost invariant to the density of the dot pattern while the performance of SL drops as the density increases. In contrast, SSL is quite sensitive to a sparse dash pattern and the performance gets worse as the density increases (see Fig. 3). We have tested the dot vs. dash patterns for the geometric shape images, and similar results are observed as shown in Fig. 5 from Appendix J.

7 Discussion

Many important questions remain to be studied and we list a few in this section. One natural next step is to study nonlinear function classes for the downstream task and characterize the corresponding sufficient and necessary conditions. Since our theoretical results can potentially provide guidance for developing SSL procedures, especially for designing pretext tasks, it would be worthwhile to design systematic and extensive experiments to better bridge the theories and practical designs. Besides the superior performance under limited labeled samples, the other major advantage of SSL is that the learned representation can be useful for a diverse class of downstream tasks; a theoretical understanding of its ability to generalize to new tasks or unseen environments (e.g., by exploiting invariance (Du and Xiang, 2023b)) is of great importance.

References

Arora et al. (2019) Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
Candes and Tao (2007) Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. 2007.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Du and Xiang (2022) Kang Du and Yu Xiang. An invariant matching property for distribution generalization under intervened response. In 2022 30th European Signal Processing Conference (EUSIPCO), pages 1387–1391. IEEE, 2022.
Du and Xiang (2023a) Kang Du and Yu Xiang. Generalized invariant matching property via lasso. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023a.
Du and Xiang (2023b) Kang Du and Yu Xiang. Learning invariant representations under general interventions on the response. IEEE Journal on Selected Areas in Information Theory, 2023b.
Dutta and Li (2017) Aritra Dutta and Xin Li. On a problem of weighted low-rank approximation of matrices. SIAM Journal on Matrix Analysis and Applications, 38(2):530–553, 2017.
Efron (1986) Bradley Efron. How biased is the apparent error rate of a prediction rule? Journal of the American statistical Association, 81(394):461–470, 1986.
Gidaris et al. (2018) Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
Gui et al. (2023) Jie Gui, Tuo Chen, Qiong Cao, Zhenan Sun, Hao Luo, and Dacheng Tao. A survey of self-supervised learning from multiple perspectives: Algorithms, theory, applications and future trends. arXiv preprint arXiv:2301.05712, 2023.
Hastie et al. (2009) Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
Hsu et al. (2012) Daniel Hsu, Sham M Kakade, and Tong Zhang. Random design analysis of ridge regression. In Conference on learning theory, pages 9–1. JMLR Workshop and Conference Proceedings, 2012.
Lee et al. (2021) Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34:309–323, 2021.
Ozbulak et al. (2023) Utku Ozbulak, Hyun Jung Lee, Beril Boga, Esla Timothy Anzaku, Homin Park, Arnout Van Messem, Wesley De Neve, and Joris Vankerschaver. Know your self-supervised learning: A survey on image-based generative and discriminative training. arXiv preprint arXiv:2305.13689, 2023.
Pathak et al. (2016) Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
Pearl (2009) Judea Pearl. Causality. Cambridge University Press, 2009.
Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
Saunshi et al. (2020) Nikunj Saunshi, Sadhika Malladi, and Sanjeev Arora. A mathematical exploration of why language models help solve downstream tasks. arXiv preprint arXiv:2010.03648, 2020.
Selvaraju et al. (2017) Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
Srebro and Jaakkola (2003) Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In Proceedings of the 20th international conference on machine learning (ICML-03), pages 720–727, 2003.
Teng et al. (2022) Jiaye Teng, Weiran Huang, and Haowei He. Can pretext-based self-supervised learning be boosted by downstream data? a theoretical analysis. In International Conference on Artificial Intelligence and Statistics, pages 4198–4216. PMLR, 2022.
Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.
Tosh et al. (2021) Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu. Contrastive learning, multi-view redundancy, and linear models. In Algorithmic Learning Theory, pages 1179–1206. PMLR, 2021.
Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010.
Wei et al. (2021) Colin Wei, Sang Michael Xie, and Tengyu Ma. Why do pretrained language models help in downstream tasks? an analysis of head and prompt tuning. Advances in Neural Information Processing Systems, 34:16158–16170, 2021.
Wolkowicz and Styan (1980) Henry Wolkowicz and George PH Styan. Bounds for eigenvalues using traces. Linear algebra and its applications, 29:471–506, 1980.
Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 649–666. Springer, 2016.
Zhang (2005) Tong Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural computation, 17(9):2077–2098, 2005.

Appendix A Conditional Mean Independence

Recall that when $C^{h}$ is a constant function, we have $X_{2}=\widetilde{C}Y+N$ and we require $\operatorname{\sf E}[N|X_{1},Y]=0$ rather than $\operatorname{\sf E}[N|Y]=0$ .

Proposition A.1.

Model (4) holds with $C^{h}$ being a constant function if and only if $\operatorname{\sf E}[X_{2}|X_{1},Y]=\operatorname{\sf E}[X_{2}|Y]$ .

Proof A.2.

First, assume that $X_{2}=\widetilde{C}Y+N$ holds with $\operatorname{\sf E}[N|X_{1},Y]=0$ . It follows immediately that

\operatorname{\sf E}[X_{2}|X_{1},Y]=\operatorname{\sf E}[\widetilde{C}Y+N|X_{1},Y]=\widetilde{C}Y

and $\operatorname{\sf E}[X_{2}|Y]=\operatorname{\sf E}[\widetilde{C}Y+N|Y]=\widetilde{C}Y$ , where we use the fact that $\operatorname{\sf E}[N|Y]=\operatorname{\sf E}[\operatorname{\sf E}[N|X_{1},Y]|Y]=0$ . Thus we have $\operatorname{\sf E}[X_{2}|X_{1},Y]=\operatorname{\sf E}[X_{2}|Y]$ . Now we prove the other direction. Assume that $\operatorname{\sf E}[X_{2}|X_{1},Y]=\operatorname{\sf E}[X_{2}|Y]$ , then

\operatorname{\sf E}[X_{2}|X_{1},Y]=\operatorname{\sf E}[X_{2}|Y]=\sum_{i=1}^{p}\operatorname{\sf E}[X_{2}|Y=e_{i}]\mathbbm{1}_{\bar{Y}=y_{i}}:=\widetilde{C}Y,

where $\widetilde{C}$ has columns $\operatorname{\sf E}[X_{2}|Y=e_{i}]$ ’s, which implies model (A.1) in the form of $X_{2}=\widetilde{C}Y+N$ .

Appendix B Proof of Proposition 3.1

Proof B.1.

First, according to Definition 2.2, the exact matching is equivalent to

\operatorname{\sf E}[Y|X_{1}]=\beta\operatorname{\sf E}[X_{2}|X_{1}]

(9)

for some $\beta\in\mathbb{R}^{p\times d_{2}}$ . By our assumption that $\mathop{\rm Cov}\nolimits(\operatorname{\sf E}[Y|X_{1}])$ has full rank, $\beta$ has to have full rank when the exact matching holds. Plugging $\operatorname{\sf E}[X_{2}|X_{1}]=C^{h}(X_{1})\operatorname{\sf E}[Y|X_{1}]$ into (9), we have

({\bf I}_{p}-\beta C^{h}(X_{1}))\operatorname{\sf E}[Y|X_{1}]=\bm{0},

for some $\beta\in\mathbb{R}^{p\times d_{2}}$ . Equivalently,

{\bf I}_{p}-\beta C^{h}(X_{1})=\bar{O}(X_{1})

(10)

for some $\bar{O}:\mathcal{X}_{1}\to\mathbb{R}^{p\times p}$ such that $\bar{O}(X_{1})\operatorname{\sf E}[Y|X_{1}]=\bm{0}$ . Since $\beta$ has full rank, (10) is equivalent to

\beta(C^{h}(X_{1})-\beta^{-1}\bar{O}(X_{1}))={\bf I}_{p},

(11)

where $\beta^{-1}$ denotes the right inverse of $\beta$ and $O(X_{1}):=\beta^{-1}\bar{O}(X_{1})$ satisfies $O(X_{1})\operatorname{\sf E}[Y|X_{1}]=\bm{0}$ .

Appendix C Proof of Lemma 3.2

Proof C.1.

For simplicity of notation, we prove the lemma without considering the orthogonal term, since the orthogonal term can be directly added to the final expression of $C^{h}$ . Recall that the dimension of $\beta$ is $p$ by $d_{2}$ . According to Proposition 3.1, it is sufficient to show that $\beta C^{h}(x)={\bf I}_{p}$ is equivalent to (6). The “if” part is immediate since $\beta=[C^{-1},\bm{0}]A^{-1}$ will lead to $\beta C^{h}(x)={\bf I}_{p}$ , $\forall x\in\mathcal{X}_{1}$ . In the following, we prove the other direction. When the exact matching holds, recall that the full rank of $\mathop{\rm Cov}\nolimits(\operatorname{\sf E}[Y|X_{1}])$ implies that $\beta$ has full row rank since $p<d_{2}$ . The QR decomposition of $\beta^{\top}$ gives $\beta=[\bar{C},\bm{0}]\bar{A}$ , where $\bar{C}\in\mathbb{R}^{p\times p}$ is an invertible lower-triangular matrix and $\bar{A}\in\mathbb{R}^{d_{2}\times d_{2}}$ is an orthogonal matrix. Then, $\beta C^{h}(x)={\bf I}_{p}$ implies $[\bar{C},\bm{0}]B(x)={\bf I}_{p}$ with $B(x):=\bar{A}C^{h}(x)$ . Due to the zero columns in $[\bar{C},\bm{0}]$ , the first $p$ rows of $B(x)$ have to be $\bar{C}^{-1}$ , while the other rows, denoted by $R(x)$ , can be arbitrary. Therefore, we obtain

C^{h}(x)=\bar{A}^{-1}B(x)=\bar{A}^{-1}\begin{bmatrix}\bar{C}^{-1}\\ R(x)\end{bmatrix}=\widetilde{A}^{-1}\begin{bmatrix}{\bf I}_{p}\\ R(x)\end{bmatrix},\quad\forall x\in\mathcal{X}_{1},

where $\widetilde{A}^{-1}$ is the product of $\bar{A}^{-1}$ and some elementary matrices introduced to transform $\bar{C}^{-1}$ to an identity matrix ${\bf I}_{p}$ .
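The QR-based construction in this proof can be checked numerically. The sketch below draws a hypothetical full-row-rank $\beta$, extracts $\bar{C}$ and $\bar{A}$ from the complete QR decomposition of $\beta^{\top}$, and confirms that any choice of the free block $R(x)$ gives $\beta C^{h}(x)={\bf I}_{p}$; the sizes and random inputs are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d2 = 2, 6
beta = rng.normal(size=(p, d2))              # hypothetical beta, full row rank with probability one

# Complete QR of beta^T: beta^T = Q R with Q (d2 x d2) orthogonal and R (d2 x p) upper triangular.
Q, R = np.linalg.qr(beta.T, mode="complete")
C_bar = R[:p, :].T                           # invertible lower-triangular p x p block
A_bar = Q.T                                  # orthogonal d2 x d2 matrix
factored = np.hstack([C_bar, np.zeros((p, d2 - p))]) @ A_bar   # [C_bar, 0] A_bar

def encoding_matrix(R_x):
    """Return A_bar^{-1} [C_bar^{-1}; R(x)] for an arbitrary (d2 - p) x p block R(x)."""
    top = np.linalg.inv(C_bar)
    return A_bar.T @ np.vstack([top, R_x])   # A_bar^{-1} equals A_bar^T for an orthogonal matrix

R_x = rng.normal(size=(d2 - p, p))           # arbitrary redundant rows
residual = beta @ encoding_matrix(R_x) - np.eye(p)
```

Here `factored` reproduces `beta`, and `residual` vanishes up to floating-point error regardless of `R_x`, which is the freedom the lemma attributes to the redundant component.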

Appendix D Low-rank approximation of Smooth Encoding Functions

In this section, we study how the smoothness of the encoding function enables the low-rank approximation. Specifically, we will construct the approximation in (8) explicitly with polynomial functions. For simplicity, we present the idea for second-order approximation, while the higher-order cases can be derived in a similar manner.

Let $C^{h}:\mathcal{X}_{1}\to\mathbb{R}^{d_{2}\times p}$ be a twice continuously differentiable matrix function. The Taylor expansion of its $j^{th}$ column $C^{h}_{j}$ at $a\in\mathcal{X}_{1}$ is given by

\begin{aligned}
C_{j}^{h}(x)&=C_{j}^{h}(a)+(C_{j}^{h})^{\prime}(a)(x-a)+\begin{bmatrix}\frac{1}{2}(x-a)^{\top}(C_{1j}^{h})^{\prime\prime}(a)(x-a)\\ \vdots\\ \frac{1}{2}(x-a)^{\top}(C_{d_{2}j}^{h})^{\prime\prime}(a)(x-a)\end{bmatrix}+\mathcal{O}(\lVert x-a\rVert^{3})\\
&:=C_{j}^{h}(a)+A_{j}(a)\phi(x)+\mathcal{O}(\lVert x-a\rVert^{3}),
\end{aligned}

(12)

where the $i^{th}$ row of $(C_{j}^{h})^{\prime}(a)\in\mathbb{R}^{d_{2}\times d_{1}}$ is the derivative of $C^{h}_{ij}$ evaluated at $x=a$ and $(C_{ij}^{h})^{\prime\prime}(a)$ is the Hessian matrix of $C_{ij}^{h}$ evaluated at $x=a$ . We represent $C_{j}^{h}(x)$ in a matrix form in (12) by introducing $\phi(x)=(x_{1}-a_{1},x_{2}-a_{2},\ldots,x_{d_{1}}-a_{d_{1}},(x_{1}-a_{1})^{2},(x_{1}-a_{1})(x_{2}-a_{2}),(x_{1}-a_{1})(x_{3}-a_{3}),\ldots,(x_{d_{1}-1}-a_{d_{1}-1})(x_{d_{1}}-a_{d_{1}}),(x_{d_{1}}-a_{d_{1}})^{2})^{\top}\in\mathbb{R}^{d_{1}+d_{1}^{2}}$ and the coefficient matrix $A_{j}(a)\in\mathbb{R}^{d_{2}\times(d_{1}+d_{1}^{2})}$ consisting of the (scaled) first two order derivatives. This allows us to approximate $C^{h}$ by

C^{h}(x)=C^{h}(a)+\begin{bmatrix}A_{1}(a)\phi(x),A_{2}(a)\phi(x),\ldots,A_{p}(a)\phi(x)\end{bmatrix}+\mathcal{O}(\lVert x-a\rVert^{3}).

Let the maximum rank of the matrices $\{A_{i}(a):i=1,\ldots,p\}$ be $s$ , then there exist $B\in\mathbb{R}^{d_{2}\times s}$ and $\{D_{i}(a)\in\mathbb{R}^{s\times(d_{1}+d_{1}^{2})}:i=1,\ldots,p\}$ such that

C^{h}(x)=C^{h}(a)+B\begin{bmatrix}D_{1}(a)\phi(x),D_{2}(a)\phi(x),\ldots,D_{p}(a)\phi(x)\end{bmatrix}+\mathcal{O}(\lVert x-a\rVert^{3}),

(13)

which is enabled by the decomposition $A_{i}(a)=BD_{i}(a)$ for each $i$ . With the continuity of $C^{h}(x)$ , we can fix $a$ so that $C^{h}(a)=\widetilde{C}$ by the mean value theorem. The high-order remainder term can be ignored when the third derivatives of $C^{h}$ are all zeros. If this is not the case, we can include high-order terms in $\phi$ until the remainder term is small enough. If the maximum rank $s$ is not small, one can still consider the low-rank approximation of $\{A_{i}(a)\}$ , and there will be an additional approximation error term in (13).
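The decomposition $A_{i}(a)=BD_{i}(a)$ with a shared factor $B$ can be computed with a singular value decomposition. The sketch below is an illustration under an explicit assumption: the hypothetical coefficient matrices are generated so that they share an $s$-dimensional column space, which is what a common $B\in\mathbb{R}^{d_{2}\times s}$ requires. All sizes and names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d2, m, p, s = 20, 12, 2, 3                   # m stands in for the length of phi(x); hypothetical sizes

# Hypothetical coefficient matrices A_i(a) sharing an s-dimensional column space.
B_true = rng.normal(size=(d2, s))
A_list = [B_true @ rng.normal(size=(s, m)) for _ in range(p)]

# The column space of the concatenation [A_1, ..., A_p] gives a shared factor B.
U, sing, _ = np.linalg.svd(np.hstack(A_list), full_matrices=False)
rank = int(np.sum(sing > 1e-10 * sing[0]))   # numerical rank of the concatenation
B = U[:, :rank]                              # d2 x rank, orthonormal columns
D_list = [B.T @ A for A in A_list]           # D_i = B^T A_i, each of size rank x m

max_err = max(np.abs(B @ D - A).max() for A, D in zip(A_list, D_list))
```

When the numerical rank of the concatenation is large, truncating `U` to fewer columns gives a low-rank approximation of the $A_{i}(a)$ instead, at the cost of the additional approximation error mentioned above.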

Appendix E SSL under Gaussian Distribution

Even though our formulation focuses on the classification setting, the extension of our results to the Gaussian case is straightforward. Assume $\{X_{1},X_{2},Y\}$ are jointly Gaussian, where $Y$ is a scalar Gaussian variable. Let $\Sigma_{XZ}:=\mathop{\rm Cov}\nolimits(X,Z)$ for any random vectors $X$ and $Z$ . Then, $X:=\operatorname{\sf E}[X_{2}|X_{1}]=\Sigma_{X_{2}X_{1}}\Sigma_{X_{1}X_{1}}^{-1}X_{1}$ and $\operatorname{\sf E}[Y|X_{1}]=\Sigma_{YX_{1}}\Sigma_{X_{1}X_{1}}^{-1}X_{1}$ . If $d_{2}\geq d_{1}$ and $\Sigma_{X_{2}X_{1}}$ has full rank, it is straightforward to see that an exact matching holds with $\beta=\Sigma_{YX_{1}}\Sigma_{X_{2}X_{1}}^{\dagger}$ . Concretely, there exist $A\in\mathbb{R}^{d_{2}\times d_{1}}$ and $b\in\mathbb{R}^{d_{2}}$ such that

X_{2}=h(X_{1},Y)+N=AX_{1}+bY+N,

(14)

where $\operatorname{\sf E}[N|X_{1},Y]=\bm{0}$ .
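The exact matching in the Gaussian case can be verified at the level of coefficient matrices. The sketch below assumes zero-mean variables and a hypothetical random joint covariance of $(X_{1},X_{2},Y)$ with $d_{2}\geq d_{1}$, and compares the linear map defining $\operatorname{\sf E}[Y|X_{1}]$ with the one defining $\beta\operatorname{\sf E}[X_{2}|X_{1}]$ for $\beta=\Sigma_{YX_{1}}\Sigma_{X_{2}X_{1}}^{\dagger}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 3, 5                                # hypothetical dimensions with d2 >= d1
dim = d1 + d2 + 1
G = rng.normal(size=(dim, dim))
S = G @ G.T + np.eye(dim)                    # hypothetical joint covariance of (X1, X2, Y)

S11 = S[:d1, :d1]                            # Cov(X1, X1)
S21 = S[d1:d1 + d2, :d1]                     # Cov(X2, X1), d2 x d1, full column rank here
Sy1 = S[-1:, :d1]                            # Cov(Y, X1), 1 x d1

beta = Sy1 @ np.linalg.pinv(S21)             # 1 x d2, uses the Moore-Penrose pseudo-inverse
coef_y = Sy1 @ np.linalg.inv(S11)            # E[Y | X1] = coef_y @ X1
coef_match = beta @ S21 @ np.linalg.inv(S11) # beta E[X2 | X1] = coef_match @ X1
```

The two coefficient rows agree because the pseudo-inverse of a full-column-rank $\Sigma_{X_{2}X_{1}}$ is a left inverse, which is where the condition $d_{2}\geq d_{1}$ enters.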

Since the encoding function is not well-defined when $Y$ is continuous, we formulate the low-rank approximation by

\begin{aligned}
\varepsilon_{s}:=\min_{\hat{X}}\operatorname{\sf E}\left[\left\lVert X-\hat{X}\right\rVert^{2}\right]&=\min_{B\in\mathbb{R}^{d_{2}\times d_{1}}:\text{rank}(B)=s}\operatorname{\sf E}\left[\left\lVert\Sigma_{X_{2}X_{1}}\Sigma_{X_{1}X_{1}}^{-1}X_{1}-BX_{1}\right\rVert^{2}\right]\\
&=\min_{B\in\mathbb{R}^{d_{2}\times d_{1}}:\text{rank}(B)=s}\left\lVert(\Sigma_{X_{2}X_{1}}\Sigma_{X_{1}X_{1}}^{-1}-B)\Sigma_{X_{1}X_{1}}^{1/2}\right\rVert^{2},
\end{aligned}

which aims to find a weighted low-rank approximation for $\Sigma_{X_{2}X_{1}}\Sigma_{X_{1}X_{1}}^{-1}$ .
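When the weight matrix is positive definite and the norm is taken to be the Frobenius norm (both are assumptions of this sketch), the weighted problem reduces to an ordinary truncated SVD after the change of variables $Z=B\Sigma_{X_{1}X_{1}}^{1/2}$. The helper name `weighted_low_rank` and the random inputs are hypothetical.

```python
import numpy as np

def weighted_low_rank(M, Sigma, s):
    """Rank-s B minimizing ||(M - B) Sigma^{1/2}||_F^2, assuming Sigma is positive definite."""
    w, V = np.linalg.eigh(Sigma)
    root = (V * np.sqrt(w)) @ V.T            # Sigma^{1/2}
    inv_root = (V / np.sqrt(w)) @ V.T        # Sigma^{-1/2}
    U, sing, Vt = np.linalg.svd(M @ root, full_matrices=False)
    Z = (U[:, :s] * sing[:s]) @ Vt[:s]       # best rank-s approximation of M Sigma^{1/2}
    B = Z @ inv_root                         # map back; the rank is preserved
    err = float(np.sum(sing[s:] ** 2))       # squared Frobenius error of the truncation
    return B, err

rng = np.random.default_rng(0)
d1, d2, s = 6, 10, 2                         # hypothetical sizes
M = rng.normal(size=(d2, d1))                # stands in for Sigma_{X2 X1} Sigma_{X1 X1}^{-1}
L = rng.normal(size=(d1, d1))
Sigma = L @ L.T + np.eye(d1)                 # stands in for Sigma_{X1 X1}
B, err = weighted_low_rank(M, Sigma, s)
```

The returned `err` equals $\operatorname{tr}((M-B)\Sigma(M-B)^{\top})$, i.e. the expected squared error $\operatorname{\sf E}\lVert MX_{1}-BX_{1}\rVert^{2}$ for zero-mean $X_{1}$ with covariance `Sigma`.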

Appendix F Proof of Lemma 4.2

Proof F.1.

We have

\begin{aligned}
\min_{\beta}\mathsf{error}_{\text{apx}}(\beta)&=\min_{\beta}\operatorname{\sf E}[\lVert\operatorname{\sf E}[Y|X_{1}]-\beta C^{h}(X_{1})\operatorname{\sf E}[Y|X_{1}]\rVert^{2}]\\
&\leq\min_{\beta}\operatorname{\sf E}[\lVert\operatorname{\sf E}[Y|X_{1}]\rVert^{2}]\operatorname{\sf E}[\lVert{\bf I}_{p}-\beta(\widetilde{C}+B^{*}g^{*}(X_{1}))\rVert^{2}]+\varepsilon_{s}\lVert\beta\rVert^{2}\\
&\leq\min_{\beta}2(p+\operatorname{\sf E}[\lVert N\rVert^{2}])\left(\lVert{\bf I}_{p}-\beta\widetilde{C}\rVert^{2}+\lVert\beta B^{*}\rVert^{2}\right)+\varepsilon_{s}\lVert\beta\rVert^{2}\\
&\leq 2(p+\operatorname{\sf E}[\lVert N\rVert^{2}])\min_{\beta}\left(\lVert{\bf I}_{p}-\beta\widetilde{C}\rVert^{2}+\lVert\beta B^{*}\rVert^{2}+\varepsilon_{s}\lVert\beta\rVert^{2}\right),
\end{aligned}

where the first inequality follows from the sub-multiplicativity of the matrix norm and the triangle inequality, and the last two inequalities follow from the triangle inequality and the fact that $\operatorname{\sf E}[\lVert\operatorname{\sf E}[Y|X_{1}]\rVert^{2}]=\operatorname{\sf E}[\lVert Y-N\rVert^{2}]\leq 2\operatorname{\sf E}[\lVert Y\rVert^{2}]+2\operatorname{\sf E}[\lVert N\rVert^{2}]\leq 2(p+\operatorname{\sf E}[\lVert N\rVert^{2}])$ . When $\varepsilon_{s}=0$ , note that $[\widetilde{C},B^{*}]$ has rank at most $p+s\leq d_{2}$ , thus there exists at least one solution $\beta\in\mathbb{R}^{p\times d_{2}}$ for the equation $\beta[\widetilde{C},B^{*}]=[\beta\widetilde{C},\beta B^{*}]=[{\bf I}_{p},\bm{0}]$ . The expression of $\beta_{s}$ is a standard expression of ridge-type estimators.

Appendix G Proof of Theorem 5.2

Lemma G.1 (Concentration on the covariance matrix (Lee et al., 2021)).

For $\bm{X}\in\mathbb{R}^{n\times d}$ with i.i.d. rows, where each row is $\rho^{2}$ -sub-Gaussian with covariance $\Sigma$ . For any $B\in\mathbb{R}^{d\times m}$ with rank $k$ that is independent of $\bm{X}$ . For any $\delta\in(0,1)$ , if $n\gg\rho^{4}(k+\log(\frac{1}{\delta}))$ , with probability at least $1-\frac{\delta}{10}$ , we have

0.9B^{\top}\Sigma B\preceq\frac{1}{n}B^{\top}\bm{X}^{\top}\bm{X}B\preceq 1.1B^% {\top}\Sigma B.
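As an illustration of this two-sided bound, the following sketch draws Gaussian rows (a special case of sub-Gaussian rows) and checks the Loewner ordering through the eigenvalues of the whitened empirical matrix. The sizes `n`, `d`, `k` and the particular covariance are assumptions chosen for the example and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20000, 30, 3  # hypothetical sizes, with n much larger than k

# A random positive-definite covariance and rows X_i ~ N(0, Sigma).
G = rng.standard_normal((d, d))
Sigma = G @ G.T / d + 0.5 * np.eye(d)
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, d)) @ L.T

B = rng.standard_normal((d, k))  # rank k, drawn independently of X

pop = B.T @ Sigma @ B            # B^T Sigma B
XB = X @ B
emp = XB.T @ XB / n              # (1/n) B^T X^T X B

# 0.9 * pop <= emp <= 1.1 * pop in the Loewner order holds exactly when the
# eigenvalues of pop^{-1/2} emp pop^{-1/2} lie in [0.9, 1.1].
w, V = np.linalg.eigh(pop)
pop_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
ratios = np.linalg.eigvalsh(pop_inv_sqrt @ emp @ pop_inv_sqrt)
print("whitened eigenvalues:", np.round(ratios, 4))
```

Only the $k$ directions spanned by $B$ matter here, which is why the sample-size requirement in the lemma scales with $k$ rather than with $d$.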

Lemma G.2 ((Lee et al., 2021)).

Let $\bm{P}\in\mathbb{R}^{n\times n}$ be a projection matrix and let $\bm{Z}\in\mathbb{R}^{n\times d}$ be a matrix with i.i.d. rows, where each row of $\bm{Z}$ is mean zero (conditioning on $\bm{P}$ being rank $k$ ) $\sigma^{2}$ -sub-Gaussian. For any $\delta\in(0,1)$ , with probability at least $1-\delta$ ,

\lVert\bm{P}\bm{Z}\rVert\lesssim\sigma\sqrt{k(1+\log(k/\delta))}.

G.1 Technical Lemmas

Lemma G.3.

Under Assumptions 3 and 4, for any $s$ such that $1\leq s\leq d_{2}-p$ ,

$\bullet$ $\widetilde{\Sigma}$ satisfies $\widetilde{\Sigma}\preceq\tilde{a}(1+\varepsilon_{s})\Sigma$ for some $\tilde{a}\geq 0$ ;

$\bullet$ $\bar{\Sigma}$ satisfies $\bar{\Sigma}\preceq\bar{a}\varepsilon_{s}\Sigma$ for some $\bar{a}\geq 0$ .

Proof G.4.

To prove $A\preceq aB$ for symmetric matrices $A,B\in\mathbb{R}^{d\times d}$ and $a\geq 0$ , one can simply prove $\lambda_{j}(A)\leq a\lambda_{j}(B),\forall j$ . This holds immediately for the zero eigenvalues $\lambda_{j}(B)$ if $\text{rank}(A)\leq\text{rank}(B)$ . Therefore, by Assumption 4, we will focus on the case when $\Sigma$ has all positive eigenvalues. First,

\displaystyle\varepsilon_{s}=\frac{1}{d_{2}}\operatorname{\sf E}[\lVert% \widetilde{X}-X\rVert^{2}]\geq\frac{1}{d_{2}}\left\lvert\operatorname{\sf E}[% \lVert\widetilde{X}\rVert^{2}]-\operatorname{\sf E}[\lVert X\rVert^{2}]\right% \rvert=\frac{1}{d_{2}}\left\lvert\text{tr}(\widetilde{\Sigma}-\Sigma)\right\rvert.

In the following, we will use the fact that $\lambda_{\text{min}\neq 0}(A){\bf I}_{d}\preceq A\preceq\lambda_{\text{max}}(A% ){\bf I}_{d}$ for any symmetric matrix $A\in\mathbb{R}^{d\times d}$ . Using Assumption 3,

\widetilde{\Sigma}-\Sigma\preceq\lambda_{\text{max}}(\widetilde{\Sigma}-\Sigma){\bf I}_{d_{2}}\preceq\frac{c_{1}}{d_{2}}\lvert\text{tr}(\widetilde{\Sigma}-\Sigma)\rvert{\bf I}_{d_{2}}\preceq c_{1}\varepsilon_{s}{\bf I}_{d_{2}},

which implies,

\lambda_{i}(\widetilde{\Sigma})\leq\lambda_{i}(\Sigma)+c_{1}\varepsilon_{s}% \leq\lambda_{i}(\Sigma)+\frac{\lambda_{i}(\Sigma)}{\lambda_{\text{min}\neq 0}(% \Sigma)}c_{1}\varepsilon_{s}=\left(1+\frac{c_{1}\varepsilon_{s}}{\lambda_{% \text{min}\neq 0}(\Sigma)}\right)\lambda_{i}(\Sigma)\leq\tilde{a}(1+% \varepsilon_{s})\lambda_{i}(\Sigma),

for every $i$ and $\tilde{a}\geq\max(1,\frac{c_{1}}{\lambda_{\text{min}\neq 0}(\Sigma)})$ . This immediately leads to $\widetilde{\Sigma}\preceq\tilde{a}(1+\varepsilon_{s})\Sigma$ .

Finally, recall the fact that $\varepsilon_{s}=\frac{1}{d_{2}}\cdot\text{tr}(\bar{\Sigma})$ . By Assumption 3, we have

\bar{\Sigma}\preceq\lambda_{\text{max}}(\bar{\Sigma}){\bf I}_{d_{2}}\preceq% \frac{c_{2}}{d_{2}}\text{tr}(\bar{\Sigma}){\bf I}_{d_{2}}=c_{2}\varepsilon_{s}% {\bf I}_{d_{2}}\preceq\frac{c_{2}}{\lambda_{\text{min}\neq 0}(\Sigma)}% \varepsilon_{s}\Sigma:=\bar{a}\varepsilon_{s}\Sigma.

G.2 Proof of Theorem 5.2

Proof G.5.

First, recall that $\beta^{*}\in\arg\min_{\beta}\mathsf{error}_{\text{apx}}(\beta)$ and the shorthand $X:=\operatorname{\sf E}[X_{2}|X_{1}]$ . By the triangle inequality, we have

\mathcal{R}(\hat{\beta}_{0})=\operatorname{\sf E}[\lVert\operatorname{\sf E}[Y|X_{1}]-\hat{\beta}_{0}X\rVert^{2}]\leq\mathsf{error}_{\text{apx}}^{*}+\operatorname{\sf E}[\lVert(\beta^{*}-\hat{\beta}_{0})X\rVert^{2}].

Denote $a(X_{1}):=\operatorname{\sf E}[Y|X_{1}]-\beta^{*}X$ .

Recall that $X=\widetilde{X}+\bar{X}$ . Then, we can write $Y=\beta^{*}\widetilde{X}+\beta^{*}\bar{X}+a(X_{1})+N$ , with $N$ satisfying $\operatorname{\sf E}[N|X_{1}]=0$ by the tower property. The definition of $\hat{\beta}_{0}$ implies

\lVert\bm{Y}-\bm{X}\hat{\beta}_{0}^{\top}\rVert^{2}\leq\lVert\bm{Y}-\bm{X}(% \beta^{*})^{\top}\rVert^{2}=\lVert a(\bm{X}_{1})+\bm{N}\rVert^{2}.

By rearranging the terms, we get

	$\displaystyle\lVert\bm{X}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert^{2}$	$\displaystyle\leq\langle a(\bm{X}_{1}),\bm{\tilde{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle-\langle\bm{N},\bm{\tilde{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle$
		$\displaystyle+\langle a(\bm{X}_{1}),\bm{\bar{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle-\langle\bm{N},\bm{\bar{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle.$

We bound the first two inner products in the following, and the other two follow similarly. First,

$\displaystyle\langle a(\bm{X}_{1}),\bm{\tilde{X}}(\beta^{*}-\hat{\beta}_{0})^{% \top}\rangle$	$\displaystyle=\langle\widetilde{\Sigma}^{-1/2}\bm{\tilde{X}}^{\top}a(\bm{X}_{1% }),\widetilde{\Sigma}^{1/2}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle$
	$\displaystyle\leq\lVert\widetilde{\Sigma}^{-1/2}\bm{\tilde{X}}^{\top}a(\bm{X}_% {1})\rVert\lVert\widetilde{\Sigma}^{1/2}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert$
	$\displaystyle\leq 1.1\tilde{a}\tilde{b}\sqrt{1+\varepsilon_{s}}\sqrt{n(p+s)}% \lVert\Sigma^{1/2}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert$
	$\displaystyle\lesssim\sqrt{1+\varepsilon_{s}}\sqrt{p+s}\lVert\bm{X}(\beta^{*}-% \hat{\beta}_{0})^{\top}\rVert,$	(15)

where the inequality $\lVert\widetilde{\Sigma}^{-1/2}\bm{\tilde{X}}^{\top}a(\bm{X}_{1})\rVert\leq 1.% 1\tilde{b}\sqrt{n(p+s)}$ is due to Assumption 2 and the covariance concentration in Lemma G.1, and the last inequality is simply due to the covariance concentration. The replacement of $\widetilde{\Sigma}$ by $\Sigma$ is by Lemma G.3. Let $\bm{P}_{\bm{\tilde{X}}}$ denote the projection matrix defined with respect to $\bm{\tilde{X}}$ , we have

$\displaystyle\langle\bm{N},\bm{\tilde{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle$	$\displaystyle=\langle\bm{P}_{\bm{\tilde{X}}}\bm{N},\bm{\tilde{X}}(\beta^{*}-% \hat{\beta}_{0})^{\top}\rangle$
	$\displaystyle\leq\lVert\bm{P}_{\bm{\tilde{X}}}\bm{N}\rVert\lVert\bm{\tilde{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert$
	$\displaystyle\lesssim\sigma\sqrt{1+\varepsilon_{s}}\sqrt{(p+s)\left(1+\log% \frac{p+s}{\delta}\right)}\lVert\bm{X}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert,$	(16)

where the last bound is due to Lemma G.2 and the replacement of $\bm{\tilde{X}}$ by $\bm{X}$ follows from the covariance concentration and Lemma G.3. Since we make no assumptions on the rank of $\bar{\Sigma}$ , its rank is at most $d_{2}$ . Similarly, we get

	$\displaystyle\langle a(\bm{X}_{1}),\bm{\bar{X}}(\beta^{*}-\hat{\beta}_{0})^{% \top}\rangle$	$\displaystyle\lesssim\sqrt{\varepsilon_{s}}\sqrt{d_{2}}\lVert\bm{X}(\beta^{*}-% \hat{\beta}_{0})^{\top}\rVert$		(17)
	$\displaystyle\langle\bm{N},\bm{\bar{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle$	$\displaystyle\lesssim\sigma\sqrt{\varepsilon_{s}}\sqrt{d_{2}\left(1+\log\frac{% d_{2}}{\delta}\right)}\lVert\bm{X}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert.$		(18)

Combining (15), (16), (17), and (18) yields

\lVert\bm{X}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert\lesssim\sigma\sqrt{1+% \varepsilon_{s}}\sqrt{(p+s)\left(1+\log\frac{p+s}{\delta}\right)}+\sigma\sqrt{% \varepsilon_{s}}\sqrt{d_{2}\left(1+\log\frac{d_{2}}{\delta}\right)}.

Finally, the covariance concentration implies

\operatorname{\sf E}[\lVert(\beta^{*}-\hat{\beta}_{0})X\rVert^{2}]\lesssim(1+% \varepsilon_{s})\frac{(p+s)\left(1+\log\frac{p+s}{\delta}\right)}{n}\sigma^{2}% +\varepsilon_{s}\frac{d_{2}\left(1+\log\frac{d_{2}}{\delta}\right)}{n}\sigma^{% 2}.

Appendix H Proof of Corollary 5.4

Lemma H.1.

Under Assumptions 3 and 4, for $\lambda>0$ , there exists $c_{2}$ such that

d_{\lambda}\leq c_{2}\left(1+\frac{\varepsilon_{s}}{\lambda}\right)(p+s).

Remark H.2.

It is known that $d_{\lambda}=\operatorname{\sf E}[\lVert(\Sigma+\lambda{\bf I})^{-1/2}X\rVert^{% 2}]$ (Hsu et al., 2012). When $\varepsilon_{s}=0$ , we have $\operatorname{\sf E}[\lVert(\Sigma+\lambda{\bf I})^{-1/2}X\rVert^{2}]\leq\text% {rank}(\Sigma)=p+s$ , where the equality holds when $\lambda=0$ .
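To make the remark concrete, the sketch below computes $d_{\lambda}$ from the eigenvalues of a covariance matrix of rank $p+s$ and compares it with the rank for several values of $\lambda$. The helper name `effective_dim`, the sizes, and the randomly generated $\Sigma$ are assumptions introduced for illustration only. The same helper also evaluates $\sum_{j}(\lambda_{j}/(\lambda_{j}+\lambda))^{l}$ for a power $l$.

```python
import numpy as np

def effective_dim(eigvals, lam, power=1):
    """Return sum_j (lambda_j / (lambda_j + lam))**power for lam > 0."""
    eigvals = np.asarray(eigvals, dtype=float)
    return float(np.sum((eigvals / (eigvals + lam)) ** power))

rng = np.random.default_rng(0)
p, s, d2 = 3, 4, 20  # hypothetical sizes

# A covariance of rank p + s, mimicking the case eps_s = 0.
F = rng.standard_normal((d2, p + s))
Sigma = F @ F.T
# Clip tiny negative values caused by floating-point round-off.
eigvals = np.clip(np.linalg.eigvalsh(Sigma), 0.0, None)

lams = (1e-3, 1e-1, 1.0, 10.0)
results = []
for lam in lams:
    d1 = effective_dim(eigvals, lam, power=1)
    d2_lam = effective_dim(eigvals, lam, power=2)
    # The trace form tr(Sigma (Sigma + lam I)^{-1}) gives the same number as d1.
    d1_trace = float(np.trace(Sigma @ np.linalg.inv(Sigma + lam * np.eye(d2))))
    results.append((lam, d1, d2_lam, d1_trace))
    print(f"lambda={lam:g}  d_1={d1:.4f}  d_2={d2_lam:.4f}  rank={p + s}")
```

As $\lambda$ shrinks, $d_{\lambda}$ approaches the rank $p+s$ from below, and larger $\lambda$ reduces the effective dimension.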

Proof H.3.

Let $\{\tilde{\lambda}_{i}\}$ and $\{\lambda_{i}\}$ denote the eigenvalues of $\Sigma$ and $\widetilde{\Sigma}$ , respectively. Recall that Assumptions 3 and 4 imply $\tilde{\lambda}_{i}-\lambda_{i}\leq c_{1}\varepsilon_{s}$ as shown in the proof of Lemma G.3, then we have

	$\displaystyle d_{\lambda}=\sum_{i=1}^{d}\frac{\lambda_{i}}{\lambda+\lambda_{i}}$	$\displaystyle=\sum_{i=1}^{d}\frac{\tilde{\lambda}_{i}}{\lambda+\tilde{\lambda}% _{i}}+\lambda\sum_{i=1}^{d}\frac{\lambda_{i}-\tilde{\lambda}_{i}}{(\lambda+{% \lambda}_{i})(\lambda+\tilde{\lambda}_{i})}$
		$\displaystyle\leq\sum_{i=1}^{d}\frac{\tilde{\lambda}_{i}}{\lambda+\tilde{% \lambda}_{i}}+c_{1}\varepsilon_{s}\sum_{i=1}^{d}\frac{1}{(\lambda+{\lambda}_{i% })(\lambda+\tilde{\lambda}_{i})}$
		$\displaystyle=\sum_{i=1}^{d}\frac{\tilde{\lambda}_{i}}{\lambda+\tilde{\lambda}% _{i}}+\frac{c_{1}\varepsilon_{s}}{\lambda\cdot\lambda_{\text{min}\neq 0}(% \widetilde{\Sigma})}\sum_{i=1}^{d}\frac{\lambda\cdot\lambda_{\text{min}\neq 0}% (\widetilde{\Sigma})}{(\lambda+\lambda_{i})(\lambda+\tilde{\lambda}_{i})},$

where $\frac{\lambda}{\lambda+\lambda_{i}}\leq 1$ for all $i$ and $\sum_{i=1}^{d}\frac{\tilde{\lambda}_{i}}{\lambda+\tilde{\lambda}_{i}}\leq p+s$ since $\text{rank}(\widetilde{\Sigma})\leq p+s$ . Finally, we get

d_{\lambda}\leq p+s+\frac{c_{1}\varepsilon_{s}(p+s)}{\lambda\cdot\lambda_{% \text{min}\neq 0}(\widetilde{\Sigma})}\leq c_{2}\left(1+\frac{\varepsilon_{s}}% {\lambda}\right)(p+s),

where $c_{2}\geq\max\left\{1,\frac{c_{1}}{\lambda_{\text{min}\neq 0}(\widetilde{% \Sigma})}\right\}$ .

Corollary 5.4 Under (Hsu et al., 2012, Conditions 2 and 4), and the assumptions in Lemma H.1, the excess risk of the downstream task can be upper bounded by

\mathcal{R}(\hat{\beta}_{\lambda})\leq\mathsf{error}_{\text{apx}}^{*}+% \operatorname{\sf E}[\lVert(\beta_{\lambda}-\beta^{*})X\rVert^{2}]+\mathcal{O}% \left(\frac{p+s}{n}\left(1+\frac{\varepsilon_{s}}{\lambda}\right)\tilde{\sigma% }^{2}\right),

with high probability, where $\tilde{\sigma}^{2}:=\left(\operatorname{\sf E}[\lVert\operatorname{\sf E}[Y|X]-\beta_{\lambda}X\rVert^{2}]+\operatorname{\sf E}[\lVert(\beta_{\lambda}-\beta^{*})X\rVert^{2}]+\sigma^{2}\right)$ .

We only outline the main steps and refer the readers to (Hsu et al., 2012, Theorem 16) for details. First, by the triangle inequality, we have $\mathcal{R}(\hat{\beta}_{\lambda})\leq\mathsf{error}_{\text{apx}}^{*}+\operatorname{\sf E}[\lVert(\hat{\beta}_{\lambda}-\beta^{*})X\rVert^{2}]$ . The bound on the second term can be obtained as discussed below. The last term is simply due to $d_{2,\lambda}\leq d_{1,\lambda}\leq c_{2}(1+\frac{\varepsilon_{s}}{\lambda})(p+s)$ by Lemma H.1, where $d_{l,\lambda}=\sum_{j=1}^{d}\left(\frac{\lambda_{j}}{\lambda_{j}+\lambda}\right)^{l}$ for $l\in\{1,2\}$ . Observe that the ridge estimator with $Y_{j}=\mathbbm{1}_{\bar{Y}=y_{j}}$ as the target variable is equivalent to the $j^{th}$ row of the ridge estimator with $Y$ as the target variable. Thus we can provide a bound on $\operatorname{\sf E}[\lVert(\hat{\beta}_{\lambda,j}-\beta_{j}^{*})X\rVert^{2}]$ for each $j\in\{1,\ldots,p\}$ according to (Hsu et al., 2012, Theorem 16). Summing up the inequalities gives $\operatorname{\sf E}[\lVert(\hat{\beta}_{\lambda}-\beta^{*})X\rVert^{2}]\leq\operatorname{\sf E}[\lVert(\beta_{\lambda}-\beta^{*})X\rVert^{2}]+\operatorname{\sf E}[\lVert(\hat{\beta}_{\lambda}-\beta_{\lambda})X\rVert^{2}]$ , where $\operatorname{\sf E}[\lVert(\hat{\beta}_{\lambda}-\beta_{\lambda})X\rVert^{2}]$ is upper bounded in (Hsu et al., 2012, Theorem 16).

Appendix I Synthetic Data: Details of the Data Generation

Let $X_{1}\sim\mathcal{N}(\bm{0},I_{d_{1}})$ . Note that $\operatorname{\sf E}[\lVert X_{1}\rVert]=\sqrt{2}\frac{\Gamma(5.5)}{\Gamma(5)}% \approx 3.08$ , where $\Gamma(\cdot)$ is the Gamma function. The label $\bar{Y}$ is determined by $\lVert X_{1}\rVert$ as follows: $\bar{Y}=0$ when $\lVert X_{1}\rVert<2.5$ ; $\bar{Y}=1$ when $2.5\leq\lVert X_{1}\rVert<3.5$ ; $\bar{Y}=2$ when $\lVert X_{1}\rVert\geq 3.5$ . Then, let $X_{2}=(A+Bg(X_{1}))Y+N$ where $A$ and $B$ are matrices with i.i.d. entries from $\mathrm{Unif}[-2,2]$ and $N\sim\mathcal{N}(\bm{0},{\bf I}_{d_{2}})$ . The $(i,j)^{th}$ element of $g(X_{1})$ is given by $(\max_{k}(X_{1,k}))\cdot\sin(\frac{2\pi i}{s}\min_{k}(X_{1,k})+\frac{2\pi j}{p})$ . The sample sizes of the pretext training data and testing data are $10^{4}$ and $10^{3}$ , respectively. The MLPs used for the pretext task and two SL procedures all have two fully connected hidden layers with ReLU activation. The batch size is $32$ , the number of epochs is $10$ , and the learning rate is $0.001$ .
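The generation procedure above can be sketched in NumPy as follows. This is an illustrative reading of the description rather than the authors' code: $d_{1}=10$ is inferred from the expression $\sqrt{2}\,\Gamma(5.5)/\Gamma(5)$, while the values `d2=50` and `s=5`, the 1-based indexing of $(i,j)$, and the function names are assumptions. The label is one-hot encoded with $p=3$ classes.

```python
import math
import numpy as np

NUM_CLASSES = 3  # labels 0, 1, 2 as described in the text

def g_matrix(x1, s, p):
    """s x p matrix with entry max_k(x1) * sin(2*pi*i/s * min_k(x1) + 2*pi*j/p)."""
    i = np.arange(1, s + 1)[:, None]  # 1-based row index (an assumption)
    j = np.arange(1, p + 1)[None, :]  # 1-based column index (an assumption)
    return x1.max() * np.sin(2 * np.pi * i / s * x1.min() + 2 * np.pi * j / p)

def generate(n, d1=10, d2=50, s=5, seed=0):
    """Draw n samples (X1, X2, label); d2 and s are hypothetical choices."""
    p = NUM_CLASSES
    rng = np.random.default_rng(seed)
    A = rng.uniform(-2, 2, size=(d2, p))
    B = rng.uniform(-2, 2, size=(d2, s))
    X1 = rng.standard_normal((n, d1))
    norms = np.linalg.norm(X1, axis=1)
    # Label 0 if ||X1|| < 2.5, 1 if 2.5 <= ||X1|| < 3.5, and 2 otherwise.
    labels = np.digitize(norms, [2.5, 3.5])
    Y = np.eye(p)[labels]  # one-hot encoding of the label
    X2 = np.empty((n, d2))
    for t in range(n):
        C = A + B @ g_matrix(X1[t], s, p)           # d2 x p
        X2[t] = C @ Y[t] + rng.standard_normal(d2)  # add N ~ N(0, I)
    return X1, X2, labels

# Expected norm of a 10-dimensional standard Gaussian vector.
expected_norm = math.sqrt(2) * math.gamma(5.5) / math.gamma(5)

X1, X2, labels = generate(200)
print(f"E||X1|| = {expected_norm:.3f}, sample mean = {np.linalg.norm(X1, axis=1).mean():.3f}")
```

Because the thresholds $2.5$ and $3.5$ bracket the expected norm, all three classes appear with non-negligible frequency.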

Appendix J Further Details of the Computer Vision Task

The sample sizes of the pretext task and testing are fixed to be $20000$ and $1000$ , respectively. The edge of the triangle is sampled from $\mathrm{Unif}[8,32]$ , the radius of the circles is sampled from $\mathrm{Unif}[5,10]$ , and the pentagon is drawn within a circle with radius sampled from $\mathrm{Unif}[8,64/3]$ . The sizes are chosen to ensure that the average areas of the objects are similar. The pretext task and SL both use convolutional neural networks (CNNs) consisting of two convolution layers and two fully connected layers with ReLU activation. The learned representation is obtained from the second convolution layer of the CNN, which has a dimension $d_{2}=12544$ . Since the rotated image $X_{2}$ has the same label as $X_{1}$ , we also use the rotated images as additional labeled data for the downstream task and SL. In the pretext task, the batch size is $64$ , the number of epochs for training is $15$ , and the learning rate is $0.001$ . For SL, the batch size is modified to be $32$ since the sample size is much smaller. For ridge regression, we choose the shrinkage parameter $\lambda$ from $200$ numbers evenly spaced on a log scale over $[0.001,100]$ with $5$ -fold cross-validation.
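The selection of $\lambda$ can be sketched as below. The grid matches the description ($200$ log-spaced values over $[0.001,100]$) and the folds number $5$, but everything else is an assumption for illustration: the helper names, the use of held-out squared error as the selection criterion, and the toy random features, which are far smaller than $d_{2}=12544$.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Closed-form ridge solution W = (X^T X + lam I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def select_lambda(X, Y, lambdas, n_folds=5, seed=0):
    """Return the lambda with the smallest average held-out squared error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    cv_err = np.zeros(len(lambdas))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(X)), held_out)
        for idx, lam in enumerate(lambdas):
            W = ridge_fit(X[train], Y[train], lam)
            cv_err[idx] += np.mean((X[held_out] @ W - Y[held_out]) ** 2)
    cv_err /= n_folds
    return float(lambdas[int(np.argmin(cv_err))]), cv_err

# 200 values evenly spaced on a log scale over [0.001, 100].
lambdas = np.logspace(-3, 2, 200)

# Toy data: more features than labeled samples, with one-hot targets.
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 100))
Y = np.eye(3)[rng.integers(0, 3, size=60)]

best_lam, cv_err = select_lambda(X, Y, lambdas)
print(f"selected lambda = {best_lam:.4g}")
```

For features as large as $d_{2}=12544$ with few labeled samples, the dual form $X^{\top}(XX^{\top}+\lambda{\bf I})^{-1}Y$ would be cheaper than the $d_{2}\times d_{2}$ solve used in this sketch.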

In Section 6.2, we explain that the unsatisfactory performance of SSL for Triangle vs. Pentagon is potentially due to the analogous characteristics (i.e., the edges and vertices) that determine the orientation of the object. From a theoretical perspective, the full rank condition on $C^{h}(x)$ (which is a necessary condition for the exact matching) is violated since the columns of $C^{h}(x)$ are approximately linearly dependent. This does not happen for Triangle vs. Tangent Circles since the orientation of the circles is determined by curved edges. To further support our observation, we use Grad-CAM (Selvaraju et al., 2017) to visualize the contributing features that the pretext model used for rotation prediction. To highlight the major components, we only show pixels of the heatmap with top-20% intensity in Fig. 4.

Figure 4: Visualization of the contributing features for rotation prediction in the computer vision task.

Similarly to the stylized MNIST dataset, we add dot and dash patterns to the image background. The space in patterns is sampled from $\mathrm{Unif}[8,32]$ .

Figure 5: Geometric shape images with dot vs. dash background. SSL (red) and SL (blue).

Appendix K Experiments on the MNIST Dataset

The sample sizes for the pretext task and testing are $45000$ and $10000$ , respectively. We compare the performance of SSL and SL under different labeled sample sizes $\{50,100,200,400\}$ . The space $d$ in the sparse and dense patterns is sampled from $\mathrm{Unif}[3,8]$ and $\mathrm{Unif}[8,15]$ , respectively. We randomly shift each pattern by $\mathrm{Unif}[0,d]$ to avoid the position of the pattern being correlated with the image orientation. We use the same CNN configuration as the geometric shape task.

Figure 6: The original MNIST dataset and four stylized versions of MNIST. SSL (red) and SL (blue).

Figure 7: A visualization of the contributing features for MNIST dataset with dash vs. dot background.