Kang Du (University of Utah) and Yu Xiang (University of Utah)

Low-Rank Approximation of Structural Redundancy for Self-Supervised Learning

Abstract

We study the data-generating mechanism for reconstructive SSL to shed light on its effectiveness. With an infinite amount of labeled samples, we provide a sufficient and necessary condition for perfect linear approximation. The condition reveals a full-rank component that preserves the label classes of $Y$, along with a redundant component. Motivated by the condition, we propose to approximate the redundant component by a low-rank factorization and measure the approximation quality by introducing a new quantity $\varepsilon_s$, parameterized by the rank of factorization $s$. We incorporate $\varepsilon_s$ into the excess risk analysis under both linear regression and ridge regression settings, where the latter regularization approach is to handle scenarios when the dimension of the learned features is much larger than the number of labeled samples $n$ for downstream tasks. We design three stylized experiments to compare SSL with supervised learning under different settings to support our theoretical findings.

Keywords: Self-supervised learning, redundancy, low-rank approximation, ridge regression.

1 Introduction

Reconstructive self-supervised learning (SSL) has been highly successful in various fields (Pathak et al., 2016; Vincent et al., 2010; Radford et al., 2018; Devlin et al., 2018), where the theme is to extract representations from unlabeled data that are potentially useful for downstream tasks. One of the major advantages of SSL is its significantly reduced dependency on labeled data. Despite abundant empirical evidence, the theoretical understanding of the performance of SSL under limited labeled data is still insufficient.

In reconstructive SSL, a pretext task is designed to predict a target $X_2$ with input features $X_1$, which yields the learned representation $\psi(X_1)$. Then, the downstream task is to predict the target $Y$ using $\psi(X_1)$. Whether the learned representation is useful for the downstream task relies on the connections between the pretext and downstream tasks. To bridge the pretext and downstream tasks, the conditional independence (CI) assumption, namely $X_1 \perp\!\!\!\perp X_2 \mid Y$, has been studied in the seminal work (Lee et al., 2021). For the classification setting, they show that CI is a sufficient condition for a linear predictor to be optimal for the downstream task, that is, $\psi(X_1)$ can linearly predict $Y$ perfectly with an infinite number of samples available for the task. Motivated by this key observation, they provide theoretical guarantees showing the superior sample complexity of SSL under general approximate conditional independence settings.
However, a fundamental theoretical question for understanding reconstructive SSL still remains:

What is the sufficient and necessary condition on $(X_1, X_2, Y)$, in the classification setting, for a linear predictor to be optimal for the downstream task?

To address this question, it is helpful to express $X_2$ as $X_2 = h(X_1, Y) + N$, where $(X_1, Y)$ is for an arbitrary supervised learning task and $N := X_2 - \mathsf{E}[X_2 \mid X_1, Y]$. With this expression, roughly speaking, the target $Y$ can be decoded from $X_2$ if $h$ is invertible in some sense. We formalize this notion of invertibility by focusing on classification problems. Our formulation allows for general dependency between $X_2$ and $X_1$ when conditioning on $Y$. Thus there are features in the learned representation $\psi(X_1)$ that are redundant for the prediction of $Y$. For instance, for image classification problems, the redundant features may come from the background in the image; if the object of interest is surrounded by other objects in the background, a pretext task of predicting blocked patches of the image may mistakenly extract too many features from the background (Pathak et al., 2016). Without any constraints, a large percentage of redundant features can potentially make SSL fail.
To this end, we introduce a quantity $\varepsilon_s$, indexed by rank $s$, for a low-rank approximation of the redundancy in the learned representation. We show how $\varepsilon_s$ affects the performance of SSL through both our theoretical analysis and experiments.

Our main contributions are summarized below.

  1. Under the classification setting, we characterize in Section 3 a sufficient and necessary condition for a linear predictor to be optimal in the downstream task.

  2. In Section 4, we introduce a low-rank approximation quantity to characterize the redundancy in the learned representation.

  3. Based on the low-rank approximation quantity, we derive finite sample bounds on the excess risk and the corresponding sample complexity for both ordinary least squares and ridge regression estimators in Section 5.

  4. In Section 6.1, we design a simulation setting to demonstrate the effectiveness of the low-rank approximation. Our sufficient and necessary condition is partially verified through two computer vision tasks in Section 6.2.

1.1 Related work

Reconstructive SSL is focused on recovering deliberately concealed information in the data. In computer vision, examples include the prediction of blocked patches (Pathak et al., 2016), recovering the color (Zhang et al., 2016), denoising (Vincent et al., 2010), and identifying the rotated angle (Gidaris et al., 2018), while the simple scheme of next word prediction is widely adopted in NLP (Radford et al., 2018; Devlin et al., 2018). From the theoretical perspective, (Saunshi et al., 2020) and (Wei et al., 2021) study how pre-trained language models yield useful representation for downstream tasks. For computer vision tasks, (Pathak et al., 2016) provides a theoretical understanding of features learned by auto-encoders under a multi-view data assumption. Under a general formulation of reconstructive SSL, (Lee et al., 2021) shows that CI is sufficient for a linear predictor to be optimal in the downstream task and provides finite sample analysis. Since CI often fails to hold in practical settings, (Teng et al., 2022) proposes to modify the unlabeled data to make CI hold. Their theoretical analysis suggests that the modification is hurtful rather than helpful for the performance of SSL. The other popular type of SSL is called contrastive SSL, where the goal is to learn representations that make different views of the same data point closer. The CI assumption has been adopted in (Arora et al., 2019; Tosh et al., 2021) to provide theoretical guarantees for contrastive learning. In the context of contrastive SSL, CI is a natural assumption since, ideally, two views are expected to share less information given the label. The literature on SSL is vast and we refer the readers to (Gui et al., 2023; Ozbulak et al., 2023) for detailed reviews.

1.2 Notation

Throughout the paper, $\lVert\cdot\rVert$ denotes the $l_2$ norm for vectors or Frobenius norm for matrices. We use $\bm{0}$ and $\bm{1}$ to denote vectors or matrices of zeros and ones, respectively. For a full (column) rank matrix $A \in \mathbb{R}^{m \times n}$ with $n < m$, we use $A^{\dagger}$ to denote its (left) pseudoinverse. Let $\mathrm{Cov}(X)$ denote the covariance matrix of a random vector $X$ and $\mathrm{Cov}(X_1, X_2)$ denote that of two random vectors $X_1$ and $X_2$. We use $\widetilde{\mathcal{O}}$ to hide log factors and $\lesssim$ to hide constants in inequalities. We use $\mathbf{I}_d$ to denote the identity matrix of size $d \times d$.
A random vector $X \in \mathbb{R}^d$ is said to be $\sigma^2$-sub-Gaussian if $\mathsf{E}[X] = \bm{0}$ and $\mathsf{E}[e^{t u^\top X}] \leq e^{\frac{t^2 \sigma^2}{2}}$ for any $t \in \mathbb{R}$ and $u \in \mathbb{R}^d$ such that $\lVert u \rVert = 1$.

2 Problem Formulation: Reconstructive SSL

Consider $(X_1, X_2, Y) \in \mathcal{X}_1 \times \mathcal{X}_2 \times \mathcal{Y}$ where $X_2 \in \mathbb{R}^{d_2}$ and $Y$ are the target variables for the pretext and downstream tasks, respectively, and $X_1 \in \mathbb{R}^{d_1}$ is a vector of features shared by the two prediction tasks. We focus on the classification setting for the downstream task, i.e., $Y$ is categorical. For regression problems, one can consider the continuous target variable being discretized to a set of values.
We use $\bar{Y} \in \bar{\mathcal{Y}} = \{y_1, \ldots, y_{p+1}\}$ to denote the original label variable and $Y = (\mathbbm{1}_{\bar{Y} = y_1}, \ldots, \mathbbm{1}_{\bar{Y} = y_p})^\top$ to denote its one-hot encoding with one class excluded to avoid multicollinearity, since $\sum_{i=1}^{p+1} \mathbbm{1}_{\bar{Y} = y_i} = 1$; we will simply refer to $Y$ as the one-hot encoding of $\bar{Y}$ throughout this work. We assume $p < d_2$ throughout the work. For simplicity, we assume that the optimal predictors for different classes of $Y$ are not linearly dependent, i.e., $\mathrm{Cov}(\mathsf{E}[Y \mid X_1])$ has full rank; otherwise, certain classes of $Y$ can be hidden to make it hold.
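The encoding with one class excluded can be illustrated with a minimal sketch. The helper `one_hot`, the integer labels, and the added column of ones (standing in for an intercept) are assumptions made for illustration only and are not part of the paper's setup.

```python
import numpy as np

def one_hot(labels, num_classes, drop_last=True):
    """Indicator encoding of integer labels in {0, ..., num_classes - 1}.

    With drop_last=True the last class is excluded, so it maps to the zero vector.
    """
    labels = np.asarray(labels)
    kept = num_classes - 1 if drop_last else num_classes
    return (labels[:, None] == np.arange(kept)[None, :]).astype(float)

labels = [1, 2, 0, 1]                          # hypothetical labels, p + 1 = 3 classes
Y_full = one_hot(labels, 3, drop_last=False)   # shape (4, 3), every row sums to 1
Y = one_hot(labels, 3)                         # shape (4, 2), one class excluded

# Stack a column of ones next to each encoding and compare column ranks.
ones = np.ones((len(labels), 1))
rank_full = np.linalg.matrix_rank(np.hstack([ones, Y_full]))  # 4 columns
rank_drop = np.linalg.matrix_rank(np.hstack([ones, Y]))       # 3 columns
```

In this sketch the full encoding together with the column of ones is rank deficient, because its indicator columns sum to the column of ones, whereas the encoding with one class excluded keeps full column rank.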

Concretely, we consider the following reconstructive SSL procedure.

  1. Pretext task: Given unlabeled data, predict $X_2$ using $X_1$ under some function class $\Psi$, i.e., estimate $\psi^* := \arg\min_{\psi \in \Psi} \mathsf{E}[\lVert X_2 - \psi(X_1) \rVert^2]$.

  2. Downstream task: Given $n$ labeled data, regress $Y$ on the learned representation $\psi^*(X_1)$ using simple regression functions such as linear or ridge regression.

Since there is often a large amount of unlabeled data and one can adopt deep neural networks to achieve universal approximation, we fix $\psi^*(x) := \mathsf{E}[X_2 \mid X_1 = x]$ and focus on analyzing the downstream task. Due to the nature of the small (labeled) sample size of SSL, the function class for the downstream task is often assumed to have lower complexity compared to $\Psi$ (e.g., smaller parameter space). For theoretical analysis, we consider the class of all linear functions for the downstream task, as in (Lee et al., 2021). In practice, the advantage of SSL over supervised learning (SL) is more significant when the labeled sample size $n$ is relatively small, in which case the dimension of $\psi^*$ can be larger than $n$. To avoid the downstream task being ill-posed, we adopt the ridge estimator. To measure the gap between the SSL prediction and the optimal predictor $\mathsf{E}[Y \mid X_1]$ in infinite and finite samples, respectively, we define the approximation error and excess risk.

Definition 2.1.

Define the approximation error of SSL as $\mathsf{error}_{\text{apx}}^* := \min_{\beta} \mathsf{error}_{\text{apx}}(\beta)$, where

\[
\mathsf{error}_{\text{apx}}(\beta) := \mathsf{E}\left[\left\lVert \mathsf{E}[Y \mid X_1] - \beta \psi^*(X_1) \right\rVert^2\right] \tag{1}
\]

with $\psi^*(x) = \mathsf{E}[X_2 \mid X_1 = x]$, the optimal predictor of $X_2$ given $X_1$.

Note that $\psi^*(x) = \mathsf{E}[X_2 \mid X_1 = x]$ can be ensured by a function class with universal approximation power such as deep neural networks.

Definition 2.2.

We say there is an exact matching between $Y$ and $X_2$ given $X_1$ if $\mathsf{error}_{\text{apx}}^* = 0$.

For simplicity, we will omit the intercept $b(\beta) := \mathsf{E}[Y] - \beta \mathsf{E}[X_2]$ throughout the work. The performance of the downstream task is usually quantified through the so-called excess risk defined with respect to the finite sample analysis. Denote $X := \psi^*(X_1) = \mathsf{E}[X_2 \mid X_1]$. Let $\bm{X}_1 \in \mathbb{R}^{n \times d_1}$ and $\bm{Y} \in \mathcal{Y}^{n \times p}$ be the labeled data, and $\bm{X} := \psi^*(\bm{X}_1) \in \mathbb{R}^{n \times d_2}$ denote the learned representation from pretraining. For the downstream task and $\lambda \geq 0$, let

\[
\hat{\beta}_{\lambda} := \arg\min_{\beta} \frac{1}{n} \lVert \bm{Y} - \bm{X} \beta^\top \rVert^2 + \lambda \lVert \beta \rVert^2 = \bm{Y}^\top \bm{X} (\bm{X}^\top \bm{X} + \lambda n \mathbf{I}_{d_2})^{-1}.
\]

Definition 2.3.

The excess risk induced by the estimator $\hat{\beta}_{\lambda}$ is defined as $\mathcal{R}(\hat{\beta}_{\lambda}) := \mathsf{error}_{\text{apx}}(\hat{\beta}_{\lambda})$.
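The closed form of the ridge estimator can be checked numerically with a short sketch. The data here are synthetic Gaussian stand-ins for $\psi^*(\bm{X}_1)$ and $\bm{Y}$ (an assumption for illustration, not the paper's experimental setup), and the function names are hypothetical. The check confirms that the gradient of the penalized objective vanishes at the closed-form solution, including when $d_2 > n$.

```python
import numpy as np

def ridge_estimator(X, Y, lam):
    """Closed form beta_hat = Y^T X (X^T X + lam * n * I)^{-1}, of shape (p, d2)."""
    n, d2 = X.shape
    gram = X.T @ X + lam * n * np.eye(d2)
    # Solve gram @ B = X^T Y rather than forming the inverse explicitly.
    # gram is symmetric, so the transpose of B equals Y^T X gram^{-1}.
    return np.linalg.solve(gram, X.T @ Y).T

def objective_gradient(beta, X, Y, lam):
    """Gradient of (1/n) ||Y - X beta^T||_F^2 + lam ||beta||_F^2 with respect to beta."""
    n = X.shape[0]
    return -(2.0 / n) * (Y - X @ beta.T).T @ X + 2.0 * lam * beta

rng = np.random.default_rng(0)
n, d2, p = 20, 50, 3             # d2 > n: more features than labeled samples
X = rng.normal(size=(n, d2))     # synthetic stand-in for the learned representation
Y = rng.normal(size=(n, p))      # synthetic stand-in for the labels
beta_hat = ridge_estimator(X, Y, lam=0.1)
grad_norm = np.linalg.norm(objective_gradient(beta_hat, X, Y, lam=0.1))
```

With $\lambda > 0$ the matrix $\bm{X}^\top \bm{X} + \lambda n \mathbf{I}_{d_2}$ is positive definite, so the linear solve is well posed even though $\bm{X}^\top \bm{X}$ alone is singular when $d_2 > n$.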

The term “matching” can be viewed in the following sense: (1) the form of nonlinearity in $\mathsf{E}[Y \mid X_1]$ should be captured by $\mathsf{E}[X_2 \mid X_1]$; (2) the “redundant” nonlinearities in $\mathsf{E}[X_2 \mid X_1]$ should be linearly dependent so that they can be removed through a linear transform. As a toy example, consider $\mathsf{E}[Y \mid X_1] = X_1^2$ and $\mathsf{E}[X_2 \mid X_1] = (-X_1^2 + \sin(X_1),\, 0.5 \sin(X_1))^\top$ and note they share the same quadratic term $X_1^2$, while the sine functions in $\mathsf{E}[X_2 \mid X_1]$ are redundant for predicting $Y$. Observe that SSL with $\beta = (-1,\, 2)^\top$ extracts the quadratic term while eliminating the sine functions, since $-(-X_1^2 + \sin(X_1)) + 2 \cdot 0.5 \sin(X_1) = X_1^2$. In contrast, $\mathsf{E}[X_2 \mid X_1] = (X_1,\, 0.5 \cos(X_1))^\top$ will not lead to an exact matching.
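The toy example can be reproduced numerically with a minimal sketch. It assumes $X_1$ is standard normal (the text does not specify a distribution) and replaces the population minimization over $\beta$ with least squares on a finite sample, without an intercept; the helper name `linear_fit_error` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)        # assumed distribution for X_1
target = x1 ** 2                  # E[Y | X_1] in the toy example

# Representation whose redundant sine terms are linearly dependent.
psi_match = np.column_stack([-x1 ** 2 + np.sin(x1), 0.5 * np.sin(x1)])
# Representation that lacks the quadratic term.
psi_mismatch = np.column_stack([x1, 0.5 * np.cos(x1)])

def linear_fit_error(psi, target):
    """Best no-intercept linear fit of target on psi and its mean squared residual."""
    beta, *_ = np.linalg.lstsq(psi, target, rcond=None)
    return beta, np.mean((target - psi @ beta) ** 2)

beta_match, err_match = linear_fit_error(psi_match, target)
beta_mismatch, err_mismatch = linear_fit_error(psi_mismatch, target)
```

Under these assumptions, `beta_match` should recover $(-1,\, 2)$ with a residual at the level of floating-point error, while the second representation leaves a residual that stays clearly away from zero.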

Remark 2.4.

The notion that predicting a subset of $X$ can be helpful for predicting $Y$ is not limited to reconstructive SSL. For instance, in a series of recent papers (Du and Xiang, 2022, 2023a, 2023b), the authors have explored a similar direction from an invariance perspective for multi-environment domain adaptation, which has partially motivated this study.

3 Necessary and Sufficient Condition for Exact Matching

In an attempt to demystify the matching between the pretext and downstream tasks, we propose to identify the conditions on the generating mechanism of $(X_1, X_2, Y)$ that enable an exact matching. The generating mechanism of $(X_1, Y)$ in a supervised learning task is often complicated, and thus we make no assumptions on how $(X_1, Y)$ is generated and focus on the interactions between $(X_1, Y)$ and $X_2$. Without loss of generality, we can write $X_2$ in the following form

\[
X_2 = h(X_1, Y) + N, \tag{2}
\]

where $h(X_1, Y) := \mathsf{E}[X_2 \mid X_1, Y]$ is the regression function of $X_2$ on $(X_1, Y)$ and therefore the residual variable $N := X_2 - h(X_1, Y)$ satisfies $\mathsf{E}[N \mid X_1, Y] = 0$. The function $h$ captures how the label $Y$ and feature $X_1$ are encoded into $X_2$.

Equation (2) can be viewed from a causal perspective through a general structural causal model (SCM) (Pearl, 2009), $X_2 = f(X_1, Y, \varepsilon)$, where $\varepsilon$ is a vector of exogenous variables independent of $(X_1, Y)$. Since this general SCM suffers from identifiability issues, we focus on (2), observing that $h(X_1, Y) := \mathsf{E}[X_2 \mid X_1, Y] = \mathsf{E}[f(X_1, Y, \varepsilon) \mid X_1, Y]$. It is important to note that (2) is valid even when there is no underlying causal graph over $(X_1, X_2, Y)$.

Recall that Y𝑌Yitalic_Y is the one-hot encoding of Y¯¯𝑌\bar{Y}over¯ start_ARG italic_Y end_ARG. Observe that an arbitrary function h:(𝒳,𝒴)d:𝒳𝒴superscript𝑑h:(\mathcal{X},\mathcal{Y})\to\mathbb{R}^{d}italic_h : ( caligraphic_X , caligraphic_Y ) → blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT can be equivalently written as

h(X,Y)=j=1ph(X,ej)𝟙Y¯=yj=j=1ph(X,ej)ejY:=Ch(X)Y,𝑋𝑌superscriptsubscript𝑗1𝑝𝑋subscript𝑒𝑗subscript1¯𝑌subscript𝑦𝑗superscriptsubscript𝑗1𝑝𝑋subscript𝑒𝑗superscriptsubscript𝑒𝑗top𝑌assignsuperscript𝐶𝑋𝑌h(X,Y)=\sum_{j=1}^{p}h(X,e_{j})\mathbbm{1}_{\bar{Y}=y_{j}}=\sum_{j=1}^{p}h(X,e% _{j})e_{j}^{\top}Y:=C^{h}(X)Y,italic_h ( italic_X , italic_Y ) = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT italic_h ( italic_X , italic_e start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) blackboard_1 start_POSTSUBSCRIPT over¯ start_ARG italic_Y end_ARG = italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT italic_h ( italic_X , italic_e start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) italic_e start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_Y := italic_C start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( italic_X ) italic_Y , (3)

where we use the fact that 𝟙Y¯=yj=ejYsubscript1¯𝑌subscript𝑦𝑗superscriptsubscript𝑒𝑗top𝑌\mathbbm{1}_{\bar{Y}=y_{j}}=e_{j}^{\top}Yblackboard_1 start_POSTSUBSCRIPT over¯ start_ARG italic_Y end_ARG = italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT = italic_e start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_Y. This simple derivation implies a one-to-one correspondence between hhitalic_h and Chsuperscript𝐶C^{h}italic_C start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT, meaning that the general function model (2) can be expressed as

\[
X_{2}=h(X_{1},Y)+N=C^{h}(X_{1})Y+N, \tag{4}
\]

with $C^{h}:\mathcal{X}_{1}\to\mathbb{R}^{d_{2}\times p}$. The role of the latent random matrix $C^{h}(X_{1})$ is to encode the label variable $Y$ into $X_{2}$; thus we call $C^{h}$ the encoding function. The expression in (4) implies the identity $\operatorname{\sf E}[X_{2}|X_{1}]=C^{h}(X_{1})\operatorname{\sf E}[Y|X_{1}]$, which is equivalent to

\[
\operatorname{\sf E}[X_{2}|X_{1}]=(C^{h}(X_{1})+O(X_{1}))\operatorname{\sf E}[Y|X_{1}]:=\bar{C}^{h}(X_{1})\operatorname{\sf E}[Y|X_{1}], \tag{5}
\]

for any $O:\mathcal{X}_{1}\to\mathbb{R}^{d_{2}\times p}$ such that $O(X_{1})\operatorname{\sf E}[Y|X_{1}]=\bm{0}$. In words, the rows of $O(x)$ are orthogonal to $\operatorname{\sf E}[Y|X_{1}=x]$ for $\forall x\in\mathcal{X}_{1}$. We call such $O(x)$ an orthogonal term. For instance, the orthogonality holds when $\operatorname{\sf E}[Y|X_{1}]=(X_{1},X_{1}^{2})^{\top}$ and each row of $O(X_{1})$ is $(-X_{1},1)$, since $-X_{1}\cdot X_{1}+1\cdot X_{1}^{2}=0$.
Equation (5) defines an equivalence class of encoding functions $\mathcal{C}=\{\bar{C}^{h}\}$ that result in the same pretext representation $\psi^{*}(X_{1})=\operatorname{\sf E}[X_{2}|X_{1}]$. This shows that such orthogonal terms do not affect the analysis of SSL, and thus we use $\doteq$ to hide the orthogonal term (added to $C^{h}$) in equations throughout the paper.
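The construction in (3) and the invariance in (5) can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function `h`, the dimensions, and the probability vector `pi` (standing in for $\operatorname{\sf E}[Y|X_{1}=x]$) are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d2 = 3, 5  # number of classes and dimension of X2 (illustrative values)

def h(x, y):
    """Hypothetical nonlinear h(x, y) in R^{d2}; y is a one-hot vector."""
    k = np.arange(1, d2 + 1)
    label = np.argmax(y)
    return np.sin(k * x) * (1.0 + label) + np.cos(x) * label

def C_h(x):
    """Encoding matrix whose j-th column is h(x, e_j), as in (3)."""
    return np.column_stack([h(x, e) for e in np.eye(p)])

x = 0.7
for e in np.eye(p):
    # h(x, Y) = C^h(x) Y for every one-hot Y
    assert np.allclose(h(x, e), C_h(x) @ e)

# Orthogonal term: project each row of a random matrix off pi,
# so that O @ pi = 0 and the pretext representation is unchanged.
pi = rng.dirichlet(np.ones(p))            # stand-in for E[Y | X1 = x]
M = rng.normal(size=(d2, p))
O = M - np.outer(M @ pi, pi) / (pi @ pi)
assert np.allclose(O @ pi, 0.0)
assert np.allclose((C_h(x) + O) @ pi, C_h(x) @ pi)
```

Any matrix `O` built this way belongs to the family of orthogonal terms hidden by $\doteq$.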

Proposition 3.1.

The exact matching in Definition 2.2 holds if and only if $\beta C^{h}(x)\doteq{\bf I}_{p}$, for $\forall x\in\mathcal{X}_{1}$ and some $\beta\in\mathbb{R}^{p\times d_{2}}$.

Therefore, in this formulation, finding an exact matching is equivalent to inverting the encoding function $C^{h}$. Proposition 3.1 implies that the full rank of $C^{h}(x)$ for every $x\in\mathcal{X}_{1}$ is a necessary condition for the exact matching. In the following lemma, we provide a sufficient and necessary condition for the exact matching through a full characterization of the invertibility of $C^{h}$.

Lemma 3.2 (sufficient and necessary condition for exact matching).

There is an exact matching between $Y$ and $X_{2}$ given $X_{1}$ if and only if

\[
C^{h}(x)\doteq A\begin{bmatrix}{\bf I}_{p}\\ R(x)\end{bmatrix}\quad\text{ for }\forall x\in\mathcal{X}_{1}, \tag{6}
\]

for some invertible matrix $A\in\mathbb{R}^{d_{2}\times d_{2}}$ and an arbitrary matrix function $R:\mathcal{X}_{1}\to\mathbb{R}^{(d_{2}-p)\times p}$.

The identity mapping ${\bf I}_{p}$ fully preserves each class of $Y$, and $R(x)$ represents the redundancy encoded into $X_{2}$. It is worth noting that redundancy refers to the features extracted from $X_{1}$ that are predictive for $X_{2}$, but redundant for the prediction of $Y$ (given the optimal predictor $\operatorname{\sf E}[Y|X_{1}]$). In our stylized MNIST experiment in Section 6.2.2, the dash pattern in the background is redundancy since it is useful for predicting the image orientation (i.e., $X_{2}$), but it contains no information about the label. In contrast, the dot pattern is not redundant, since it is independent of both the image orientation and the label. The lemma above reveals that the label $Y$ should be encoded into $X_{2}$ through an invertible linear mixture (i.e., $A$) of the full label information and some redundancy.
When $A$ is an identity matrix, the first $p$ rows of $\psi^{*}(X_{1})=\operatorname{\sf E}[X_{2}|X_{1}]$ capture the full label information; thus the downstream task has a sparse solution $\beta^{*}=[{\bf I}_{p},\,\bm{0}]$. However, the solutions to the downstream task may not be sparse in general, and we handle this challenge in Section 4. Below are two examples with explicit forms of $C^{h}$.

Example 3.3.

An important special case of model (4) is $X_{2}=\widetilde{C}Y+N$, where $C^{h}\equiv\widetilde{C}$ is a constant function. In this case, the necessary and sufficient condition simplifies to the condition that $\widetilde{C}$ has full rank. Observe that $\operatorname{\sf E}[X_{2}|X_{1}]=\widetilde{C}\operatorname{\sf E}[Y|X_{1}]$ implies $\mathsf{error}_{\text{apx}}(\widetilde{C}^{\dagger})=0$.

In Appendix A, we show that the model $X_{2}=\widetilde{C}Y+N$ is equivalent to $\operatorname{\sf E}[X_{2}|X_{1},Y]=\operatorname{\sf E}[X_{2}|Y]$, which we call conditional mean independence; it is weaker than CI, i.e., $X_{1}\perp\!\!\!\perp X_{2}\,|\,Y$. The setting in Example 3.3 has been studied in (Lee et al., 2021) under CI. Despite the simplicity of conditional mean independence, it can be unrealistic in practical settings, since it requires that the label $Y$ is fully encoded into $X_{2}$ with no redundant information (depending on $X_{1}$), as if the pretext and downstream tasks were two equivalent prediction tasks.
Even though approximate conditional independence has been studied in (Lee et al., 2021), it is unclear if the approximation provides sufficient insights into explaining why and when SSL works (or fails), since conditional independence (or constant $C^{h}$) is not a necessary condition for exact matching.

Example 3.4 (partially linear model).

Define an invertible matrix $A\in\mathbb{R}^{d_{2}\times d_{2}}$ with a column partition as $A=[A_{1}|A_{2}|A_{3}]$, where $A_{1}$ has $p$ columns, $A_{2}$ has $k$ columns such that $1\leq k\leq d_{2}-p$, and $A_{3}$ has the rest of the columns. Let $X_{2}=A_{1}Y+A_{2}a(X_{1})+N$, where $a:\mathcal{X}_{1}\to\mathbb{R}^{k}$ satisfies $\operatorname{\sf E}[a(X_{1})]=\bm{0}$.
Its encoding function is $C^{h}(X_{1})=A_{1}+A_{2}a(X_{1})\bm{1}^{\top}$ as derived below. The sufficient and necessary condition is immediately satisfied with $A$ and $R(x)=[\bm{1}a^{\top}(x),\bm{0}]^{\top}$, where $R(x)$ has $d_{2}-p-k$ all-zero rows.

\[
h(X_{1},Y)=\sum_{j=1}^{p}(A_{2}a(X_{1})+A_{1}e_{j})e_{j}^{\top}Y=\big(A_{2}a(X_{1})\sum_{j}e_{j}^{\top}+A_{1}\sum_{j}e_{j}e_{j}^{\top}\big)Y:=C^{h}(X_{1})Y.
\]

Even though conditional independence fails to hold since $C^{h}$ is not constant, according to Lemma 3.2, there is still an exact matching.
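Example 3.4 can be verified numerically. The sketch below is only an illustration under assumed choices: the dimensions, the random matrix `A`, and the function `a` are hypothetical. It checks that $C^{h}(x)=A[{\bf I}_{p};R(x)]$ and that taking $\beta$ to be the first $p$ rows of $A^{-1}$ gives $\beta C^{h}(x)={\bf I}_{p}$ for every $x$, in line with Proposition 3.1 and Lemma 3.2.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, d2 = 3, 2, 7                      # illustrative dimensions, 1 <= k <= d2 - p

A = rng.normal(size=(d2, d2))           # a Gaussian matrix is invertible almost surely
A1, A2 = A[:, :p], A[:, p:p + k]        # column blocks of A = [A1 | A2 | A3]

def a(x):
    """Hypothetical a(x) in R^k (odd functions, zero mean for symmetric X1)."""
    return np.array([np.sin(x), x ** 3])

def C_h(x):
    """Encoding function of Example 3.4: A1 + A2 a(x) 1^T."""
    return A1 + np.outer(A2 @ a(x), np.ones(p))

def R(x):
    """Redundancy block: a(x) 1^T stacked on d2 - p - k all-zero rows."""
    return np.vstack([np.outer(a(x), np.ones(p)), np.zeros((d2 - p - k, p))])

beta = np.linalg.inv(A)[:p, :]          # first p rows of A^{-1}

for x in rng.normal(size=5):
    # Lemma 3.2: C^h(x) = A [I_p; R(x)]
    assert np.allclose(C_h(x), A @ np.vstack([np.eye(p), R(x)]))
    # Proposition 3.1: beta C^h(x) = I_p, so the label is linearly recoverable
    assert np.allclose(beta @ C_h(x), np.eye(p))
```

Note that `beta` does not depend on `x`, even though `C_h(x)` does.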

4 Structural Redundancy

The pretext representation $\psi^{*}(X_{1})$ is typically high-dimensional, designed to capture abundant information for various downstream tasks. Given limited labeled samples ($n\ll d_{2}$ in our notation), the downstream task is a high-dimensional linear regression problem. Without any assumptions such as sparsity of the true coefficients (Tibshirani, 1996; Candes and Tao, 2007) or low effective dimension of the features (Zhang, 2005; Hsu et al., 2012), SSL may not perform well even if the exact matching holds. Since the true coefficients are not necessarily sparse as discussed in Section 3, we explore how low-rank structures in the redundancy (i.e., $R(X_{1})$) naturally lead to a low effective dimension. In particular, we adopt the definition of effective dimension from (Hsu et al., 2012) in the context of ridge regression (see details below); a closely related notion is called effective degrees of freedom (Efron, 1986; Hastie et al., 2009). Roughly speaking, the effective dimension measures the number of features that are not linearly correlated; when it is low, a small number of labeled samples can be sufficient for reliable estimation.

To simplify the notation, denote $X:=\psi^{*}(X_{1})$ with covariance matrix $\Sigma:=\mathop{\rm Cov}\nolimits(X)$. Let $\{\lambda_{j}\}_{j=1}^{d_{2}}$ denote the eigenvalues of $\Sigma$. The population ridge estimator is given by

\[
\beta_{\lambda}=\arg\min_{\beta}\operatorname{\sf E}\left[\lVert Y-X\beta\rVert^{2}\right]+\lambda\lVert\beta\rVert^{2}=(\Sigma+\lambda{\bf I}_{d_{2}})^{-1}\operatorname{\sf E}[XY^{\top}].
\]

Ridge regression implicitly performs dimension reduction when an appropriate shrinkage parameter $\lambda$ is chosen. The reduced dimension for a chosen $\lambda$ can be quantified by the effective dimension, defined as $d_{\lambda}=\sum_{j=1}^{d}\frac{\lambda_{j}}{\lambda_{j}+\lambda}$ for $X\in\mathbb{R}^{d}$. Note that the bias of the ridge estimator increases monotonically as $\lambda$ increases. When $\Sigma$ has exactly $s$ nonzero eigenvalues, $d_{\lambda}$ is upper bounded by $s$ for any $\lambda\geq 0$. Besides this special case, a low effective dimension can be achieved under a weak penalty (i.e., small $\lambda$) when there is a large percentage of small eigenvalues.
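To make the definition concrete, the following minimal sketch computes $d_{\lambda}$ from the eigenvalues of a covariance matrix. The helper name `effective_dimension`, the tolerance used to discard numerically zero eigenvalues (so that each contributes $0$ rather than $0/0$ at $\lambda=0$), and the rank-3 test matrix are all assumptions made for illustration.

```python
import numpy as np

def effective_dimension(Sigma, lam):
    """d_lambda = sum_j lambda_j / (lambda_j + lam) for a PSD matrix Sigma.

    Numerically zero eigenvalues are dropped, which matches the convention
    that a zero eigenvalue contributes nothing, including at lam = 0.
    """
    eig = np.clip(np.linalg.eigvalsh(Sigma), 0.0, None)
    eig = eig[eig > 1e-12 * max(eig.max(), 1.0)]
    return float(np.sum(eig / (eig + lam)))

# Rank-3 covariance in d = 50 dimensions with eigenvalues 5, 2, 1.
rng = np.random.default_rng(2)
d, s = 50, 3
U, _ = np.linalg.qr(rng.normal(size=(d, s)))   # orthonormal columns
Sigma = U @ np.diag([5.0, 2.0, 1.0]) @ U.T

d0 = effective_dimension(Sigma, 0.0)   # counts the nonzero eigenvalues
d1 = effective_dimension(Sigma, 1.0)   # 5/6 + 2/3 + 1/2
print(d0, d1)
```

In this example the effective dimension never exceeds $s=3$, regardless of the ambient dimension $d$, and it shrinks as $\lambda$ grows.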

In the next subsection, we demonstrate that a low effective dimension is naturally attained when the redundancy $R(X_{1})$ can be approximated by a low-rank decomposition. Our finite sample analysis of the high-dimensional setting, presented in Section 5, relies on an upper bound on the effective dimension, utilizing a measure of the low-rank approximation introduced in the following subsection (Lemma H.1). When $d_{2}<n$, $d_{\lambda}$ with $\lambda=0$ offers an interpretation of our upper bound on the excess risk and sample complexity (see details in Theorem 5.2 and Remark 5.3).

4.1 Low-rank Approximation of Redundancy

Recall that redundancy refers to information in $X_{1}$ that is useful for predicting $X_{2}$ but not for the label $Y$. For instance, in computer vision tasks, consider a label $Y$ determined by the object of interest within a surrounding background. Redundancy arises when the pretext task captures background information unrelated to the label. If the background features simple patterns, such as sky, grassland, or beach, this redundancy can be considered low-rank. Consequently, it is relatively easy to eliminate such redundancy in downstream tasks (recall the cancellation of sine functions below Definition 2.3). In the following, we present the technical details of the low-rank approximation of redundancy.

Denote $\widetilde{C}:=\operatorname{\sf E}[C^{h}(X_{1})]$ and recall that $C^{h}(X_{1})$ reduces to $\widetilde{C}$ under conditional mean independence. Assume that the necessary and sufficient condition in Lemma 3.2 is satisfied; then

\begin{align*}
X=C^{h}(X_{1})\operatorname{\sf E}[Y|X_{1}] &=\left(\widetilde{C}+A\begin{bmatrix}\bm{0} & \left(R(X_{1})-\widetilde{R}\right)^{\top}\end{bmatrix}^{\top}\right)\operatorname{\sf E}[Y|X_{1}]\\
&=\left(\widetilde{C}+A_{p+1:d_{2}}\left(R(X_{1})-\widetilde{R}\right)\right)\operatorname{\sf E}[Y|X_{1}],
\end{align*}

where $\widetilde{R}:=\operatorname{\sf E}[R(X_{1})]$ and $A_{p+1:d_{2}}\in\mathbb{R}^{d_{2}\times(d_{2}-p)}$ denotes the last $d_{2}-p$ columns of $A$. If the (centered) redundancy $R(X_{1})-\widetilde{R}$ admits a low-rank decomposition, i.e., $R(X_{1})-\widetilde{R}=Bg(X_{1})$ for some $B\in\mathbb{R}^{(d_{2}-p)\times s}$ and $g:\mathcal{X}_{1}\to\mathbb{R}^{s\times p}$, where $s\ll d_{2}$, we get

\[
X=(\widetilde{C}+A_{p+1:d_{2}}Bg(X_{1}))\operatorname{\sf E}[Y|X_{1}], \tag{7}
\]

which has at most $p+s$ linearly independent components, as $\text{rank}(\widetilde{C})\leq p$ and $\text{rank}(A_{p+1:d_{2}}B)\leq s$. This shows that the effective dimension of $X$ is bounded by $p+s$ for any $\lambda\geq 0$.
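The rank bound can be illustrated by simulation. The sketch below draws samples from a model of the form (7) and checks that the stacked samples span at most $p+s$ dimensions. The matrices `A`, `B`, and `R_tilde`, the function `g`, and the softmax stand-in `class_probs` for $\operatorname{\sf E}[Y|X_{1}=x]$ are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
p, s, d2, n = 3, 2, 40, 500              # illustrative sizes with s << d2

A = rng.normal(size=(d2, d2))
A_tail = A[:, p:]                        # A_{p+1:d2}, the last d2 - p columns
R_tilde = rng.normal(size=(d2 - p, p))
C_tilde = A @ np.vstack([np.eye(p), R_tilde])   # mean encoding, rank <= p
B = rng.normal(size=(d2 - p, s))

def g(x):
    """Hypothetical g(x) in R^{s x p}."""
    return np.array([[np.sin((i + 1) * x + j) for j in range(p)]
                     for i in range(s)])

def class_probs(x):
    """Hypothetical E[Y | X1 = x]: softmax of linear scores in x."""
    z = x * np.linspace(-1.0, 1.0, p)
    w = np.exp(z - z.max())
    return w / w.sum()

X1 = rng.normal(size=n)
# Each row is one draw of X = (C_tilde + A_tail B g(X1)) E[Y | X1], as in (7).
X = np.stack([(C_tilde + A_tail @ B @ g(x)) @ class_probs(x) for x in X1])

rank = np.linalg.matrix_rank(X)
assert rank <= p + s
```

Every sample lies in the sum of the column spaces of `C_tilde` and `A_tail @ B`, so the covariance of `X` has at most $p+s$ nonzero eigenvalues even though $d_{2}=40$.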

Since the low-rank decomposition may not hold exactly for a chosen $s$, we identify $\hat{X}$ of the form $(\widetilde{C}+Bg(X_{1}))\operatorname{\sf E}[Y|X_{1}]$ that best approximates $X$. Specifically, for any fixed $s$ s.t. $1\leq s\leq d_{2}-p$, we consider $B\in\mathbb{R}^{d_{2}\times s},\,g:\mathcal{X}_{1}\to\mathbb{R}^{s\times p}$, and define

\[
\varepsilon_{s}:=\min_{\hat{X}}\frac{1}{d_{2}}\operatorname{\sf E}\left[\left\lVert X-\hat{X}\right\rVert^{2}\right]=\min_{B,g}\frac{1}{d_{2}}\operatorname{\sf E}\left[\left\lVert\left(C^{h}(X_{1})-\widetilde{C}-Bg(X_{1})\right)\operatorname{\sf E}[Y|X_{1}]\right\rVert^{2}\right], \tag{8}
\]

with minimizers $\{(B^{*},g^{*})\}$, and we use the shorthand $\widetilde{X}:=(\widetilde{C}+B^{*}g^{*}(X_{1}))\operatorname{\sf E}[Y|X_{1}]$ for any pair of minimizers $(B^{*},g^{*})$. Without loss of generality, we normalize $g(X_{1})$ and assume $\operatorname{\sf E}[\lVert g(X_{1})\rVert]=1$. The low-rank approximation error is averaged over the $d_{2}$ coordinates so that $\varepsilon_{s}$ does not grow with $d_{2}$. A challenge for analyzing $\varepsilon_{s}$ is that the minimizers do not have closed-form expressions. When $(X_{1},X_{2},Y)$ follows a Gaussian distribution, we show that (8) reduces to a weighted low-rank approximation problem (with no randomness) in Appendix E. However, these weighted problems do not have closed-form solutions in general (Srebro and Jaakkola, 2003; Dutta and Li, 2017).
Therefore, we leave further investigation of $\varepsilon_{s}$ for future work. For $s=0$, we simply define

\[
\varepsilon_{0} := \frac{1}{d_{2}} \operatorname{\sf E}\left[\left\lVert \left(C^{h}(X_{1})-\widetilde{C}\right)\operatorname{\sf E}[Y|X_{1}]\right\rVert^{2}\right],
\]

which measures how closely conditional mean independence (i.e., $C^{h}=\widetilde{C}$) holds. There is a tradeoff between the effective dimension of $\widetilde{X}$ (which is no greater than $s$) and the approximation quality, since the approximation level $\varepsilon_{s}$ is non-increasing in $s$. An important special case of the low-rank approximation is when the encoding functions are smooth.
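The monotonicity of $\varepsilon_{s}$ in $s$ can be checked numerically in a toy case. The sketch below is not part of the paper's analysis: it assumes that $X_{1}$ takes finitely many values $x_{1},\ldots,x_{K}$ and that $p=1$, so $\operatorname{\sf E}[Y|X_{1}]$ is a scalar. Under these assumptions (8) becomes a column-weighted low-rank approximation, which the Eckart-Young theorem solves through a singular value decomposition. The function name `eps_curve`, the random inputs, and the choice of $\widetilde{C}$ as the average of $C^{h}(x_{k})$ are all hypothetical.

```python
import numpy as np

def eps_curve(C_vals, C_tilde, probs, m_vals):
    """Return [eps_0, eps_1, ..., eps_K] for a discrete X1 and scalar E[Y | X1].

    C_vals : (d2, K) array, column k holds C^h(x_k)
    C_tilde: (d2,) array
    probs  : (K,) array, P(X1 = x_k)
    m_vals : (K,) array, E[Y | X1 = x_k]
    """
    d2 = C_vals.shape[0]
    # Column k of M is sqrt(P(X1 = x_k)) * E[Y | X1 = x_k] * (C^h(x_k) - C_tilde),
    # so the squared Frobenius norm of M equals d2 * eps_0.
    weights = np.sqrt(probs) * m_vals
    M = (C_vals - C_tilde[:, None]) * weights[None, :]
    energy = np.linalg.svd(M, compute_uv=False) ** 2
    # Eckart-Young: the best rank-s fit leaves the sum of the trailing
    # squared singular values, i.e. tails[s] = sum_{i >= s} energy[i].
    tails = np.concatenate([np.cumsum(energy[::-1])[::-1], [0.0]])
    return tails / d2

rng = np.random.default_rng(0)
d2, K = 20, 8
C_vals = rng.normal(size=(d2, K))        # hypothetical encoding values
probs = np.full(K, 1.0 / K)              # uniform distribution on the support
m_vals = rng.uniform(0.2, 0.8, size=K)   # hypothetical conditional means
C_tilde = C_vals @ probs                 # hypothetical choice of C tilde
eps = eps_curve(C_vals, C_tilde, probs, m_vals)
print(np.round(eps, 4))
```

The first entry of `eps` corresponds to $\varepsilon_{0}$ and the later entries to increasing $s$. Outside this discrete, scalar setting the minimization in (8) generally has no such closed form.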

Example 4.1 (smooth encoding function).

Consider a binary classification problem with a scalar predictor $X_{1}$, i.e., $p=d_{1}=1$. Assume that the encoding function $C^{h}:\mathbb{R}\to\mathbb{R}^{d_{2}}$ is twice continuously differentiable. Then its second-order Taylor expansion at $a\in\mathbb{R}$ is given by

\[
C^{h}(x) = C^{h}(a) + \begin{bmatrix}\derivative{C^{h}}{x}\big|_{x=a} & \tfrac{1}{2}\derivative[2]{C^{h}}{x}\big|_{x=a}\end{bmatrix}\begin{bmatrix}x-a & (x-a)^{2}\end{bmatrix}^{\top} + \mathcal{O}((x-a)^{3}),
\]

where we can choose $a$ so that $C^{h}(a)=\widetilde{C}$. This provides a rank-two approximation of $C^{h}(x)-\widetilde{C}$ such that $\varepsilon_{s}=\mathcal{O}((x-a)^{3})$ with $s=2$. This example can be generalized to higher-order, multi-class, and multivariate cases; we provide the details in Appendix D.
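The rank-two structure in this example can be illustrated numerically. The sketch below assumes a hypothetical smooth encoding function $C^{h}:\mathbb{R}\to\mathbb{R}^{3}$ (not taken from the paper) and builds $B$ from the first derivative and half the second derivative at $a$, with $g(x)=(x-a,(x-a)^{2})^{\top}$. It then checks that the residual of $C^{h}(x)-C^{h}(a)-Bg(x)$ shrinks roughly by a factor of $2^{3}$ when the distance $x-a$ is halved.

```python
import numpy as np

def C_h(x):
    # Hypothetical smooth encoding function R -> R^3, chosen only for illustration.
    return np.array([np.sin(x), np.exp(x), np.cos(2.0 * x)])

a = 0.0
first = np.array([1.0, 1.0, 0.0])     # dC^h/dx evaluated at a = 0
second = np.array([0.0, 1.0, -4.0])   # d^2C^h/dx^2 evaluated at a = 0

# B has two columns, so B @ g(x) is a rank-two factorization of the Taylor terms;
# the factor 1/2 of the quadratic term is absorbed into B.
B = np.column_stack([first, 0.5 * second])

def g(x):
    return np.array([x - a, (x - a) ** 2])

def residual_norm(x):
    return np.linalg.norm(C_h(x) - C_h(a) - B @ g(x))

# Halving the distance to a should shrink a cubic residual by about 2^3 = 8.
ratio = residual_norm(0.1) / residual_norm(0.05)
print(ratio)
```

The ratio is only approximately 8 because higher-order terms of the expansion still contribute at these step sizes.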

To understand how the size of $\varepsilon_{s}$ affects how closely the matching holds (i.e., how small the approximation error is), we derive the following upper bound via a ridge-type estimator. Unlike ridge-type estimators used in practice, the parameter $\varepsilon_{s}$ that restricts the size of the coefficients is determined by the generating mechanism of $(X_{1},X_{2},Y)$.

Lemma 4.2.

For $B^{*}$ and $g^{*}$ attaining $\varepsilon_{s}$, we have

\[
\min_{\beta}\mathsf{error}_{\text{apx}}(\beta) \leq 2\left(p+\operatorname{\sf E}[\lVert N\rVert^{2}]\right)\min_{\beta}\left(\lVert{\bf I}_{p}-\beta\widetilde{C}\rVert^{2}+\lVert\beta B^{*}\rVert^{2}+\varepsilon_{s}\lVert\beta\rVert^{2}\right),
\]

where the minimum of the RHS is attained at $\beta_{s}:=(\widetilde{C}^{\top}\widetilde{C}+(B^{*})^{\top}B^{*}+\varepsilon_{s}I)^{-1}\widetilde{C}$. Equality holds, with the RHS equal to zero, when $\varepsilon_{s}=0$.
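The ridge-type structure of the RHS can be sketched numerically. The code below makes several assumptions that are not stated in this section: the norms are Frobenius norms, $\beta\in\mathbb{R}^{p\times d_{2}}$, $\widetilde{C}\in\mathbb{R}^{d_{2}\times p}$, $B^{*}\in\mathbb{R}^{d_{2}\times s}$, and $\varepsilon_{s}>0$. Under these shapes, setting the gradient to zero gives $\beta(\widetilde{C}\widetilde{C}^{\top}+B^{*}(B^{*})^{\top}+\varepsilon_{s}I)=\widetilde{C}^{\top}$. The transposes are arranged for this convention and may differ from the paper's notation. The function names and random inputs are hypothetical.

```python
import numpy as np

def ridge_type_minimizer(C_tilde, B_star, eps_s):
    """Minimize ||I_p - beta C||_F^2 + ||beta B||_F^2 + eps_s ||beta||_F^2 over beta (p x d2)."""
    d2 = C_tilde.shape[0]
    gram = C_tilde @ C_tilde.T + B_star @ B_star.T + eps_s * np.eye(d2)
    # Stationarity reads beta @ gram = C^T; gram is symmetric, so solve gram @ beta^T = C.
    return np.linalg.solve(gram, C_tilde).T

def objective(beta, C_tilde, B_star, eps_s):
    p = C_tilde.shape[1]
    return (np.linalg.norm(np.eye(p) - beta @ C_tilde) ** 2
            + np.linalg.norm(beta @ B_star) ** 2
            + eps_s * np.linalg.norm(beta) ** 2)

rng = np.random.default_rng(0)
d2, p, s, eps_s = 20, 2, 5, 0.1
C_tilde = rng.normal(size=(d2, p))   # hypothetical full-rank component
B_star = rng.normal(size=(d2, s))    # hypothetical low-rank factor
beta_s = ridge_type_minimizer(C_tilde, B_star, eps_s)

# Any perturbation of the minimizer should not decrease the convex objective.
perturbed = beta_s + 1e-3 * rng.normal(size=beta_s.shape)
print(objective(beta_s, C_tilde, B_star, eps_s),
      objective(perturbed, C_tilde, B_star, eps_s))
```

The penalty weight $\varepsilon_{s}$ plays the role that a tuning parameter plays in ordinary ridge regression, except that here it is fixed by the data-generating mechanism.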

5 Finite Sample Analysis

Let $\beta^{*}\in\arg\min_{\beta}\mathsf{error}_{\text{apx}}(\beta)$ be a fixed true parameter for the downstream task. Recall that the excess risk is defined as $\mathcal{R}(\hat{\beta}_{\lambda})=\operatorname{\sf E}[\lVert\operatorname{\sf E}[Y|X_{1}]-\hat{\beta}_{\lambda}X\rVert^{2}]$. Under conditional mean independence, observe that $X=\operatorname{\sf E}[X_{2}|X_{1}]=\widetilde{C}\operatorname{\sf E}[Y|X_{1}]$ is a feature vector with at most $p$ (out of $d_{2}$) linearly independent features. Since the number of classes $p$ is often much smaller than the dimension of the learned representation $d_{2}$, the design matrix $\bm{X}$ for the downstream task is of low rank. This enables a finite-sample bound on the excess risk of order $\widetilde{\mathcal{O}}(\frac{p}{n}\sigma^{2})$ with sample complexity $\widetilde{\mathcal{O}}(p)$ (Lee et al., 2021), where the bound is independent of the dimension $d_{2}$. In the following, we provide a finite-sample analysis of SSL in the general setting where conditional independence can be violated, based on the low-rank approximation defined in (8).

First, we introduce a few technical assumptions. Let $\Sigma$, $\widetilde{\Sigma}$, and $\bar{\Sigma}$ denote the covariance matrices of $X$, $\widetilde{X}$, and $\bar{X}:=X-\widetilde{X}$, respectively.

Assumption 1

We assume $N:=Y-\operatorname{\sf E}[Y|X_{1}]$ is $\sigma^{2}$-sub-Gaussian, and the whitened feature vectors $\widetilde{\Sigma}^{-1/2}\widetilde{X}$ and $\bar{\Sigma}^{-1/2}\bar{X}$ are $\rho^{2}$-sub-Gaussian.\footnote{When $\widetilde{\Sigma}$ or $\bar{\Sigma}$ is not invertible, the whitened feature vector is defined through the generalized inverse.}

Assumption 2

There exist $\tilde{b},\bar{b}\geq 0$ such that the following hold almost surely:

  • $\lVert\widetilde{\Sigma}^{-1/2}\widetilde{X}(\operatorname{\sf E}[Y|X_{1}]-\beta^{*}X)^{\top}\rVert\leq\tilde{b}\sqrt{p+s}$;

  • $\lVert\bar{\Sigma}^{-1/2}\bar{X}(\operatorname{\sf E}[Y|X_{1}]-\beta^{*}X)^{\top}\rVert\leq\bar{b}\sqrt{d_{2}}$.

Remark 5.1.

A similar assumption has been made in (Lee et al., 2021, Assumption 3.3), which is motivated by (Hsu et al., 2012, Condition 4).

Let $\lambda_{\text{max}}(A)$ denote the largest eigenvalue of a symmetric real matrix $A\neq\bm{0}$, let $\lambda_{\text{min}\neq 0}(A)$ denote its smallest nonzero eigenvalue, and let $\{\lambda_{i}(A)\}$ denote the set of all its eigenvalues. When $\widetilde{X}$ is a good approximation of $X$, we expect $\widetilde{\Sigma}-\Sigma$ and $\bar{\Sigma}$ to be close to zero matrices. Therefore, we consider restricting the largest eigenvalues of the two matrices, respectively. A generic bound for a $d\times d$ matrix $A$ is provided in (Wolkowicz and Styan, 1980), namely $\lambda_{\text{max}}(A)\leq\frac{\text{tr}(A)}{d}+\sqrt{d-1}\cdot s(A)$, where $s(A):=\sqrt{\frac{\text{tr}(A^{2})}{d}-\frac{\text{tr}^{2}(A)}{d^{2}}}$ is the standard deviation of $\{\lambda_{i}(A)\}$. Equality holds when the $d-1$ smallest eigenvalues are equal. However, this bound can be quite loose when $d$ is large. Instead, we make the following assumption.

Assumption 3

For some universal constants $c_{1}\geq 0$ and $c_{2}\geq 0$,

  • $\lambda_{\text{max}}(\widetilde{\Sigma}-\Sigma)\leq c_{1}\frac{1}{d_{2}}\cdot\lvert\text{tr}(\widetilde{\Sigma}-\Sigma)\rvert$;

  • $\lambda_{\text{max}}(\bar{\Sigma})\leq c_{2}\frac{1}{d_{2}}\cdot\text{tr}(\bar{\Sigma})$.
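To make the comparison between the generic trace bound and Assumption 3 concrete, the sketch below evaluates both on a hypothetical random covariance-like matrix with a spread-out spectrum. Here $s(A)$ is taken to be the standard deviation of the eigenvalues, and `trace_ratio` returns the smallest constant $c$ for which $\lambda_{\text{max}}(A)\leq c\,\lvert\text{tr}(A)\rvert/d$ holds. The function names and the test matrices are assumptions made only for illustration.

```python
import numpy as np

def wolkowicz_styan_upper(A):
    """Trace-based upper bound on the largest eigenvalue of a symmetric matrix A."""
    d = A.shape[0]
    mean = np.trace(A) / d
    var = np.trace(A @ A) / d - mean ** 2   # variance of the eigenvalues
    return mean + np.sqrt(d - 1) * np.sqrt(max(var, 0.0))

def trace_ratio(A):
    """Largest eigenvalue divided by |tr(A)| / d, i.e. the constant in Assumption 3."""
    d = A.shape[0]
    return np.linalg.eigvalsh(A)[-1] * d / abs(np.trace(A))

rng = np.random.default_rng(0)
d = 200
G = rng.normal(size=(d, 2 * d))
A = G @ G.T / (2 * d)                 # hypothetical matrix with a spread-out spectrum
lam_max = np.linalg.eigvalsh(A)[-1]
print(lam_max, wolkowicz_styan_upper(A), trace_ratio(A))

# The bound is tight when the d - 1 smallest eigenvalues coincide.
spiked = np.diag([5.0, 1.0, 1.0, 1.0])
print(wolkowicz_styan_upper(spiked))
```

For the spread-out spectrum the generic bound grows with $\sqrt{d-1}$ while the trace ratio stays moderate, which is the regime that Assumption 3 targets.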

Both inequalities require that the largest eigenvalue be comparable to the average eigenvalue. The assumption can be unrealistic when $\widetilde{\Sigma}-\Sigma$ or $\bar{\Sigma}$ has mostly zero eigenvalues but a few large positive eigenvalues. We explain why such a case does not arise when $s\ll d_{2}$. Case I: When $\text{rank}(\Sigma)\ll d_{2}$, choosing $s=\text{rank}(\Sigma)$ gives $\widetilde{X}=X$ and $\bar{X}=\bm{0}$, in which case the inequalities are satisfied with $c_{1}=c_{2}=0$. Case II: When $\text{rank}(\Sigma)$ is comparable to $d_{2}$ (i.e., $\text{rank}(\Sigma)$ is a fraction of $d_{2}$), $\widetilde{X}$ satisfies $\text{rank}(\widetilde{\Sigma})\leq p+s\ll d_{2}$, and thus the rank of $\widetilde{\Sigma}-\Sigma$ should be greater than $d_{2}-p-s$, meaning that most of its eigenvalues are nonzero. Similarly, $\bar{X}=X-\widetilde{X}$ should have at least $d_{2}-p-s$ linearly independent components, i.e., $\text{rank}(\bar{\Sigma})\geq d_{2}-p-s$. Given that $\widetilde{X}$ serves as an approximation of $X$ with a lower effective dimension, we make the following technical assumption on the rank.

Assumption 4

$\text{rank}(\widetilde{\Sigma})\leq\text{rank}(\Sigma)$ and $\text{rank}(\bar{\Sigma})\leq\text{rank}(\Sigma)$.

Since $X$ has $\text{rank}(\Sigma)$ linearly independent components while $\widetilde{X}$ is introduced to approximate $\text{rank}(\widetilde{\Sigma})$ independent components out of them, $\bar{X}=X-\widetilde{X}$ is expected to have fewer independent components than $X$.

Theorem 5.2.

Under Assumptions 1--4, for any $\delta\in(0,1)$, if $n\gg\rho^{4}(d_{2}+\log\frac{1}{\delta})$, the excess risk of the downstream task induced by $\hat{\beta}_{0}$ is upper bounded, with probability at least $1-\delta$, by

\[
\mathcal{R}(\hat{\beta}_{0}) \leq \mathsf{error}_{\text{apx}}^{*}+\widetilde{\mathcal{O}}\left((1+\varepsilon_{s})\frac{p+s}{n}\sigma^{2}+\varepsilon_{s}\frac{d_{2}}{n}\sigma^{2}\right).
\]

Remark 5.3.

When $\varepsilon_{s}=0$, if $n\gg\rho^{4}(p+s+\log\frac{1}{\delta})$, we have $\mathcal{R}(\hat{\beta}_{0})\leq\widetilde{\mathcal{O}}\left(\frac{p+s}{n}\sigma^{2}\right)$.
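The interplay between the two estimation terms in Theorem 5.2 can be tabulated with a few lines of code. The sketch below evaluates only the expression inside $\widetilde{\mathcal{O}}(\cdot)$, dropping all constants and logarithmic factors, so the numbers indicate scaling rather than actual risk values. The function name and the parameter values are hypothetical.

```python
def estimation_rate(eps_s, p, s, d2, n, sigma2):
    """Evaluate (1 + eps_s)(p + s)/n * sigma^2 + eps_s * d2/n * sigma^2.

    Constants and logarithmic factors hidden by the O-tilde notation are dropped.
    """
    low_rank_term = (1.0 + eps_s) * (p + s) / n * sigma2   # scales with p + s
    residual_term = eps_s * d2 / n * sigma2                # scales with d2, weighted by eps_s
    return low_rank_term + residual_term

# Hypothetical configuration; only the trend in eps_s matters.
for eps_s in (0.0, 0.01, 0.1):
    print(eps_s, estimation_rate(eps_s, p=2, s=5, d2=20, n=300, sigma2=1.0))
```

With $\varepsilon_{s}=0$ the dependence on $d_{2}$ disappears and the expression reduces to $\frac{p+s}{n}\sigma^{2}$.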

The proof of Theorem 5.2 follows a similar idea to that of (Lee et al., 2021, Theorem 3.5); a subtle yet important difference is that we consider approximation errors due to violations of the exact matching, while they consider approximation errors due to the choice of the function class $\Psi$. When $\varepsilon_{s}=0$, the dominating rate of $\mathcal{R}(\hat{\beta}_{0})$ is $\frac{p+s}{n}\sigma^{2}$, which shows that SSL enjoys a sample complexity similar to that in (Lee et al., 2021) even when conditional independence is violated. We have demonstrated in Lemma 4.2 how the approximation error $\mathsf{error}_{\text{apx}}(\beta^{*})$ depends on the approximation level $\varepsilon_{s}$. We also provide the bound with respect to $\hat{\beta}_{\lambda}$, stated below. The proof largely follows (Hsu et al., 2012, Theorem 16), and we only outline the main steps in Appendix H.

Corollary 5.4 (Informal).

Under suitable assumptions, the excess risk of the downstream task induced by $\hat{\beta}_{\lambda}$ can be upper bounded by

\[
\mathcal{R}(\hat{\beta}_{\lambda})\leq\mathsf{error}_{\text{apx}}^{*}+\operatorname{\sf E}[\lVert(\beta_{\lambda}-\beta^{*})X\rVert^{2}]+\mathcal{O}\left(\frac{p+s}{n}\left(1+\frac{\varepsilon_{s}}{\lambda}\right)\tilde{\sigma}^{2}\right),
\]

with high probability, where $\tilde{\sigma}^{2}:=\mathsf{error}_{\text{apx}}(\beta_{\lambda})+\operatorname{\sf E}[\lVert(\beta_{\lambda}-\beta^{*})X\rVert^{2}]+\sigma^{2}$.

For simplicity, the parameters that depend on the choice of $\lambda$ are omitted. The bound requires $p+s\ll n$ even when $n<d_{2}$; thus an approximation (8) with lower rank is expected in this more challenging setting. The term $\operatorname{\sf E}[\lVert(\beta_{\lambda}-\beta^{*})X\rVert^{2}]$ depends on the difference between $\beta_{\lambda}$ and $\beta^{*}$, and hence on the choice of $\lambda$. When $\beta_{\lambda}=\beta^{*}$ and $\varepsilon_{s}=0$, the dominating rate $\frac{p+s}{n}\sigma^{2}$ is the same as that in Remark 5.3. This shows that low-rank structures enable SSL to share a similar excess risk upper bound and sample complexity in low- and high-dimensional settings.
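As a reminder of how a ridge estimator remains well defined when $n<d_{2}$, the sketch below computes a multi-output ridge solution in two equivalent ways. The exact definition of $\hat{\beta}_{\lambda}$ is not restated in this section, so the scaling of the penalty by $n\lambda$, the row-wise data layout, and the function names are assumptions. The dual form only inverts an $n\times n$ matrix, which is the cheaper route in the high-dimensional setting.

```python
import numpy as np

def ridge_primal(X, Y, lam):
    """Ridge coefficients (p x d2) from the d2 x d2 normal equations."""
    n, d2 = X.shape
    W = np.linalg.solve(X.T @ X + n * lam * np.eye(d2), X.T @ Y)
    return W.T

def ridge_dual(X, Y, lam):
    """Same coefficients from the n x n dual system, convenient when n < d2."""
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + n * lam * np.eye(n), Y)
    return (X.T @ alpha).T

rng = np.random.default_rng(0)
n, d2, p, lam = 50, 200, 2, 0.1     # hypothetical sizes with n < d2
X = rng.normal(size=(n, d2))        # rows are learned feature vectors
Y = rng.normal(size=(n, p))         # rows are downstream targets
beta_primal = ridge_primal(X, Y, lam)
beta_dual = ridge_dual(X, Y, lam)
print(np.max(np.abs(beta_primal - beta_dual)))
```

The two forms agree because $(X^{\top}X+cI)^{-1}X^{\top}=X^{\top}(XX^{\top}+cI)^{-1}$ for any $c>0$.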

6 Experiments

We propose a synthetic dataset and two computer vision tasks to examine the importance of the full-rank condition on $C^{h}(x)$ and the low-rank approximation quality. Recall that a necessary condition for the exact matching is that $C^{h}(x)$ has full rank for every $x\in\mathcal{X}_{1}$. For the synthetic dataset, we ensure that $C^{h}(x)$ is of full rank and focus on the low-rank approximation. For image data, since $C^{h}$ is a latent matrix function, it is not straightforward to test whether $C^{h}(x)$ has full rank in general. To this end, we design images of simple geometric shapes that can be seen as abstractions of real images, and show how some geometric properties make the rank condition hold or fail. To further understand the importance of the low-rank approximation, we add background patterns to the MNIST dataset and show that certain patterns can lead to poor low-rank approximation. See more details of the experiments in Appendices I, J, and K. SSL approaches have achieved superior performance on large benchmark datasets, but the function class for downstream tasks there is often much larger than linear models (e.g., MLPs), which is beyond the scope of our theoretical analysis. The implementation of our experiments is provided at \url{https://github.com/dukang4655/reconstructive_ssl}.

6.1 Synthetic Data

Figure 1: Setting I (top row): $n=300$ and $s$ varies. Setting II (bottom row): $s=5$ and $n$ varies. SSL (red), $\text{SL}_{1}$ (blue), and $\text{SL}_{2}$ (green). Results over $50$ repeated experiments. Solid lines: mean; shaded region: standard deviation.

We use a synthetic dataset to verify our theoretical results when $n>d_{2}$. First, we generate $(X_{1},X_{2},Y)$ with $d_{1}=10$, $d_{2}=20$, and $p=2$, where $X_{2}=(A+Bg(X_{1}))Y+N$ with $B\in\mathbb{R}^{d_{2}\times s}$. See details of the model parameters in Appendix I. We compare SSL with two supervised learning (SL) procedures in two settings: I. Fix $n=300$ and vary the low-rank approximation by considering $s\in\{1,2,\ldots,5,10,15,20\}$; II. Fix $s=5$ and vary the labeled sample size $n\in\{100,200,400,800,1600\}$. The two supervised learning procedures are $\text{SL}_{1}$, which predicts $Y$ from $X_{1}$, and $\text{SL}_{2}$, which predicts $Y$ from $(X_{1},X_{2})$. We use MLPs for the pretext task and the two supervised learning procedures. In Fig. 1, the performance of $\text{SL}_{1}$ is roughly invariant with respect to $s$ since it does not use $X_{2}$ for the prediction, making it more robust than $\text{SL}_{2}$. This verifies that predictions using the parents of $Y$ as predictors (which we call causal predictions) are more robust than non-causal predictions under a small sample size. The superior performance of SSL degrades as $s$ increases. When $s<d_{2}-p=18$, we have $\varepsilon_{s}=0$ according to the factorization in (7). In Fig. 1, when $s=20$, a low-rank approximation (8) could lead to a large approximation error $\varepsilon_{\bar{s}}$ for $\bar{s}\leq 18$. As a consequence, the advantage of SSL over $\text{SL}_{1}$ is much smaller compared with the case with $s=0$. This indicates that a good low-rank approximation is not only sufficient but also necessary for SSL. In the other setting, where $s$ is fixed to $5$ and the sample size $n$ varies, as shown in Fig. 1, the performance of SSL improves slowly with increasing sample size, since it already achieves performance that is close to optimal (i.e., the performance of $\text{SL}_{1}$ with a large $n$) under a small $n$. $\text{SL}_{1}$ starts to catch up with SSL when $n\geq 800$, while the accuracy of $\text{SL}_{2}$ does not consistently improve as $n$ increases.
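The generating equation $X_{2}=(A+Bg(X_{1}))Y+N$ can be sketched in code as follows. The actual model parameters are given in Appendix I and are not reproduced here, so every distributional choice below is a hypothetical placeholder: Gaussian $X_{1}$, a logistic label model driven by the first coordinate of $X_{1}$, a `tanh` map for $g$, Gaussian $A$ and $B$, and Gaussian noise. Only the dimensions $d_{1}=10$, $d_{2}=20$, $p=2$ and the structure of the equation come from the text.

```python
import numpy as np

def generate_synthetic(n, s, d1=10, d2=20, noise_std=0.1, seed=0):
    """Draw (X1, X2, Y) with X2 = (A + B g(X1)) Y + N for one-hot binary Y (p = 2)."""
    p = 2
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(d2, p))                   # full-rank component (placeholder)
    B = rng.normal(size=(d2, s))                   # rank-s redundant component (placeholder)
    W = rng.normal(size=(s * p, d1)) / np.sqrt(d1)

    X1 = rng.normal(size=(n, d1))
    # Placeholder label model: Y depends on X1 through its first coordinate.
    prob_class1 = 1.0 / (1.0 + np.exp(-X1[:, 0]))
    labels = (rng.uniform(size=n) < prob_class1).astype(int)
    Y = np.eye(p)[labels]                          # one-hot labels, shape (n, p)

    # Placeholder g: maps each X1 to an s x p matrix with bounded entries.
    G = np.tanh(X1 @ W.T).reshape(n, s, p)
    mixing = A[None, :, :] + np.einsum("ds,nsp->ndp", B, G)   # (n, d2, p)
    noise = noise_std * rng.normal(size=(n, d2))
    X2 = np.einsum("ndp,np->nd", mixing, Y) + noise
    return X1, X2, Y

X1, X2, Y = generate_synthetic(n=300, s=5)
print(X1.shape, X2.shape, Y.shape)
```

Varying `s` with `n` fixed, or `n` with `s` fixed, mirrors the two settings described above, although the resulting accuracies will depend on the placeholder choices.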

6.2 Computer Vision Tasks

6.2.1 Geometric Shapes (On the Rank Condition)

In computer vision applications, it is common that the dimension of the learned representation is much larger than the labeled sample size (i.e., $d_{2}\gg n$) for SSL. We design a simple task to help understand how the patterns in an image make SSL work or fail; the task is inspired by Gidaris et al. (2018), where the pretext task is to predict the rotation angles of images. We consider $X_{1}$ as a random image of some objects, where the objects have random sizes and random locations. The goal is to classify the shape of the object $\bar{Y}$. We create $X_{2}$ by randomly rotating $X_{1}$ by $0$ or $90$ degrees. Observe that the location and the size of an object are redundant features for predicting its shape and orientation. This stylized setting, even though much simpler than real-world images, is designed this way on purpose so that the redundancy variable has low-rank approximations. According to (4), the $j^{th}$ column of $C^{h}(x)$ can be viewed as a feature vector for the rotation angle corresponding to the $j^{th}$ class of objects. Since we consider the classification of two classes (i.e., $p=2$), the condition requires that the two feature vectors should not be similar. To verify this necessary condition, we consider two pairs of objects:

Figure 2: (a) Left: Triangle vs. Tangent Circles. (b) Right: Triangle vs. Pentagon. $50$ repeated experiments. SSL (red) and SL (blue).

Triangle vs. Tangent Circles (Fig. 2(a)). In this case, the identification of orientation is based on characteristics specific to the two shapes. For triangles, it is natural to focus on the edges and vertices, while those characteristics are not even defined for circles. Thus, we think the rank condition is approximately satisfied. We examine this observation using Grad-CAM (Selvaraju et al., 2017), which visualizes the contributing features that the model extracts from the image (see Fig. 4 in Appendix J). The same check is applied to the pair of objects below.

Triangle vs. Pentagon (Fig. 2(b)). In this case, the orientation of the two objects can be identified in similar ways, mainly based on the edges and vertices, so the columns of $C^{h}(x)$ are close to linearly dependent for different $x\in\mathcal{X}_{1}$. As a consequence, the necessary condition for the exact matching is violated.
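One way to quantify how close two columns are to linear dependence is to look at their cosine similarity and at the smallest singular value of the two-column matrix. The sketch below is only a diagnostic illustration: the helper `column_diagnostics` and the example vectors are hypothetical and are not taken from the experiments.

```python
import math


def column_diagnostics(c1, c2):
    """Return (|cosine similarity|, smallest singular value) of the matrix [c1, c2]."""
    a = sum(v * v for v in c1)             # squared norm of c1
    c = sum(v * v for v in c2)             # squared norm of c2
    b = sum(u * v for u, v in zip(c1, c2))  # inner product
    cosine = abs(b) / math.sqrt(a * c)
    # Smallest eigenvalue of the 2 x 2 Gram matrix [[a, b], [b, c]].
    lam_min = 0.5 * ((a + c) - math.sqrt((a - c) ** 2 + 4.0 * b * b))
    return cosine, math.sqrt(max(lam_min, 0.0))


# Hypothetical feature vectors for two classes (p = 2).
distinct = column_diagnostics([1.0, 0.0, 2.0, 0.0], [0.0, 3.0, 0.0, 1.0])
similar = column_diagnostics([1.0, 0.5, 2.0, 0.1], [1.1, 0.5, 1.9, 0.1])
print(distinct, similar)
```

A cosine close to one, or a smallest singular value close to zero, indicates columns that are nearly linearly dependent, which is the situation described for Triangle vs. Pentagon.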

We compare SSL with SL under different labeled sample sizes $n\in\{10,20,40,60,80,100\}$. For Triangle vs. Tangent Circles, as shown in Fig. 2(a), SSL consistently outperforms SL for small sample sizes (i.e., $n\leq 60$). The performance of SSL improves more slowly than that of SL for sufficiently large $n$, since the prediction error of SSL will be dominated by the population error instead of the estimation error. Recall that our finite-sample bound on the excess risk in Corollary 5.4 converges to the sum of the approximation error and an error term depending on $\lambda$ as $n\to\infty$. In contrast, SSL behaves very differently for Triangle vs. Pentagon. From Fig. 2(b), the accuracy of SSL improves more slowly as $n$ increases, potentially due to the large approximation error. Overall, SSL has almost no advantage over SL. This experiment shows that the rank condition is crucial for SSL.

6.2.2 Stylized MNIST (On the Low-Rank Approximation)

Figure 3: SSL (red) and SL (blue).

To examine how the redundancy affects the performance of SSL, we consider the same rotation prediction task for a stylized MNIST dataset illustrated in Fig. 3, where the density of the background pattern varies randomly. A key observation is that the dot pattern does not help identify the image orientation, so it is not encoded into the orientation variable as redundancy. On the contrary, one can tell the orientation of the image simply by the orientation of the dash pattern, so the pretext task will extract features from the dash pattern as redundant information. Again, we use Grad-CAM to visualize our observation in Fig. 7 in Appendix K. Consequently, a dense dash pattern can lead to a poor low-rank approximation. In Fig. 6 in Appendix K, the performance of SSL is almost invariant to the density of the dot pattern, while the performance of SL drops as the density increases. In contrast, SSL is quite sensitive to a sparse dash pattern, and its performance gets worse as the density increases (see Fig. 3). We have also tested the dot vs. dash patterns for the geometric shape images, and similar results are observed, as shown in Fig. 5 in Appendix J.

7 Discussion

Many important questions remain to be studied and we list a few in this section. One natural next step is to study nonlinear function classes for the downstream task and characterize the corresponding sufficient and necessary conditions. Since our theoretical results can potentially provide guidance for developing SSL procedures, especially for designing pretext tasks, it would be worthwhile to design systematic and extensive experiments to better bridge the theories and practical designs. Besides the superior performance under limited labeled samples, the other major advantage of SSL is that the learned representation can be useful for a diverse class of downstream tasks; a theoretical understanding of its ability to generalize to new tasks or unseen environments (e.g., by exploiting invariance (Du and Xiang, 2023b)) is of great importance.

References

  • Arora et al. (2019) Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
  • Candes and Tao (2007) Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. 2007.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Du and Xiang (2022) Kang Du and Yu Xiang. An invariant matching property for distribution generalization under intervened response. In 2022 30th European Signal Processing Conference (EUSIPCO), pages 1387–1391. IEEE, 2022.
  • Du and Xiang (2023a) Kang Du and Yu Xiang. Generalized invariant matching property via lasso. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023a.
  • Du and Xiang (2023b) Kang Du and Yu Xiang. Learning invariant representations under general interventions on the response. IEEE Journal on Selected Areas in Information Theory, 2023b.
  • Dutta and Li (2017) Aritra Dutta and Xin Li. On a problem of weighted low-rank approximation of matrices. SIAM Journal on Matrix Analysis and Applications, 38(2):530–553, 2017.
  • Efron (1986) Bradley Efron. How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81(394):461–470, 1986.
  • Gidaris et al. (2018) Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
  • Gui et al. (2023) Jie Gui, Tuo Chen, Qiong Cao, Zhenan Sun, Hao Luo, and Dacheng Tao. A survey of self-supervised learning from multiple perspectives: Algorithms, theory, applications and future trends. arXiv preprint arXiv:2301.05712, 2023.
  • Hastie et al. (2009) Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
  • Hsu et al. (2012) Daniel Hsu, Sham M Kakade, and Tong Zhang. Random design analysis of ridge regression. In Conference on learning theory, pages 9–1. JMLR Workshop and Conference Proceedings, 2012.
  • Lee et al. (2021) Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34:309–323, 2021.
  • Ozbulak et al. (2023) Utku Ozbulak, Hyun Jung Lee, Beril Boga, Esla Timothy Anzaku, Homin Park, Arnout Van Messem, Wesley De Neve, and Joris Vankerschaver. Know your self-supervised learning: A survey on image-based generative and discriminative training. arXiv preprint arXiv:2305.13689, 2023.
  • Pathak et al. (2016) Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
  • Pearl (2009) Judea Pearl. Causality. Cambridge University Press, 2009.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
  • Saunshi et al. (2020) Nikunj Saunshi, Sadhika Malladi, and Sanjeev Arora. A mathematical exploration of why language models help solve downstream tasks. arXiv preprint arXiv:2010.03648, 2020.
  • Selvaraju et al. (2017) Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • Srebro and Jaakkola (2003) Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In Proceedings of the 20th international conference on machine learning (ICML-03), pages 720–727, 2003.
  • Teng et al. (2022) Jiaye Teng, Weiran Huang, and Haowei He. Can pretext-based self-supervised learning be boosted by downstream data? a theoretical analysis. In International Conference on Artificial Intelligence and Statistics, pages 4198–4216. PMLR, 2022.
  • Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.
  • Tosh et al. (2021) Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu. Contrastive learning, multi-view redundancy, and linear models. In Algorithmic Learning Theory, pages 1179–1206. PMLR, 2021.
  • Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010.
  • Wei et al. (2021) Colin Wei, Sang Michael Xie, and Tengyu Ma. Why do pretrained language models help in downstream tasks? an analysis of head and prompt tuning. Advances in Neural Information Processing Systems, 34:16158–16170, 2021.
  • Wolkowicz and Styan (1980) Henry Wolkowicz and George PH Styan. Bounds for eigenvalues using traces. Linear algebra and its applications, 29:471–506, 1980.
  • Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 649–666. Springer, 2016.
  • Zhang (2005) Tong Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural computation, 17(9):2077–2098, 2005.

Appendix A Conditional Mean Independence

Recall that when $C^{h}$ is a constant function, we have $X_{2}=\widetilde{C}Y+N$ and we require $\operatorname{\mathsf{E}}[N|X_{1},Y]=0$ rather than $\operatorname{\mathsf{E}}[N|Y]=0$.

Proposition A.1.

Model (4) holds with $C^{h}$ being a constant function if and only if $\operatorname{\mathsf{E}}[X_{2}|X_{1},Y]=\operatorname{\mathsf{E}}[X_{2}|Y]$.

Proof A.2.

First, assume that $X_{2}=\widetilde{C}Y+N$ holds with $\operatorname{\mathsf{E}}[N|X_{1},Y]=0$. It follows immediately that

\[
\operatorname{\mathsf{E}}[X_{2}|X_{1},Y]=\operatorname{\mathsf{E}}[\widetilde{C}Y+N|X_{1},Y]=\widetilde{C}Y
\]

and $\operatorname{\mathsf{E}}[X_{2}|Y]=\operatorname{\mathsf{E}}[\widetilde{C}Y+N|Y]=\widetilde{C}Y$, where we use the fact that $\operatorname{\mathsf{E}}[N|Y]=\operatorname{\mathsf{E}}[\operatorname{\mathsf{E}}[N|X_{1},Y]|Y]=0$. Thus we have $\operatorname{\mathsf{E}}[X_{2}|X_{1},Y]=\operatorname{\mathsf{E}}[X_{2}|Y]$. Now we prove the other direction. Assume that $\operatorname{\mathsf{E}}[X_{2}|X_{1},Y]=\operatorname{\mathsf{E}}[X_{2}|Y]$, then

\[
\operatorname{\mathsf{E}}[X_{2}|X_{1},Y]=\operatorname{\mathsf{E}}[X_{2}|Y]=\sum_{i=1}^{p}\operatorname{\mathsf{E}}[X_{2}|Y=e_{i}]\mathbbm{1}_{\bar{Y}=y_{i}}:=\widetilde{C}Y,
\]

where $\widetilde{C}$ has columns $\operatorname{\mathsf{E}}[X_{2}|Y=e_{i}]$, $i=1,\ldots,p$. Defining $N:=X_{2}-\widetilde{C}Y$, we obtain $\operatorname{\mathsf{E}}[N|X_{1},Y]=\operatorname{\mathsf{E}}[X_{2}|X_{1},Y]-\widetilde{C}Y=0$, which implies model (A.1) in the form of $X_{2}=\widetilde{C}Y+N$.
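The role of the requirement $\operatorname{\mathsf{E}}[N|X_{1},Y]=0$ (rather than $\operatorname{\mathsf{E}}[N|Y]=0$) can be checked exactly on a small discrete example. The sketch below is an illustration under assumptions made here: the joint pmf `P_X1_Y`, the scalar model with $d_{2}=1$ and $p=2$, and both noise laws are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X1, class index k), with Y = e_k; X1 and Y are dependent.
P_X1_Y = {(0, 0): F(3, 10), (0, 1): F(1, 10), (1, 0): F(2, 10), (1, 1): F(4, 10)}
C_TILDE = [F(2), F(-3)]  # d2 = 1, p = 2, so X2 = C_TILDE[k] + N


def cond_mean_x2(noise_pmf, event):
    """E[X2 | event]; noise_pmf(x1, k) maps noise values to probabilities."""
    num = den = F(0)
    for (x1, k), w_xy in P_X1_Y.items():
        if not event(x1, k):
            continue
        for n, w_n in noise_pmf(x1, k).items():
            num += w_xy * w_n * (C_TILDE[k] + n)
            den += w_xy * w_n
    return num / den


def gap(noise_pmf):
    """Largest |E[X2 | X1=x1, Y=e_k] - E[X2 | Y=e_k]| over all (x1, k)."""
    return max(
        abs(cond_mean_x2(noise_pmf, lambda a, b: (a, b) == (x1, k))
            - cond_mean_x2(noise_pmf, lambda a, b: b == k))
        for (x1, k) in P_X1_Y
    )


# Case 1: symmetric noise independent of (X1, Y), so E[N | X1, Y] = 0.
symmetric = lambda x1, k: {F(-1): F(1, 2), F(1): F(1, 2)}
# Case 2: noise is a function of (X1, Y) with E[N | Y] = 0 but E[N | X1, Y] != 0.
SHIFT = {(0, 0): F(2), (1, 0): F(-3), (0, 1): F(4), (1, 1): F(-1)}
shifted = lambda x1, k: {SHIFT[(x1, k)]: F(1)}

print(gap(symmetric))  # zero: conditional mean independence holds
print(gap(shifted))    # nonzero: it fails although E[N | Y] = 0
```

In the second case the noise averages to zero within each class but not within each $(X_{1},Y)$ cell, so $\operatorname{\mathsf{E}}[X_{2}|X_{1},Y]$ still depends on $X_{1}$.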

Appendix B Proof of Proposition 3.1

Proof B.1.

First, according to Definition 2.2, the exact matching is equivalent to

\[
\operatorname{\mathsf{E}}[Y|X_{1}]=\beta\operatorname{\mathsf{E}}[X_{2}|X_{1}] \tag{9}
\]

for some $\beta\in\mathbb{R}^{p\times d_{2}}$. By our assumption that $\operatorname{Cov}(\operatorname{\mathsf{E}}[Y|X_{1}])$ has full rank, $\beta$ has to have full rank when the exact matching holds. Plugging $\operatorname{\mathsf{E}}[X_{2}|X_{1}]=C^{h}(X_{1})\operatorname{\mathsf{E}}[Y|X_{1}]$ into (9), we have

\[
(\mathbf{I}_{p}-\beta C^{h}(X_{1}))\operatorname{\mathsf{E}}[Y|X_{1}]=\bm{0},
\]

for some $\beta\in\mathbb{R}^{p\times d_{2}}$. Equivalently,

\[
\mathbf{I}_{p}-\beta C^{h}(X_{1})=\bar{O}(X_{1}) \tag{10}
\]

for some $\bar{O}:\mathcal{X}_{1}\to\mathbb{R}^{p\times p}$ such that $\bar{O}(X_{1})\operatorname{\mathsf{E}}[Y|X_{1}]=\bm{0}$. Since $\beta$ has full rank, (10) is equivalent to

\[
\beta(C^{h}(X_{1})-\beta^{-1}\bar{O}(X_{1}))=\mathbf{I}_{p}, \tag{11}
\]

where $\beta^{-1}$ denotes the right inverse of $\beta$ and $O(X_{1}):=\beta^{-1}\bar{O}(X_{1})$ satisfies $O(X_{1})\operatorname{\mathsf{E}}[Y|X_{1}]=\bm{0}$.

Appendix C Proof of Lemma 3.2

Proof C.1.

For simplicity of notation, we prove the lemma without considering the orthogonal term, while the orthogonal term can be directly added to the final expression of $C^{h}$. Recall that $\beta$ has dimension $p\times d_{2}$. According to Proposition 3.1, it is sufficient to show that $\beta C^{h}(x)=\mathbf{I}_{p}$ is equivalent to (6). The “if” part is immediate since $\beta=[C^{-1},\bm{0}]A^{-1}$ will lead to $\beta C^{h}(x)=\mathbf{I}_{p}$, $\forall x\in\mathcal{X}_{1}$. In the following, we prove the other direction. When the exact matching holds, recall that the full rank of $\operatorname{Cov}(\operatorname{\mathsf{E}}[Y|X_{1}])$ implies that $\beta$ has full row rank since $p<d_{2}$. The QR decomposition of $\beta^{\top}$ gives $\beta=[\bar{C},\bm{0}]\bar{A}$, where $\bar{C}\in\mathbb{R}^{p\times p}$ is an invertible lower-triangular matrix and $\bar{A}\in\mathbb{R}^{d_{2}\times d_{2}}$ is an orthonormal matrix. Then, $\beta C^{h}(x)=\mathbf{I}_{p}$ implies $[\bar{C},\bm{0}]B(x)=\mathbf{I}_{p}$ with $B(x):=\bar{A}C^{h}(x)$. Due to the zero columns in $[\bar{C},\bm{0}]$, the first $p$ rows of $B(x)$ have to be $\bar{C}^{-1}$, while the other rows, denoted by $R(x)$, can be arbitrary. Therefore, we obtain

\[
C^{h}(x)=\bar{A}^{-1}B(x)=\bar{A}^{-1}\begin{bmatrix}\bar{C}^{-1}\\ R(x)\end{bmatrix}=\widetilde{A}^{-1}\begin{bmatrix}\mathbf{I}_{p}\\ R(x)\end{bmatrix},\quad\forall x\in\mathcal{X}_{1},
\]

where $\widetilde{A}^{-1}$ is the product of $\bar{A}^{-1}$ and some elementary matrices introduced to transform $\bar{C}^{-1}$ to an identity matrix $\mathbf{I}_{p}$.
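The “if” direction can be checked numerically with exact arithmetic. The sketch below assumes that (6) has the form $C^{h}(x)=A\begin{bmatrix}C\\ R(x)\end{bmatrix}$ with invertible $A$ and $C$, as suggested by the choice $\beta=[C^{-1},\bm{0}]A^{-1}$ in the proof; the specific matrices `A`, `C` and the function `R` are hypothetical.

```python
from fractions import Fraction as F


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def inverse(M):
    """Gauss-Jordan inverse over exact fractions (M must be invertible)."""
    n = len(M)
    aug = [[F(v) for v in row] + [F(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        piv = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[piv] = aug[piv], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


P, D2 = 2, 4
A = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 2]]  # invertible, det = 1
C = [[2, 1], [1, 1]]                                          # invertible, det = 1


def R(x):
    """Arbitrary (D2 - P) x P redundant block depending on x."""
    return [[x, x * x], [1 - x, 3 * x]]


def C_h(x):
    """Assumed form of (6): C^h(x) = A [C; R(x)]."""
    return matmul(A, C + R(x))


# beta = [C^{-1}, 0] A^{-1}, a P x D2 matrix.
beta = matmul([row + [F(0)] * (D2 - P) for row in inverse(C)], inverse(A))

identity = [[F(1), F(0)], [F(0), F(1)]]
checks = [matmul(beta, C_h(F(x))) == identity for x in (-2, 0, 1, 5)]
print(checks)
```

The zero block in $[C^{-1},\bm{0}]$ annihilates $R(x)$, which is why the same $\beta$ works for every $x$ regardless of how the redundant rows vary.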

Appendix D Low-Rank Approximation of Smooth Encoding Functions

In this section, we study how the smoothness of the encoding function enables the low-rank approximation. Specifically, we will construct the approximation in (8) explicitly with polynomial functions. For simplicity, we present the idea for second-order approximation, while the higher-order cases can be derived in a similar manner.

Let Ch:𝒳1d2×p:superscript𝐶subscript𝒳1superscriptsubscript𝑑2𝑝C^{h}:\mathcal{X}_{1}\to\mathbb{R}^{d_{2}\times p}italic_C start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT : caligraphic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT × italic_p end_POSTSUPERSCRIPT be a twice continuously differentiable matrix function, the Taylor expansion of its jthsuperscript𝑗𝑡j^{th}italic_j start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT column Cjhsubscriptsuperscript𝐶𝑗C^{h}_{j}italic_C start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT at a𝒳1𝑎subscript𝒳1a\in\mathcal{X}_{1}italic_a ∈ caligraphic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is given by

Cjh(x)superscriptsubscript𝐶𝑗𝑥\displaystyle C_{j}^{h}(x)italic_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( italic_x ) =Cjh(a)+(Cjh)(a)(xa)+[12(xa)(C1jh)′′(a)(xa)12(xa)(Cd2jh)′′(a)(xa)]+𝒪(xa3)absentsuperscriptsubscript𝐶𝑗𝑎superscriptsuperscriptsubscript𝐶𝑗𝑎𝑥𝑎matrix12superscript𝑥𝑎topsuperscriptsuperscriptsubscript𝐶1𝑗′′𝑎𝑥𝑎12superscript𝑥𝑎topsuperscriptsuperscriptsubscript𝐶subscript𝑑2𝑗′′𝑎𝑥𝑎𝒪superscriptnorm𝑥𝑎3\displaystyle=C_{j}^{h}(a)+(C_{j}^{h})^{\prime}(a)(x-a)+\begin{bmatrix}\frac{1% }{2}(x-a)^{\top}(C_{1j}^{h})^{\prime\prime}(a)(x-a)\\ \cdots\\ \frac{1}{2}(x-a)^{\top}(C_{d_{2}j}^{h})^{\prime\prime}(a)(x-a)\end{bmatrix}+% \mathcal{O}(||x-a||^{3})= italic_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( italic_a ) + ( italic_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_a ) ( italic_x - italic_a ) + [ start_ARG start_ROW start_CELL divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_x - italic_a ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( italic_C start_POSTSUBSCRIPT 1 italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT ( italic_a ) ( italic_x - italic_a ) end_CELL end_ROW start_ROW start_CELL ⋯ end_CELL end_ROW start_ROW start_CELL divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_x - italic_a ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( italic_C start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT ( italic_a ) ( italic_x - italic_a ) end_CELL end_ROW end_ARG ] + caligraphic_O ( | | italic_x - italic_a | | start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT )
:=Cjh(a)+Aj(a)ϕ(x)+𝒪(xa3),assignabsentsuperscriptsubscript𝐶𝑗𝑎subscript𝐴𝑗𝑎italic-ϕ𝑥𝒪superscriptnorm𝑥𝑎3\displaystyle:=C_{j}^{h}(a)+A_{j}(a)\phi(x)+\mathcal{O}(||x-a||^{3}),:= italic_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( italic_a ) + italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_a ) italic_ϕ ( italic_x ) + caligraphic_O ( | | italic_x - italic_a | | start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ) , (12)

where the $i^{\text{th}}$ row of $(C_{j}^{h})^{\prime}(a)\in\mathbb{R}^{d_{2}\times d_{1}}$ is the derivative of $C^{h}_{ij}$ evaluated at $x=a$ and $(C_{ij}^{h})^{\prime\prime}(a)$ is the Hessian matrix of $C_{ij}^{h}$ evaluated at $x=a$.
We represent $C_{j}^{h}(x)$ in matrix form in (12) by introducing
\[
\phi(x)=\big(x_{1}-a_{1},x_{2}-a_{2},\ldots,x_{d_{1}}-a_{d_{1}},(x_{1}-a_{1})^{2},(x_{1}-a_{1})(x_{2}-a_{2}),(x_{1}-a_{1})(x_{3}-a_{3}),\ldots,(x_{d_{1}-1}-a_{d_{1}-1})(x_{d_{1}}-a_{d_{1}}),(x_{d_{1}}-a_{d_{1}})^{2}\big)^{\top}\in\mathbb{R}^{d_{1}+d_{1}^{2}}
\]
and the coefficient matrix $A_{j}(a)\in\mathbb{R}^{d_{2}\times(d_{1}+d_{1}^{2})}$ consisting of the (scaled) first- and second-order derivatives. This allows us to approximate $C^{h}$ by

\[
C^{h}(x)=C^{h}(a)+\begin{bmatrix}A_{1}(a)\phi(x),A_{2}(a)\phi(x),\ldots,A_{p}(a)\phi(x)\end{bmatrix}+\mathcal{O}(\lVert x-a\rVert^{3}).
\]

Let the maximum rank of the matrices $\{A_{i}(a):i=1,\ldots,p\}$ be $s$. Then there exist $B\in\mathbb{R}^{d_{2}\times s}$ and $\{D_{i}(a)\in\mathbb{R}^{s\times(d_{1}+d_{1}^{2})}:i=1,\ldots,p\}$ such that

\[
C^{h}(x)=C^{h}(a)+B\begin{bmatrix}D_{1}(a)\phi(x),D_{2}(a)\phi(x),\ldots,D_{p}(a)\phi(x)\end{bmatrix}+\mathcal{O}(\lVert x-a\rVert^{3}),\tag{13}
\]

which is enabled by the decomposition $A_{i}(a)=BD_{i}(a)$ for each $i$. With the continuity of $C^{h}(x)$, we can fix $a$ so that $C^{h}(a)=\widetilde{C}$ by the mean value theorem. The high-order remainder term can be ignored when the third derivatives of $C^{h}$ are all zero. If this is not the case, we can include higher-order terms in $\phi$ until the remainder term is small enough. If the maximum rank $s$ is not small, one can still consider a low-rank approximation of $\{A_{i}(a)\}$, and there will be an additional approximation error term in (13).

Appendix E SSL under Gaussian Distribution

Even though our formulation focuses on the classification setting, the extension of our results to the Gaussian case is straightforward. Assume that $\{X_{1},X_{2},Y\}$ are jointly Gaussian, where $Y$ is a scalar Gaussian variable. Let $\Sigma_{XZ}:=\operatorname{Cov}(X,Z)$ for any random vectors $X$ and $Z$. Then, $X:=\operatorname{\mathsf{E}}[X_{2}|X_{1}]=\Sigma_{X_{2}X_{1}}\Sigma_{X_{1}X_{1}}^{-1}X_{1}$ and $\operatorname{\mathsf{E}}[Y|X_{1}]=\Sigma_{YX_{1}}\Sigma_{X_{1}X_{1}}^{-1}X_{1}$.
If $d_{2}\geq d_{1}$ and $\Sigma_{X_{2}X_{1}}$ has full rank, it is straightforward to see that an exact matching holds with $\beta=\Sigma_{YX_{1}}\Sigma_{X_{2}X_{1}}^{\dagger}$. Concretely, there exist $A\in\mathbb{R}^{d_{2}\times d_{1}}$ and $b\in\mathbb{R}^{d_{2}}$ such that

\[
X_{2}=h(X_{1},Y)+N=AX_{1}+bY+N,\tag{14}
\]

where $\operatorname{\mathsf{E}}[N|X_{1},Y]=\bm{0}$.
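
The exact-matching claim with $\beta=\Sigma_{YX_{1}}\Sigma_{X_{2}X_{1}}^{\dagger}$ stated above can be checked in one line. This is a sketch under two assumptions that are made here for illustration: $\Sigma_{X_{1}X_{1}}$ is invertible, and "full rank" for $\Sigma_{X_{2}X_{1}}\in\mathbb{R}^{d_{2}\times d_{1}}$ with $d_{2}\geq d_{1}$ means full column rank, so that $\Sigma_{X_{2}X_{1}}^{\dagger}\Sigma_{X_{2}X_{1}}=\mathbf{I}_{d_{1}}$. Then
\[
\beta X=\Sigma_{YX_{1}}\Sigma_{X_{2}X_{1}}^{\dagger}\,\Sigma_{X_{2}X_{1}}\Sigma_{X_{1}X_{1}}^{-1}X_{1}=\Sigma_{YX_{1}}\Sigma_{X_{1}X_{1}}^{-1}X_{1}=\operatorname{\mathsf{E}}[Y|X_{1}].
\]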

Since the encoding function is not well-defined when $Y$ is continuous, we formulate the low-rank approximation by

\[
\begin{aligned}
\varepsilon_{s}:=\min_{\hat{X}}\operatorname{\mathsf{E}}\left[\left\lVert X-\hat{X}\right\rVert^{2}\right]
&=\min_{B\in\mathbb{R}^{d_{2}\times d_{1}}:\operatorname{rank}(B)=s}\operatorname{\mathsf{E}}\left[\left\lVert\Sigma_{X_{2}X_{1}}\Sigma_{X_{1}X_{1}}^{-1}X_{1}-BX_{1}\right\rVert^{2}\right]\\
&=\min_{B\in\mathbb{R}^{d_{2}\times d_{1}}:\operatorname{rank}(B)=s}\left\lVert(\Sigma_{X_{2}X_{1}}\Sigma_{X_{1}X_{1}}^{-1}-B)\Sigma_{X_{1}X_{1}}^{1/2}\right\rVert^{2},
\end{aligned}
\]

which aims to find a weighted low-rank approximation for $\Sigma_{X_{2}X_{1}}\Sigma_{X_{1}X_{1}}^{-1}$.
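
Under some additional assumptions, this weighted problem reduces to an ordinary truncated singular value decomposition via the Eckart–Young–Mirsky theorem. The following is a sketch, not a statement from the main text: it assumes that $X_{1}$ has zero mean, that $\Sigma_{X_{1}X_{1}}$ is invertible, that the norm is the Frobenius norm, and that the matrix $M$ introduced below (a symbol used only here) has rank at least $s$. Write $M:=\Sigma_{X_{2}X_{1}}\Sigma_{X_{1}X_{1}}^{-1/2}$, with singular values $\sigma_{1}(M)\geq\sigma_{2}(M)\geq\cdots$ and rank-$s$ truncation $M_{s}$. Since $(\Sigma_{X_{2}X_{1}}\Sigma_{X_{1}X_{1}}^{-1}-B)\Sigma_{X_{1}X_{1}}^{1/2}=M-B\Sigma_{X_{1}X_{1}}^{1/2}$ and $\operatorname{rank}(B\Sigma_{X_{1}X_{1}}^{1/2})=\operatorname{rank}(B)$, a minimizer $B_{\mathrm{opt}}$ and the optimal value are
\[
B_{\mathrm{opt}}=M_{s}\Sigma_{X_{1}X_{1}}^{-1/2},\qquad\varepsilon_{s}=\lVert M-M_{s}\rVert_{F}^{2}=\sum_{j>s}\sigma_{j}(M)^{2}.
\]
In particular, under these assumptions $\varepsilon_{s}$ is non-increasing in $s$ and vanishes once $s$ reaches the rank of $M$.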

Appendix F Proof of Lemma 4.2

Proof F.1.

We have

\[
\begin{aligned}
\min_{\beta}\mathsf{error}_{\text{apx}}(\beta)
&=\min_{\beta}\operatorname{\mathsf{E}}\left[\lVert\operatorname{\mathsf{E}}[Y|X_{1}]-\beta C^{h}(X_{1})\operatorname{\mathsf{E}}[Y|X_{1}]\rVert^{2}\right]\\
&\leq\min_{\beta}\operatorname{\mathsf{E}}[\lVert\operatorname{\mathsf{E}}[Y|X_{1}]\rVert^{2}]\operatorname{\mathsf{E}}[\lVert\mathbf{I}_{p}-\beta(\widetilde{C}+B^{*}g^{*}(X_{1}))\rVert^{2}]+\varepsilon_{s}\lVert\beta\rVert^{2}\\
&\leq\min_{\beta}2(p+\operatorname{\mathsf{E}}[\lVert N\rVert^{2}])\left(\lVert\mathbf{I}_{p}-\beta\widetilde{C}\rVert^{2}+\lVert\beta B^{*}\rVert^{2}\right)+\varepsilon_{s}\lVert\beta\rVert^{2}\\
&\leq 2(p+\operatorname{\mathsf{E}}[\lVert N\rVert^{2}])\min_{\beta}\left(\lVert\mathbf{I}_{p}-\beta\widetilde{C}\rVert^{2}+\lVert\beta B^{*}\rVert^{2}+\varepsilon_{s}\lVert\beta\rVert^{2}\right),
\end{aligned}
\]

where the first inequality follows from the sub-multiplicativity of the matrix norm and the triangle inequality, and the last two inequalities are due to the triangle inequality and the fact that $\operatorname{\mathsf{E}}[\lVert\operatorname{\mathsf{E}}[Y|X_{1}]\rVert^{2}]=\operatorname{\mathsf{E}}[\lVert Y-N\rVert^{2}]\leq 2\operatorname{\mathsf{E}}[\lVert Y\rVert^{2}]+2\operatorname{\mathsf{E}}[\lVert N\rVert^{2}]\leq 2(p+\operatorname{\mathsf{E}}[\lVert N\rVert^{2}])$.
When $\varepsilon_{s}=0$, note that $[\widetilde{C},B^{*}]$ has rank at most $p+s\leq d_{2}$, thus there exists at least one solution $\beta\in\mathbb{R}^{p\times d_{2}}$ for the equation $\beta[\widetilde{C},B^{*}]=[\beta\widetilde{C},\beta B^{*}]=[\mathbf{I}_{p},\bm{0}]$. The expression of $\beta_{s}$ is a standard expression of ridge-type estimators.
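
To see why the minimizer of the last bound has a ridge form, one can differentiate the objective directly. This is only a sketch under assumptions made here for illustration: the norms are read as Frobenius norms and $\varepsilon_{s}>0$. The symbol $f$ is introduced only for this derivation, and the authoritative expression of $\beta_{s}$ is the one stated in Lemma 4.2. With
\[
f(\beta):=\lVert\mathbf{I}_{p}-\beta\widetilde{C}\rVert_{F}^{2}+\lVert\beta B^{*}\rVert_{F}^{2}+\varepsilon_{s}\lVert\beta\rVert_{F}^{2},
\qquad
\nabla f(\beta)=-2(\mathbf{I}_{p}-\beta\widetilde{C})\widetilde{C}^{\top}+2\beta B^{*}(B^{*})^{\top}+2\varepsilon_{s}\beta,
\]
setting $\nabla f(\beta)=\bm{0}$ gives $\beta\big(\widetilde{C}\widetilde{C}^{\top}+B^{*}(B^{*})^{\top}+\varepsilon_{s}\mathbf{I}_{d_{2}}\big)=\widetilde{C}^{\top}$, that is,
\[
\beta=\widetilde{C}^{\top}\big(\widetilde{C}\widetilde{C}^{\top}+B^{*}(B^{*})^{\top}+\varepsilon_{s}\mathbf{I}_{d_{2}}\big)^{-1}.
\]
Since $f$ is strictly convex for $\varepsilon_{s}>0$, this stationary point is the unique minimizer, and $\varepsilon_{s}$ plays the role of the ridge penalty.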

Appendix G Proof of Theorem 5.2

Lemma G.1 (Concentration on the covariance matrix (Lee et al., 2021)).

Let $\bm{X}\in\mathbb{R}^{n\times d}$ have i.i.d. rows, where each row is $\rho^{2}$-sub-Gaussian with covariance $\Sigma$, and let $B\in\mathbb{R}^{d\times m}$ be any matrix with rank $k$ that is independent of $\bm{X}$. For any $\delta\in(0,1)$, if $n\gg\rho^{4}(k+\log(\frac{1}{\delta}))$, then with probability at least $1-\frac{\delta}{10}$, we have

\[
0.9B^{\top}\Sigma B\preceq\frac{1}{n}B^{\top}\bm{X}^{\top}\bm{X}B\preceq 1.1B^{\top}\Sigma B.
\]

Lemma G.2 ((Lee et al., 2021)).

Let $\bm{P}\in\mathbb{R}^{n\times n}$ be a projection matrix and let $\bm{Z}\in\mathbb{R}^{n\times d}$ be a matrix with i.i.d. rows, where each row of $\bm{Z}$ is mean-zero (conditioning on $\bm{P}$ being rank $k$) and $\sigma^{2}$-sub-Gaussian. For any $\delta\in(0,1)$, with probability at least $1-\delta$,

\[
\lVert\bm{P}\bm{Z}\rVert\lesssim\sigma\sqrt{k(1+\log(k/\delta))}.
\]

G.1 Technical Lemmas

Lemma G.3.

Under Assumptions 3 and 4, for any $s$ such that $1\leq s\leq d_{2}-p$,

  • $\widetilde{\Sigma}$ satisfies $\widetilde{\Sigma}\preceq\tilde{a}(1+\varepsilon_{s})\Sigma$ for some $\tilde{a}\geq 0$;

  • $\bar{\Sigma}$ satisfies $\bar{\Sigma}\preceq\bar{a}\varepsilon_{s}\Sigma$ for some $\bar{a}\geq 0$.

Proof G.4.

To prove $A\preceq aB$ for symmetric matrices $A,B\in\mathbb{R}^{d\times d}$ and $a\geq 0$, one can simply prove $\lambda_{j}(A)\leq a\lambda_{j}(B),\ \forall j$. This holds immediately for the zero eigenvalues $\lambda_{j}(B)$ if $\operatorname{rank}(A)\leq\operatorname{rank}(B)$. Therefore, by Assumption 4, we will focus on the case when $\Sigma$ has all positive eigenvalues. First,

\[
\varepsilon_{s}=\frac{1}{d_{2}}\operatorname{\mathsf{E}}[\lVert\widetilde{X}-X\rVert^{2}]\geq\frac{1}{d_{2}}\left\lvert\operatorname{\mathsf{E}}[\lVert\widetilde{X}\rVert^{2}]-\operatorname{\mathsf{E}}[\lVert X\rVert^{2}]\right\rvert=\frac{1}{d_{2}}\left\lvert\operatorname{tr}(\widetilde{\Sigma}-\Sigma)\right\rvert.
\]

In the following, we will use the fact that $\lambda_{\text{min}\neq 0}(A)\mathbf{I}_{d}\preceq A\preceq\lambda_{\text{max}}(A)\mathbf{I}_{d}$ for any symmetric matrix $A\in\mathbb{R}^{d\times d}$. Using Assumption 3,

\[
\widetilde{\Sigma}-\Sigma\preceq\lambda_{\text{max}}(\widetilde{\Sigma}-\Sigma)\mathbf{I}_{d_{2}}\preceq\frac{c_{1}}{d_{2}}\lvert\operatorname{tr}(\widetilde{\Sigma}-\Sigma)\rvert\mathbf{I}_{d_{2}}\preceq c_{1}\varepsilon_{s}\mathbf{I}_{d_{2}},
\]

which implies,

\[
\lambda_{i}(\widetilde{\Sigma})\leq\lambda_{i}(\Sigma)+c_{1}\varepsilon_{s}\leq\lambda_{i}(\Sigma)+\frac{\lambda_{i}(\Sigma)}{\lambda_{\text{min}\neq 0}(\Sigma)}c_{1}\varepsilon_{s}=\left(1+\frac{c_{1}\varepsilon_{s}}{\lambda_{\text{min}\neq 0}(\Sigma)}\right)\lambda_{i}(\Sigma)\leq\tilde{a}(1+\varepsilon_{s})\lambda_{i}(\Sigma),
\]

for every $i$ and $\tilde{a}\geq\max\left(1,\frac{c_{1}}{\lambda_{\text{min}\neq 0}(\Sigma)}\right)$. This immediately leads to $\widetilde{\Sigma}\preceq\tilde{a}(1+\varepsilon_{s})\Sigma$.

Finally, recall the fact that $\varepsilon_{s}=\frac{1}{d_{2}}\cdot\operatorname{tr}(\bar{\Sigma})$. By Assumption 3, we have

\[
\bar{\Sigma}\preceq\lambda_{\text{max}}(\bar{\Sigma})\mathbf{I}_{d_{2}}\preceq\frac{c_{2}}{d_{2}}\operatorname{tr}(\bar{\Sigma})\mathbf{I}_{d_{2}}=c_{2}\varepsilon_{s}\mathbf{I}_{d_{2}}\preceq\frac{c_{2}}{\lambda_{\text{min}\neq 0}(\Sigma)}\varepsilon_{s}\Sigma:=\bar{a}\varepsilon_{s}\Sigma.
\]

G.2 Proof of Theorem 5.2

Proof G.5.

First, recall that $\beta^{*}\in\arg\min_{\beta}\mathsf{error}_{\text{apx}}(\beta)$ and the shorthand $X:=\operatorname{\mathsf{E}}[X_{2}|X_{1}]$. By the triangle inequality, we have

\[
\mathcal{R}(\hat{\beta}_{0})=\operatorname{\mathsf{E}}[\lVert\operatorname{\mathsf{E}}[Y|X_{1}]-\hat{\beta}_{0}X\rVert^{2}]\leq\mathsf{error}_{\text{apx}}^{*}+\operatorname{\mathsf{E}}[\lVert(\beta^{*}-\hat{\beta}_{0})X\rVert^{2}].
\]

Denote $a(X_{1}):=\operatorname{\sf E}[Y|X_{1}]-\beta^{*}X$.

Recall that $X=\widetilde{X}+\bar{X}$. Then, we can write $Y=\beta^{*}\widetilde{X}+\beta^{*}\bar{X}+a(X_{1})+N$, with $N$ satisfying $\operatorname{\sf E}[N|X_{1}]=0$ by the tower property. The definition of $\hat{\beta}_{0}$ implies

\[
\lVert\bm{Y}-\bm{X}\hat{\beta}_{0}^{\top}\rVert^{2}\leq\lVert\bm{Y}-\bm{X}(\beta^{*})^{\top}\rVert^{2}=\lVert a(\bm{X}_{1})+\bm{N}\rVert^{2}.
\]

By rearranging the terms, we get

\begin{align*}
\lVert\bm{X}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert^{2} &\leq\langle a(\bm{X}_{1}),\bm{\tilde{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle-\langle\bm{N},\bm{\tilde{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle\\
&\quad+\langle a(\bm{X}_{1}),\bm{\bar{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle-\langle\bm{N},\bm{\bar{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle.
\end{align*}

We bound the first two inner products in the following, and the other two follow similarly. First,

\begin{align*}
\langle a(\bm{X}_{1}),\bm{\tilde{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle &=\langle\widetilde{\Sigma}^{-1/2}\bm{\tilde{X}}^{\top}a(\bm{X}_{1}),\widetilde{\Sigma}^{1/2}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle\\
&\leq\lVert\widetilde{\Sigma}^{-1/2}\bm{\tilde{X}}^{\top}a(\bm{X}_{1})\rVert\lVert\widetilde{\Sigma}^{1/2}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert\\
&\leq 1.1\tilde{a}\tilde{b}\sqrt{1+\varepsilon_{s}}\sqrt{n(p+s)}\lVert\Sigma^{1/2}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert\\
&\lesssim\sqrt{1+\varepsilon_{s}}\sqrt{p+s}\lVert\bm{X}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert, \tag{15}
\end{align*}

where the inequality $\lVert\widetilde{\Sigma}^{-1/2}\bm{\tilde{X}}^{\top}a(\bm{X}_{1})\rVert\leq 1.1\tilde{b}\sqrt{n(p+s)}$ is due to Assumption 2 and the covariance concentration in Lemma G.1, and the last inequality is simply due to the covariance concentration. The replacement of $\widetilde{\Sigma}$ by $\Sigma$ is by Lemma G.3. Let $\bm{P}_{\bm{\tilde{X}}}$ denote the projection matrix defined with respect to $\bm{\tilde{X}}$. Then we have

\begin{align*}
\langle\bm{N},\bm{\tilde{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle &=\langle\bm{P}_{\bm{\tilde{X}}}\bm{N},\bm{\tilde{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle\\
&\leq\lVert\bm{P}_{\bm{\tilde{X}}}\bm{N}\rVert\lVert\bm{\tilde{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert\\
&\lesssim\sigma\sqrt{1+\varepsilon_{s}}\sqrt{(p+s)\left(1+\log\frac{p+s}{\delta}\right)}\lVert\bm{X}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert, \tag{16}
\end{align*}

where the last bound is due to Lemma G.2 and the replacement of $\bm{\tilde{X}}$ by $\bm{X}$ follows from the covariance concentration and Lemma G.3. Since we make no assumptions on the rank of $\bar{\Sigma}$, it has rank at most $d_{2}$. Similarly, we get

\begin{align*}
\langle a(\bm{X}_{1}),\bm{\bar{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle &\lesssim\sqrt{\varepsilon_{s}}\sqrt{d_{2}}\lVert\bm{X}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert, \tag{17}\\
\langle\bm{N},\bm{\bar{X}}(\beta^{*}-\hat{\beta}_{0})^{\top}\rangle &\lesssim\sigma\sqrt{\varepsilon_{s}}\sqrt{d_{2}\left(1+\log\frac{d_{2}}{\delta}\right)}\lVert\bm{X}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert. \tag{18}
\end{align*}

Combining (15), (16), (17), and (18) yields

\[
\lVert\bm{X}(\beta^{*}-\hat{\beta}_{0})^{\top}\rVert\lesssim\sigma\sqrt{1+\varepsilon_{s}}\sqrt{(p+s)\left(1+\log\frac{p+s}{\delta}\right)}+\sigma\sqrt{\varepsilon_{s}}\sqrt{d_{2}\left(1+\log\frac{d_{2}}{\delta}\right)}.
\]

Finally, the covariance concentration implies

\[
\operatorname{\sf E}[\lVert(\beta^{*}-\hat{\beta}_{0})X\rVert^{2}]\lesssim(1+\varepsilon_{s})\frac{(p+s)\left(1+\log\frac{p+s}{\delta}\right)}{n}\sigma^{2}+\varepsilon_{s}\frac{d_{2}\left(1+\log\frac{d_{2}}{\delta}\right)}{n}\sigma^{2}.
\]

Appendix H Proof of Corollary 5.4

Lemma H.1.

Under Assumptions 3 and 4, for $\lambda>0$, there exists $c_{2}$ such that

\[
d_{\lambda}\leq c_{2}\left(1+\frac{\varepsilon_{s}}{\lambda}\right)(p+s).
\]

Remark H.2.

It is known that $d_{\lambda}=\operatorname{\sf E}[\lVert(\Sigma+\lambda{\bf I})^{-1/2}X\rVert^{2}]$ (Hsu et al., 2012). When $\varepsilon_{s}=0$, we have $\operatorname{\sf E}[\lVert(\Sigma+\lambda{\bf I})^{-1/2}X\rVert^{2}]\leq\text{rank}(\Sigma)=p+s$, where the equality holds when $\lambda=0$.
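The quantity $d_{\lambda}$ in this remark is easy to inspect numerically through its eigenvalue form $\sum_{i}\lambda_{i}/(\lambda+\lambda_{i})$, which is the expression used in the proof of Lemma H.1. The following is only a minimal sketch: the function name `effective_dimension` and both spectra are hypothetical choices made for illustration, with a tail of size $10^{-3}$ standing in for a small nonzero $\varepsilon_{s}$.

```python
def effective_dimension(eigenvalues, lam):
    """d_lambda = sum_i lambda_i / (lambda + lambda_i) over the eigenvalues of Sigma.

    Zero eigenvalues contribute nothing, so lam = 0 returns rank(Sigma).
    """
    return sum(ev / (lam + ev) for ev in eigenvalues if ev > 0)


# Hypothetical spectrum with p + s = 3 nonzero eigenvalues (the eps_s = 0 case).
exact = [2.0, 1.0, 0.5] + [0.0] * 7
# Same leading eigenvalues plus a small tail, mimicking eps_s > 0.
perturbed = [2.0, 1.0, 0.5] + [1e-3] * 7

for lam in (0.0, 0.01, 0.1, 1.0):
    print(lam, effective_dimension(exact, lam), effective_dimension(perturbed, lam))
```

In the exact low-rank case the value never exceeds the rank, whereas each tail eigenvalue $\epsilon$ adds $\epsilon/(\lambda+\epsilon)\leq\epsilon/\lambda$, so the tail matters more as $\lambda$ shrinks.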

Proof H.3.

Let $\{\lambda_{i}\}$ and $\{\tilde{\lambda}_{i}\}$ denote the eigenvalues of $\Sigma$ and $\widetilde{\Sigma}$, respectively. Recall that Assumptions 3 and 4 imply $\tilde{\lambda}_{i}-\lambda_{i}\leq c_{1}\varepsilon_{s}$, as shown in the proof of Lemma G.3. Then we have

\begin{align*}
d_{\lambda}=\sum_{i=1}^{d}\frac{\lambda_{i}}{\lambda+\lambda_{i}} &=\sum_{i=1}^{d}\frac{\tilde{\lambda}_{i}}{\lambda+\tilde{\lambda}_{i}}+\lambda\sum_{i=1}^{d}\frac{\lambda_{i}-\tilde{\lambda}_{i}}{(\lambda+\lambda_{i})(\lambda+\tilde{\lambda}_{i})}\\
&\leq\sum_{i=1}^{d}\frac{\tilde{\lambda}_{i}}{\lambda+\tilde{\lambda}_{i}}+c_{1}\varepsilon_{s}\sum_{i=1}^{d}\frac{1}{(\lambda+\lambda_{i})(\lambda+\tilde{\lambda}_{i})}\\
&=\sum_{i=1}^{d}\frac{\tilde{\lambda}_{i}}{\lambda+\tilde{\lambda}_{i}}+\frac{c_{1}\varepsilon_{s}}{\lambda\cdot\lambda_{\text{min}\neq 0}(\widetilde{\Sigma})}\sum_{i=1}^{d}\frac{\lambda\cdot\lambda_{\text{min}\neq 0}(\widetilde{\Sigma})}{(\lambda+\lambda_{i})(\lambda+\tilde{\lambda}_{i})},
\end{align*}

where $\frac{\lambda}{\lambda+\lambda_{i}}\leq 1$ for all $i$ and $\sum_{i=1}^{d}\frac{\tilde{\lambda}_{i}}{\lambda+\tilde{\lambda}_{i}}\leq p+s$ since $\text{rank}(\widetilde{\Sigma})\leq p+s$. Finally, we get

\[
d_{\lambda}\leq p+s+\frac{c_{1}\varepsilon_{s}(p+s)}{\lambda\cdot\lambda_{\text{min}\neq 0}(\widetilde{\Sigma})}\leq c_{2}\left(1+\frac{\varepsilon_{s}}{\lambda}\right)(p+s),
\]

where $c_{2}\geq\max\left\{1,\frac{c_{1}}{\lambda_{\text{min}\neq 0}(\widetilde{\Sigma})}\right\}$.
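The first equality in the chain for $d_{\lambda}$ in this proof rests on a termwise identity that is not spelled out. As a sketch, for a single index $i$ and $\lambda>0$, putting the two fractions over a common denominator gives

\[
\frac{\lambda_{i}}{\lambda+\lambda_{i}}-\frac{\tilde{\lambda}_{i}}{\lambda+\tilde{\lambda}_{i}}
=\frac{\lambda_{i}(\lambda+\tilde{\lambda}_{i})-\tilde{\lambda}_{i}(\lambda+\lambda_{i})}{(\lambda+\lambda_{i})(\lambda+\tilde{\lambda}_{i})}
=\frac{\lambda(\lambda_{i}-\tilde{\lambda}_{i})}{(\lambda+\lambda_{i})(\lambda+\tilde{\lambda}_{i})},
\]

and summing over $i=1,\ldots,d$ recovers the stated decomposition.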

Corollary 5.4 Under (Hsu et al., 2012, Conditions 2 and 4) and the assumptions in Lemma H.1, the excess risk of the downstream task can be upper bounded by

\[
\mathcal{R}(\hat{\beta}_{\lambda})\leq\mathsf{error}_{\text{apx}}^{*}+\operatorname{\sf E}[\lVert(\beta_{\lambda}-\beta^{*})X\rVert^{2}]+\mathcal{O}\left(\frac{p+s}{n}\left(1+\frac{\varepsilon_{s}}{\lambda}\right)\tilde{\sigma}^{2}\right),
\]

with high probability, where $\tilde{\sigma}^{2}:=\left(\operatorname{\sf E}[\lVert\operatorname{\sf E}[Y|X]-\beta_{\lambda}X\rVert^{2}]+\operatorname{\sf E}[\lVert(\beta_{\lambda}-\beta^{*})X\rVert^{2}]+\sigma^{2}\right)$.

We only outline the main steps and refer the readers to (Hsu et al., 2012, Theorem 16) for details. First, by the triangle inequality, we have $\mathcal{R}(\hat{\beta}_{\lambda})\leq\mathsf{error}_{\text{apx}}^{*}+\operatorname{\sf E}[\lVert(\hat{\beta}_{\lambda}-\beta^{*})X\rVert^{2}]$. The bound on the second term can be obtained as discussed below. The last term is simply due to $d_{2,\lambda}\leq d_{1,\lambda}\leq c_{2}(1+\frac{\varepsilon_{s}}{\lambda})(p+s)$ by Lemma H.1, where $d_{l,\lambda}=\sum_{j=1}^{d}\left(\frac{\lambda_{j}}{\lambda_{j}+\lambda}\right)^{l}$ for $l\in\{1,2\}$. Observe that the ridge estimator with $Y_{j}=\mathbbm{1}_{\bar{Y}=y_{j}}$ as the target variable is equivalent to the $j^{th}$ row of the ridge estimator with $Y$ as the target variable. Thus we can provide a bound on $\operatorname{\sf E}[\lVert(\hat{\beta}_{\lambda,j}-\beta_{j}^{*})X\rVert^{2}]$ for each $j\in\{1,\ldots,p\}$ according to (Hsu et al., 2012, Theorem 16).
Summing up the inequalities gives $\operatorname{\sf E}[\lVert(\hat{\beta}_{\lambda}-\beta^{*})X\rVert^{2}]\leq\operatorname{\sf E}[\lVert(\beta_{\lambda}-\beta^{*})X\rVert^{2}]+\operatorname{\sf E}[\lVert(\hat{\beta}_{\lambda}-\beta_{\lambda})X\rVert^{2}]$, where $\operatorname{\sf E}[\lVert(\hat{\beta}_{\lambda}-\beta_{\lambda})X\rVert^{2}]$ is upper bounded in (Hsu et al., 2012, Theorem 16).

Appendix I Synthetic Data: Details of the Data Generation

Let $X_{1}\sim\mathcal{N}(\bm{0},I_{d_{1}})$. Note that $\operatorname{\sf E}[\lVert X_{1}\rVert]=\sqrt{2}\frac{\Gamma(5.5)}{\Gamma(5)}\approx 3.08$, where $\Gamma(\cdot)$ is the Gamma function. The label $\bar{Y}$ is determined by $\lVert X_{1}\rVert$ as follows: $\bar{Y}=0$ when $\lVert X_{1}\rVert<2.5$; $\bar{Y}=1$ when $2.5\leq\lVert X_{1}\rVert<3.5$; $\bar{Y}=2$ when $\lVert X_{1}\rVert\geq 3.5$. Then, let $X_{2}=(A+Bg(X_{1}))Y+N$, where $A$ and $B$ are matrices with i.i.d. entries from $\mathrm{Unif}[-2,2]$ and $N\sim\mathcal{N}(\bm{0},{\bf I}_{d_{2}})$.
The $(i,j)^{th}$ element of $g(X_{1})$ is given by $(\max_{k}(X_{1,k}))\cdot\sin(\frac{2\pi i}{s}\min_{k}(X_{1,k})+\frac{2\pi j}{p})$. The sample sizes of the pretext training data and testing data are $10^{4}$ and $10^{3}$, respectively. The MLPs used for the pretext task and two SL procedures all have two fully connected hidden layers with ReLU activation. The batch size is $32$, the number of epochs is $10$, and the learning rate is $0.001$.
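
The generation procedure above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used for the experiments. Several quantities are assumptions that the text does not fix: $d_{1}=10$ is inferred from the ratio $\Gamma(5.5)/\Gamma(5)$ (the mean of a chi distribution with $10$ degrees of freedom), $Y$ is taken to be the one-hot encoding of $\bar{Y}$, $g(X_{1})$ is taken to be an $s\times p$ matrix with $p$ equal to the number of classes and with indices $i,j$ starting at $1$, and the values `d2=20` and `s=5` are hypothetical.

```python
import numpy as np
from math import gamma, sqrt

def expected_norm(d1):
    # Mean of a chi distribution with d1 degrees of freedom.
    return sqrt(2.0) * gamma((d1 + 1) / 2) / gamma(d1 / 2)

def g(x1, s, p):
    # (i, j) entry: max_k(x1_k) * sin(2*pi*i/s * min_k(x1_k) + 2*pi*j/p),
    # with 1-based i = 1..s and j = 1..p (indexing convention is an assumption).
    i = np.arange(1, s + 1)[:, None]
    j = np.arange(1, p + 1)[None, :]
    return x1.max() * np.sin(2 * np.pi * i / s * x1.min() + 2 * np.pi * j / p)

def generate(n, d1=10, d2=20, s=5, num_classes=3, seed=0):
    # d2, s, and the shapes of A and B are hypothetical choices for illustration.
    rng = np.random.default_rng(seed)
    A = rng.uniform(-2, 2, size=(d2, num_classes))
    B = rng.uniform(-2, 2, size=(d2, s))
    X1 = rng.standard_normal((n, d1))
    norms = np.linalg.norm(X1, axis=1)
    # 0 if norm < 2.5, 1 if 2.5 <= norm < 3.5, 2 if norm >= 3.5.
    labels = np.digitize(norms, [2.5, 3.5])
    Y = np.eye(num_classes)[labels]  # one-hot labels (assumption)
    X2 = np.empty((n, d2))
    for t in range(n):
        M = A + B @ g(X1[t], s, num_classes)
        X2[t] = M @ Y[t] + rng.standard_normal(d2)
    return X1, X2, labels
```

Calling `generate(10**4)` and `generate(10**3, seed=1)` would give pretext and test sets of the sizes stated above. Note that the second call redraws $A$ and $B$ from its own seed, so a faithful reproduction would draw them once and share them between the two sets.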

Appendix J Further Details of the Computer Vision Task

The sample sizes of the pretext task and testing are fixed to be $20000$ and $1000$, respectively. The edge of the triangle is sampled from $\mathrm{Unif}[8,32]$, the radius of the circles is sampled from $\mathrm{Unif}[5,10]$, and the pentagon is drawn within a circle with radius sampled from $\mathrm{Unif}[8,64/3]$. The sizes are chosen to ensure that the average areas of the objects are similar. The pretext task and SL both use convolutional neural networks (CNNs) consisting of two convolution layers and two fully connected layers with ReLU activation. The learned representation is obtained from the second convolution layer of the CNN, which has dimension $d_{2}=12544$. Since the rotated image $X_{2}$ has the same label as $X_{1}$, we also use the rotated images as additional labeled data for the downstream task and SL. In the pretext task, the batch size is $64$, the number of epochs for training is $15$, and the learning rate is $0.001$. For SL, the batch size is reduced to $32$ since the sample size is much smaller. For ridge regression, we choose the shrinkage parameter $\lambda$ from $200$ numbers evenly spaced on a log scale over $[0.001,100]$ with $5$-fold cross-validation.
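
The selection of $\lambda$ described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' implementation: the targets are taken to be one-hot labels, the validation criterion is mean squared error, and the penalty is written as $\lambda\lVert W\rVert^{2}$ without any scaling by the sample size. The helper names `ridge_fit` and `select_lambda` are hypothetical. The dual form of the ridge solution is used because the feature dimension is much larger than the number of labeled samples.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    # Dual form W = X^T (X X^T + lam I)^{-1} Y; only an n x n system is
    # solved, which is cheap when n is much smaller than the feature dimension.
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), Y)

def select_lambda(X, Y, lambdas, n_folds=5, seed=0):
    # K-fold cross-validation over the grid; returns the best lambda and
    # the average validation error for every grid point.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), n_folds)
    cv_err = np.zeros(len(lambdas))
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[m] for m in range(n_folds) if m != k])
        for idx, lam in enumerate(lambdas):
            W = ridge_fit(X[train], Y[train], lam)
            cv_err[idx] += np.mean((X[val] @ W - Y[val]) ** 2) / n_folds
    return lambdas[int(np.argmin(cv_err))], cv_err

# 200 values evenly spaced on a log scale over [0.001, 100].
lambdas = np.logspace(-3, 2, 200)
```

With features `X` of shape `(n, d2)` and one-hot labels `Y`, `select_lambda(X, Y, lambdas)` returns the grid value with the smallest cross-validated error.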

In Section 6.2, we explain that the unsatisfactory performance of SSL for Triangle vs. Pentagon is potentially due to the analogous characteristics (i.e., the edges and vertices) that determine the orientation of the object. From a theoretical perspective, the full rank condition on $C^{h}(x)$ (which is a necessary condition for exact matching) is violated since the columns of $C^{h}(x)$ are approximately linearly dependent. This does not happen for Triangle vs. Tangent Circles since the orientation of the circles is determined by curved edges. To further support our observation, we use Grad-CAM (Selvaraju et al., 2017) to visualize the contributing features that the pretext model uses for rotation prediction. To highlight the major components, we only show pixels of the heatmap with top-20\% intensity in Fig. 4.
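
The top-20\% thresholding of the heatmap can be illustrated with a short NumPy sketch. It assumes that the Grad-CAM heatmap is already available as a 2-D array and that "top-20\% intensity" means pixels at or above the 80th percentile of the heatmap values; the function name `keep_top_fraction` is hypothetical.

```python
import numpy as np

def keep_top_fraction(heatmap, fraction=0.2):
    # Zero out every pixel below the (1 - fraction) quantile of the heatmap,
    # so that only the most intense pixels remain visible.
    threshold = np.quantile(heatmap, 1.0 - fraction)
    return np.where(heatmap >= threshold, heatmap, 0.0)
```

The masked array can then be overlaid on the input image in place of the full heatmap.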

[Uncaptioned image]
Figure 4: Visualization of the contributing features for rotation prediction in the computer vision task.

Similarly to the stylized MNIST dataset, we add dot and dash patterns to the image background. The space in patterns is sampled from $\mathrm{Unif}[8,32]$.

[Uncaptioned image]
Figure 5: Geometric shape images with dot vs. dash background. SSL (red) and SL (blue).

Appendix K Experiments on the MNIST Dataset

The sample sizes for the pretext task and testing are $45000$ and $10000$, respectively. We compare the performance of SSL and SL under different labeled sample sizes $\{50,100,200,400\}$. The space $d$ in the sparse and dense patterns is sampled from $\mathrm{Unif}[3,8]$ and $\mathrm{Unif}[8,15]$, respectively. We randomly shift each pattern by $\mathrm{Unif}[0,d]$ to avoid the position of the pattern being correlated with the image orientation. We use the same CNN configuration as in the geometric shape task.
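
The random shift of the background pattern can be sketched as below. This is only an illustration under several assumptions that the text does not specify: the spacing $d$ and the shift are rounded to integer pixels, each dot is a single pixel, and the shift is drawn independently for rows and columns. The function name `dot_background` and the image size are hypothetical.

```python
import numpy as np

def dot_background(size, d_low, d_high, rng):
    # Integer spacing d drawn from {d_low, ..., d_high} (discretization is an assumption).
    d = int(rng.integers(d_low, d_high + 1))
    # Random shift in {0, ..., d-1}; a shift of d equals a shift of 0
    # because the pattern is periodic with period d.
    shift_r, shift_c = rng.integers(0, d, size=2)
    img = np.zeros((size, size))
    img[shift_r::d, shift_c::d] = 1.0  # one dot every d pixels in each direction
    return img, d, (int(shift_r), int(shift_c))
```

For example, `dot_background(28, 3, 8, np.random.default_rng(0))` produces a $28\times 28$ background whose grid origin is not tied to the image corner, so the position of the pattern carries no information about the rotation applied to the image.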

[Uncaptioned image]
Figure 6: The original MNIST dataset and four stylized versions of MNIST. SSL (red) and SL (blue).
[Uncaptioned image]
Figure 7: A visualization of the contributing features for the MNIST dataset with dash vs. dot background.