Low-Rank Approximation of Structural Redundancy for Self-Supervised Learning
Abstract
We study the data-generating mechanism for reconstructive SSL to shed light on its effectiveness. With an infinite amount of labeled samples, we provide a sufficient and necessary condition for perfect linear approximation. The condition reveals a full-rank component that preserves the label classes of , along with a redundant component. Motivated by the condition, we propose to approximate the redundant component by a low-rank factorization and measure the approximation quality by introducing a new quantity , parameterized by the rank of factorization . We incorporate into the excess risk analysis under both linear regression and ridge regression settings, where the latter regularization approach is to handle scenarios when the dimension of the learned features is much larger than the number of labeled samples for downstream tasks. We design three stylized experiments to compare SSL with supervised learning under different settings to support our theoretical findings.
1 Introduction
Reconstructive self-supervised learning (SSL) has been highly successful in various fields (Pathak et al., 2016; Vincent et al., 2010; Radford et al., 2018; Devlin et al., 2018), where the theme is to extract representations from unlabeled data that are potentially useful for downstream tasks. One of the major advantages of SSL is its significantly reduced dependency on labeled data. Despite abundant empirical evidence, the theoretical understanding of the performance of SSL under limited labeled data is still insufficient.
In reconstructive SSL, a pretext task is designed to predict a target with input features , which yields the learned representation . Then, the downstream task is to predict the target using . Whether the learned representation is useful for the downstream task relies on the connections between the pretext and downstream tasks. To bridge the pretext and downstream tasks, the conditional independence (CI) assumption, namely , has been studied in the seminal work (Lee et al., 2021). For the classification setting, they show that CI is a sufficient condition for a linear predictor to be optimal for the downstream task, that is, can linearly predict perfectly with an infinite number of samples available for the task. Motivated by this key observation, they provide theoretical guarantees showing the superior sample complexity of SSL under general approximate conditional independence settings.
However, a fundamental theoretical question for understanding reconstructive SSL still remains:
What is the sufficient and necessary condition on , in the classification setting, for a linear predictor to be optimal for the downstream task?
To address this question, it is helpful to express as , where is for an arbitrary supervised learning task and . With this expression, roughly speaking, the target can be decoded from if is invertible in some sense. We formalize this notion of invertibility by focusing on classification problems. Our formulation allows for general dependency between and when conditioning on . Thus there are features in the learned representation that are redundant for the prediction of . For instance, for image classification problems, the redundant features may come from the background in the image; if the object of interest is surrounded by other objects in the background, a pretext task of predicting blocked patches of the image may mistakenly extract too many features from the background (Pathak et al., 2016). Without any constraints, a large percentage of redundant features can potentially make SSL fail. To this end, we introduce a quantity , indexed by rank , for a low-rank approximation of the redundancy in the learned representation. We show how affects the performance of SSL through both our theoretical analysis and our experiments.
Our main contributions are summarized below.
1. Under the classification setting, we characterize in Section 3 a sufficient and necessary condition for a linear predictor to be optimal in the downstream task.
2. In Section 4, we introduce a low-rank approximation quantity to characterize the redundancy in the learned representation.
3. Based on the low-rank approximation quantity, we derive finite sample bounds on the excess risk and the corresponding sample complexity for both ordinary least squares and ridge regression estimators in Section 5.
4. In Section 6.1, we design a simulation setting to demonstrate the effectiveness of the low-rank approximation. Our sufficient and necessary condition is partially verified through two computer vision tasks in Section 6.2.
1.1 Related work
Reconstructive SSL is focused on recovering deliberately concealed information in the data. In computer vision, examples include the prediction of blocked patches (Pathak et al., 2016), recovering the color (Zhang et al., 2016), denoising (Vincent et al., 2010), and identifying the rotated angle (Gidaris et al., 2018), while the simple scheme of next word prediction is widely adopted in NLP (Radford et al., 2018; Devlin et al., 2018). From the theoretical perspective, (Saunshi et al., 2020) and (Wei et al., 2021) study how pre-trained language models yield useful representation for downstream tasks. For computer vision tasks, (Pathak et al., 2016) provides a theoretical understanding of features learned by auto-encoders under a multi-view data assumption. Under a general formulation of reconstructive SSL, (Lee et al., 2021) shows that CI is sufficient for a linear predictor to be optimal in the downstream task and provides finite sample analysis. Since CI often fails to hold in practical settings, (Teng et al., 2022) proposes to modify the unlabeled data to make CI hold. Their theoretical analysis suggests that the modification is hurtful rather than helpful for the performance of SSL. The other popular type of SSL is called contrastive SSL, where the goal is to learn representations that make different views of the same data point closer. The CI assumption has been adopted in (Arora et al., 2019; Tosh et al., 2021) to provide theoretical guarantees for contrastive learning. In the context of contrastive SSL, CI is a natural assumption since, ideally, two views are expected to share less information given the label. The literature on SSL is vast and we refer the readers to (Gui et al., 2023; Ozbulak et al., 2023) for detailed reviews.
1.2 Notation
Throughout the paper, denotes the norm for vectors or Frobenius norm for matrices. We use and to denote vectors or matrices of zeros and ones, respectively. For a full (column) rank matrix with , we use to denote its (left) pseudoinverse. Let denote the covariance matrix of a random vector and denote that of two random vectors and . We use to hide log factors and to hide constants in inequalities. We use to denote the identity matrix of size . A random vector is said to be -sub-Gaussian if and for any and such that .
2 Problem Formulation: Reconstructive SSL
Consider where and are the target variables for the pretext and downstream tasks, respectively, and is a vector of features shared by the two prediction tasks. We focus on the classification setting for the downstream task, i.e., is categorical. For regression problems, one can consider the continuous target variable being discretized to a set of values. We use to denote the original label variable and to denote its one-hot encoding with one class excluded to avoid multicollinearity as , and we will simply refer to as the one-hot encoding of throughout this work. We assume throughout the work. For simplicity, we assume that the optimal predictors for different classes of are not linearly dependent, i.e., has full rank; otherwise, certain classes of can be hidden to make it hold.
Concretely, we consider the following reconstructive SSL procedure.
1. Pretext task: Given unlabeled data, predict using under some function class , i.e., estimate .
2. Downstream task: Given labeled data, regress on the learned representation using simple regression functions such as linear or ridge regression.
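The two-stage procedure above can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the data-generating model, the polynomial features standing in for the pretext function class, and all names (`sample`, `fit_pretext`, `fit_downstream`, `lam`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10                      # dimension of the pretext target (assumed)
theta = rng.normal(size=d)  # assumed encoding vector: E[X2 | X1, Y] = theta * Y

def sample(n):
    # Binary label whose probability depends on a scalar feature x1.
    x1 = rng.normal(size=n)
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-4.0 * x1))).astype(float)
    x2 = np.outer(y, theta) + rng.normal(size=(n, d))
    return x1, x2, y

def poly(x1, degree=5):
    # Stand-in for a rich function class: polynomial features of x1.
    return np.stack([x1 ** p for p in range(degree + 1)], axis=1)

def fit_pretext(x1, x2):
    # Pretext task: least-squares estimate of psi(x1) = E[X2 | X1 = x1].
    coef, *_ = np.linalg.lstsq(poly(x1), x2, rcond=None)
    return lambda x: poly(x) @ coef

def fit_downstream(psi, y, lam):
    # Downstream task: ridge regression of the label on psi(X1), no intercept.
    gram = psi.T @ psi + lam * np.eye(psi.shape[1])
    return np.linalg.solve(gram, psi.T @ y)

x1_u, x2_u, _ = sample(20000)          # unlabeled data (labels discarded)
psi_hat = fit_pretext(x1_u, x2_u)

x1_l, _, y_l = sample(30)              # small labeled sample
w = fit_downstream(psi_hat(x1_l), y_l, lam=0.1)

x1_t, _, y_t = sample(5000)            # held-out evaluation data
accuracy = np.mean((psi_hat(x1_t) @ w > 0.5) == (y_t > 0.5))
print(f"downstream accuracy with 30 labels: {accuracy:.3f}")
```

In this toy model the label is encoded into the pretext target without redundancy, so the learned representation is essentially one-dimensional and a handful of labeled samples suffices for the downstream fit.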
Since there is often a large amount of unlabeled data and one can adopt deep neural networks to achieve universal approximation, we fix and focus on analyzing the downstream task. Due to the nature of the small (labeled) sample size of SSL, the function class for the downstream task is often assumed to have lower complexity compared to (e.g., smaller parameter space). For theoretical analysis, we consider the class of all linear functions for the downstream task similarly as in (Lee et al., 2021). In practice, the advantage of SSL over supervised learning (SL) is more significant when the labeled sample size is relatively small, in which case the dimension of can be larger than . To avoid the downstream task being ill-posed, we adopt the ridge estimator. To measure the gap between the SSL prediction and the optimal predictor in infinite and finite samples, respectively, we define the approximation error and excess risk.
Definition 2.1.
Define the approximation error of SSL as , where
(1)
with , the optimal predictor of given .
Note that can be ensured by a function class with universal approximation power such as deep neural networks.
Definition 2.2.
We say there is an exact matching between and given if .
For simplicity, we will omit the intercept throughout the work. The performance of the downstream task is usually quantified through the so-called excess risk defined with respect to the finite sample analysis. Denote . Let and be the labeled data, and denote the learned representation from pretraining. For the downstream task and , let
Definition 2.3.
The excess risk induced by the estimator is defined as .
The term “matching” can be viewed in the following sense: (1) the form of nonlinearity in should be captured by ; (2) the “redundant” nonlinearity in should be linearly dependent so that they can be removed through a linear transform. As a toy example, consider and and note they share the same quadratic term , while the sine functions in are redundant for predicting . Observe that SSL with extracts the quadratic term while eliminating the sine functions. In contrast, will not lead to an exact matching.
Remark 2.4.
The notion that predicting a subset of can be helpful for predicting is not limited to reconstructive SSL. For instance, in a series of recent papers (Du and Xiang, 2022, 2023a, 2023b), the authors have explored a similar direction from an invariance perspective for multi-environment domain adaptation, which has partially motivated this study.
3 Necessary and Sufficient Condition for Exact Matching
In an attempt to demystify the matching between the pretext and downstream tasks, we propose to identify the conditions on the generating mechanism of that enable an exact matching. The generating mechanism of in a supervised learning task is often complicated, and thus we make no assumptions on how is generated and focus on the interactions between and . Without loss of generality, we can write in the following form
(2)
where is the regression function of on and therefore the residual variable satisfies . The function captures how the label and feature are encoded into .
Equation (2) can be viewed from a causal perspective through a general structural causal model (SCM) (Pearl, 2009), ,
where is a vector of exogenous variables independent of . Since this general SCM suffers from identifiability issues, we focus on (2), observing that . It is important to note that (2) is valid even when there is no underlying causal graph over .
Recall that is the one-hot encoding of . Observe that an arbitrary function can be equivalently written as
(3)
where we use the fact that . This simple derivation implies a one-to-one correspondence between and , meaning that the general function model (2) can be expressed as
(4)
with . The role of the latent random matrix is to encode the label variable into , thus we call the encoding function. The expression in (4) implies the identity , which is equivalent to
(5)
for any such that . In words, the rows of are orthogonal to for . We call such an orthogonal term. For instance, the orthogonality holds when and each row of is . Equation (5) defines an equivalent class of encoding functions that results in the same pretext representation . This shows that such orthogonal terms do not affect the analysis of SSL, and thus we use to hide the orthogonal term (added to ) in equations throughout the paper.
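The one-to-one correspondence used in this section rests on a generic fact: any function of a categorical label is linear in a one-hot encoding of that label. The sketch below checks this numerically for three classes. The function `g`, the number of classes, and the treatment of the excluded class as a baseline term are assumptions made for illustration; none of these names come from the paper.

```python
import numpy as np

K = 3  # number of classes (assumed)

def g(x1, y):
    # Arbitrary hypothetical function of a feature x1 and a label y in {0, 1, 2}.
    return np.array([np.sin(x1) + y, x1 ** 2 * (y == 1), np.cos(y * x1)])

def one_hot_reduced(y):
    # One-hot encoding of y with the last class excluded.
    return np.eye(K)[y][: K - 1]

def encoding_matrix(x1):
    # Column j holds g(x1, j) - g(x1, K-1), the effect of class j
    # relative to the excluded class; the baseline is returned separately.
    base = g(x1, K - 1)
    theta = np.stack([g(x1, j) - base for j in range(K - 1)], axis=1)
    return theta, base

def g_linear(x1, y):
    # Same function, written as (matrix function of x1) times one-hot label.
    theta, base = encoding_matrix(x1)
    return theta @ one_hot_reduced(y) + base

for x1 in (-1.0, 0.3, 2.0):
    for y in range(K):
        assert np.allclose(g(x1, y), g_linear(x1, y))
```

The matrix returned by `encoding_matrix` plays the role of an encoding function in this toy setting: all dependence on the label enters through a linear map applied to its one-hot encoding.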
Proposition 3.1.
The exact matching in Definition 2.2 holds if and only if , for and some .
Therefore, in this formulation, finding an exact matching is equivalent to inverting the encoding function . Proposition 3.1 implies that the full rank of for every is a necessary condition for the exact matching. In the following lemma, we provide a sufficient and necessary condition for the exact matching through a full characterization of the invertibility of .
Lemma 3.2(sufficient and necessary condition for exact matching).
There is an exact matching between and given if and only if
(6)
for some invertible matrix , an arbitrary matrix function .
The identity mapping fully preserves each class of , and represents the redundancy encoded into . It is worth noting that redundancy refers to the features extracted from that are predictive for , but redundant for the prediction of (given the optimal predictor ). In our stylized MNIST experiment in Section 6.2.2, the dash pattern in the background is redundancy since it is useful for predicting the image orientation (i.e., ), but it contains no information about the label. In contrast, the dot pattern is not redundant, since it is independent of both the image orientation and the label. The lemma above reveals that the label should be encoded into through an invertible linear mixture (i.e., ) of the full label information and some redundancy. When is an identity matrix, the first rows of capture the full label information, thus the downstream task has a sparse solution . However, the solutions to the downstream task may not be sparse in general, and we handle this challenge in Section 4. Below are two examples with explicit forms of .
Example 3.3.
An important special case of model (4) is , where is a constant function.
In this case, the necessary and sufficient condition simplifies to the condition that has full rank. Observe that implies .
In Appendix A, we show that model is equivalent to , which we call conditional mean independence, which is weaker than CI, i.e., . The setting in Example 3.3 has been studied in (Lee et al., 2021) under CI. Despite the simplicity of conditional mean independence, it can be unrealistic in practical settings, since it requires that the label is fully encoded into with no redundant information (depending on ) as if the pretext and downstream tasks are two equivalent prediction tasks. Even though approximate conditional independence has been studied in (Lee et al., 2021), it is unclear if the approximation provides sufficient insights into explaining why and when SSL works (or fails), since conditional independence (or constant ) is not a necessary condition for exact matching.
Example 3.4(partially linear model).
Define an invertible matrix with a column partition as , where has columns, has columns such that , and has the rest of the columns. Let , where satisfies . Its encoding function is as derived below. The sufficient and necessary condition is immediately satisfied with and ,
where has all zero rows.
Even though conditional independence fails to hold since is not constant, according to Lemma 3.2, there is still an exact matching.
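The condition in Lemma 3.2 can also be checked numerically. The sketch below builds a representation as an invertible mixture of the label information stacked on top of an input-dependent redundancy term, and verifies that one fixed linear map recovers the optimal predictor. The names `A`, `gamma`, and `f_star`, the dimensions, and the specific redundancy function are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
k_minus_1, d = 2, 6            # label dimension (one class excluded), dim of psi
A = rng.normal(size=(d, d))    # assumed mixing matrix (invertible with probability one)

def f_star(x1):
    # Hypothetical optimal predictor: first k-1 class probabilities of a softmax.
    logits = np.array([x1, -x1, 0.0])
    p = np.exp(logits) / np.exp(logits).sum()
    return p[:k_minus_1]

def gamma(x1):
    # Arbitrary (d - k + 1) x (k - 1) matrix function encoding redundancy.
    return np.array([
        [np.sin(x1), 1.0],
        [x1, np.cos(x1)],
        [x1 ** 2, 0.5],
        [1.0, np.tanh(x1)],
    ])

def psi(x1):
    # Representation of the form A @ [I; gamma(x1)] @ f_star(x1).
    stacked = np.vstack([np.eye(k_minus_1), gamma(x1)])
    return A @ stacked @ f_star(x1)

# Linear decoder: the first k-1 rows of the inverse of A. It does not depend on x1.
W = np.linalg.inv(A)[:k_minus_1]

for x1 in np.linspace(-2.0, 2.0, 9):
    assert np.allclose(W @ psi(x1), f_star(x1))
```

The redundancy function here is far from constant, so the representation does not have the form of Example 3.3, yet the linear decoder is exact for every input.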
4 Structural Redundancy
The pretext representation is typically high-dimensional, designed to capture abundant information for various downstream tasks. Given limited labeled samples ( in our notation), the downstream task is a high-dimensional linear regression problem. Without any assumptions such as sparsity of the true coefficients (Tibshirani, 1996; Candes and Tao, 2007) or low effective dimension of the features (Zhang, 2005; Hsu et al., 2012), SSL may not perform well even if the exact matching holds. Since the true coefficients are not necessarily sparse as discussed in Section 3, we explore how low-rank structures in the redundancy (i.e., ) naturally lead to a low effective dimension. In particular, we adopt the definition of effective dimension from (Hsu et al., 2012) in the context of ridge regression (see details below); a closely related notion is called effective degrees of freedom (Efron, 1986; Hastie et al., 2009). Roughly speaking, the effective dimension measures the number of features that are not linearly correlated; when it is low, a small number of labeled samples can be sufficient for reliable estimation.
To simplify the notation, denote with covariance matrix . Let denote the eigenvalues of .
The population ridge estimator is given by
Dimension reduction is implicitly performed in ridge regression when an appropriate shrinkage parameter is chosen. The reduced dimension for a chosen can be quantified by the effective dimension, defined as for . Note that the bias of the ridge estimator increases monotonically as increases. When has exactly nonzero eigenvalues, is upper bounded by for any . Besides this special case, a low effective dimension can be achieved under a weak penalty (i.e., small ) when there is a large percentage of small eigenvalues.
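The behavior of the effective dimension described above can be illustrated numerically. The sketch assumes the standard form from the ridge regression literature, namely the sum over eigenvalues of eigenvalue / (eigenvalue + penalty); the function name `effective_dimension` and the two example spectra are hypothetical.

```python
import numpy as np

def effective_dimension(eigvals, lam):
    # Sum of eig / (eig + lam) over the covariance eigenvalues, for lam > 0.
    eigvals = np.asarray(eigvals, dtype=float)
    return float(np.sum(eigvals / (eigvals + lam)))

# Spectrum with exactly three nonzero eigenvalues out of fifty.
low_rank = np.array([5.0, 3.0, 1.0] + [0.0] * 47)
# Spectrum with two dominant eigenvalues and many small ones.
decaying = np.array([1.0, 1.0] + [1e-4] * 48)

for lam in (1e-3, 1e-2, 1e-1, 10.0):
    print(
        f"lam={lam:g}  low_rank={effective_dimension(low_rank, lam):.3f}"
        f"  decaying={effective_dimension(decaying, lam):.3f}"
    )
```

For the first spectrum the value never exceeds the number of nonzero eigenvalues, whatever the penalty. For the second, a modest penalty already brings the value close to the number of dominant eigenvalues, which is the regime the text refers to as a large percentage of small eigenvalues.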
In the next subsection, we demonstrate that a low effective dimension is naturally attained when the redundancy can be approximated by a low-rank decomposition. Our finite sample analysis on the high-dimensional setting, presented in Section 5, relies on an upper bound of the low effective dimension, utilizing a measure of the low-rank approximation introduced in the following subsection (Lemma H.1). When , with offers an interpretation of our upper bound on the excess risk and sample complexity (see details in Theorem 5.2 and Remark 5.3).
4.1 Low-rank Approximation of Redundancy
Recall, redundancy refers to information in useful for predicting but not for the label . For instance, in computer vision tasks, consider a label determined by the object of interest within a surrounding background. Redundancy arises when the pretext task captures background information unrelated to the label. If the background features simple patterns, such as sky, grassland, or beach, this redundancy can be considered low-rank. Consequently, it is relatively easy to eliminate such redundancy in downstream tasks (recall the cancellation of sine functions below Definition 2.3). In the following, we present the technical details of the low-rank approximation of redundancy.
Denote and recall that reduces to under conditional mean independence. Assume that the necessary and sufficient condition in Lemma 3.2 is satisfied,
where and denotes the last columns of . If the (centered) redundancy admits a low-rank decomposition, i.e., for some and , where , we get
(7)
which has at most linearly independent components, as and . This shows that the effective dimension of is bounded by for any .
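The counting argument behind this bound can be checked on simulated data. In the sketch below the centered redundancy is an exact rank-r linear image of a few nonlinear functions of the input, so the centered representation should have numerical rank equal to the label dimension plus r. All names (`A`, `B`, `h`, `f_star`), the dimensions, and the choice of nonlinear functions are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
k_minus_1, r, d = 2, 3, 20
A = rng.normal(size=(d, d))               # assumed invertible mixing matrix
B = rng.normal(size=(d - k_minus_1, r))   # assumed low-rank factor of the redundancy

def f_star(x1):
    # First k-1 class probabilities of a three-class softmax; x1 has shape (n,).
    logits = np.stack([x1, -x1, np.zeros_like(x1)], axis=1)
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return p[:, :k_minus_1]

def h(x1):
    # r nonlinear functions of x1 driving the redundancy.
    return np.stack([np.sin(x1), np.cos(2 * x1), x1 ** 2], axis=1)

x1 = rng.normal(size=2000)
redundancy = 1.0 + h(x1) @ B.T                    # constant shift plus rank-r part
psi = np.hstack([f_star(x1), redundancy]) @ A.T   # rows: A @ [f_star; redundancy]

centered = psi - psi.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
numerical_rank = int(np.sum(s > 1e-8 * s[0]))
print("numerical rank of centered representation:", numerical_rank)
```

Although the representation has twenty coordinates, only five directions carry variance in this construction, so a downstream regression effectively works in a five-dimensional space.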
Since the low-rank decomposition may not hold exactly for a chosen , we identify of the form that best approximates . Specifically, for any fixed s.t. , we consider , and define
(8)
with minimizers and we use the shorthand for any pair of optimizer . Without loss of generality, we normalize and assume . The low-rank approximation error is averaged so that does not grow with . A challenge for analyzing is that the minimizers do not have closed-form expressions. When follows a Gaussian distribution, we show (8) reduces to a weighted low-rank approximation problem (with no randomness) in Appendix E. However, these weighted problems do not have closed-form expressions in general (Srebro and Jaakkola, 2003; Dutta and Li, 2017). Therefore, we leave the further investigations of for future work. For , we simply define
which measures how approximately conditional mean independence (i.e., ) holds. There is a tradeoff between the effective dimension of (which is no greater than ) and the approximation quality, as the approximation level is non-increasing as increases. An important special case of the low-rank approximation is when the encoding functions are smooth.
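The tradeoff between rank and approximation quality can be illustrated with a simpler, unweighted stand-in. The sketch below does not compute the quantity in (8), whose minimizers have no closed form; it uses the truncated singular value decomposition of sampled, centered redundancy vectors, which by the Eckart-Young theorem gives the best rank-r approximation in Frobenius norm. The function name `lowrank_error_curve` and the synthetic data are hypothetical.

```python
import numpy as np

def lowrank_error_curve(samples):
    # samples: (n, m) array whose rows are draws of a redundancy vector.
    # Entry r of the result is the mean squared error of the best rank-r
    # approximation of the centered sample matrix (Eckart-Young).
    centered = samples - samples.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    energy = s ** 2
    # Suffix sums: squared singular values beyond the first r, then 0 at full rank.
    tail = np.concatenate([np.cumsum(energy[::-1])[::-1], [0.0]])
    return tail / centered.size

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
H = np.stack([np.sin(x1), x1 ** 2], axis=1)      # two latent functions of x1
B = rng.normal(size=(8, 2))
samples = H @ B.T + 0.01 * rng.normal(size=(500, 8))

curve = lowrank_error_curve(samples)
for r, err in enumerate(curve):
    print(f"rank {r}: averaged error {err:.2e}")
```

The curve is non-increasing in the rank and drops sharply once the rank reaches the number of latent functions, mirroring the tradeoff between effective dimension and approximation level.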
Example 4.1(smooth encoding function).
Consider a binary classification problem with a scalar predictor , i.e., , assume that the encoding function is twice continuously differentiable, then its second-order Taylor expansion at is given by
where we can choose so that . This provides a rank-two approximation for such that , where . This example can be generalized to high-order, multi-class, and multivariate cases, and we provide the details in Appendix D.
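The rank-two structure of a second-order Taylor approximation can be verified directly. The sketch below uses a hypothetical smooth vector-valued function with sinusoidal coordinates, chosen because its derivatives are available in closed form; the names `theta`, `taylor2`, `omega`, and `phase` are assumptions and do not come from the paper.

```python
import numpy as np

omega = np.array([0.5, 1.0, 1.5, 2.0])   # hypothetical frequencies
phase = np.array([0.0, 0.3, 0.6, 0.9])   # hypothetical phases

def theta(x):
    # Smooth vector-valued function of a scalar input.
    return np.sin(omega * x + phase)

def taylor2(x, a=0.0):
    # Second-order Taylor expansion of theta around the point a.
    d1 = omega * np.cos(omega * a + phase)          # first derivative at a
    d2 = -omega ** 2 * np.sin(omega * a + phase)    # second derivative at a
    return theta(a) + d1 * (x - a) + 0.5 * d2 * (x - a) ** 2

xs = np.linspace(-0.3, 0.3, 25)

# The input-dependent part of the expansion lies in the span of two fixed vectors.
varying = np.stack([taylor2(x) - theta(0.0) for x in xs])
s = np.linalg.svd(varying, compute_uv=False)
rank = int(np.sum(s > 1e-10 * s[0]))

# The remainder is bounded by max|third derivative| * |x - a|^3 / 6.
worst_gap = max(np.max(np.abs(theta(x) - taylor2(x))) for x in xs)
print("rank of varying part:", rank, " worst remainder:", worst_gap)
```

The approximation error shrinks cubically with the distance to the expansion point, so the rank-two approximation is accurate when the input concentrates near that point.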
To understand the impact of the size of on how approximately the matching holds (or how small the approximation error is), we derive the following upper bound via a ridge-type estimator. Unlike ridge-type estimators used in practice, the parameter that restricts the size of the coefficients is determined by the generating mechanism of .
Lemma 4.2.
Consider and corresponding to , we have
where the minimum of the RHS is attained at . The equality holds with the RHS being zero when .
5 Finite Sample Analysis
Let be a fixed true parameter for the downstream task. Recall that the excess risk is defined as . Under conditional mean independence, observe that is a feature vector with at most (out of ) features that are linearly independent. Since the number of classes is often much smaller than the dimension of the learned representation , the design matrix for the downstream task is of low rank. This enables a finite-sample bound on the excess risk with sample complexity (Lee et al., 2021), where the bound is independent of the dimension . In the following, we provide a finite-sample analysis of SSL in the general setting when conditional independence can be violated, based on the low-rank approximation defined in (8).
First, we introduce a few technical assumptions. Let , , and denote the covariance matrix of , , and , respectively.
Assumption 1
We assume is -sub-Gaussian, and the whitened feature vectors and are -sub-Gaussian. (Footnote: When or is not invertible, the whitened feature vector is defined through the generalized inverse.)
Assumption 2
There exists s.t. the following holds almost surely,
;
.
Remark 5.1.
A similar assumption has been made in (Lee et al., 2021, Assumption 3.3), which is motivated by (Hsu et al., 2012, Condition 4).
Let denote the largest eigenvalue of a symmetric real matrix such that , denote its smallest nonzero eigenvalue, and denote the set of all its eigenvalues. When is a good approximation of , we expect and to be close to zero matrices. Therefore, we consider restricting the largest eigenvalues of the two matrices, respectively. A generic bound is provided in (Wolkowicz and Styan, 1980), that is , where is the variance of . The equality holds when the smallest eigenvalues are equal. However, this bound can be quite loose when is large. Instead, we make the following assumption.
Assumption 3
For some universal constants and ,
;
.
Both inequalities require that the average eigenvalue is comparable to the largest eigenvalue. The assumption can be unrealistic when or has mostly zero eigenvalues but a few large positive eigenvalues. We explain why such a case will not happen when . Case I: When , there exists such that and , in which case the inequalities are satisfied with . Case II: In settings when is comparable to (i.e., is a fraction of ), satisfies , and thus the rank of should be greater than , meaning that most of the eigenvalues are nonzero. Similarly, should have at least linearly independent components, i.e., . Given that serves as an approximation for with a lower effective dimension, we make the following technical assumption on the rank.
Assumption 4
and .
Since has linearly independent components while is introduced to approximate independent components out of them, is expected to have fewer independent components than .
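The generic eigenvalue bound of Wolkowicz and Styan (1980), mentioned before Assumption 3, can be checked numerically. The sketch assumes the commonly stated form of the bound for a symmetric n-by-n matrix: the largest eigenvalue is at most m + s * sqrt(n - 1), where m and s squared are the mean and the variance of the eigenvalues, both computable from traces. The function name `wolkowicz_styan_upper` and the test matrices are hypothetical.

```python
import numpy as np

def wolkowicz_styan_upper(mat):
    # Upper bound on the largest eigenvalue of a symmetric matrix from
    # trace(mat) and trace(mat @ mat) alone.
    n = mat.shape[0]
    m = np.trace(mat) / n                      # mean of the eigenvalues
    s2 = np.trace(mat @ mat) / n - m ** 2      # variance of the eigenvalues
    return m + np.sqrt(max(s2, 0.0) * (n - 1))

rng = np.random.default_rng(4)

# Equality case: all eigenvalues except the largest are equal.
tight = np.diag([5.0, 1.0, 1.0, 1.0])
print("tight case:", np.linalg.eigvalsh(tight)[-1], wolkowicz_styan_upper(tight))

# A full-rank covariance-like matrix in a larger dimension.
g = rng.normal(size=(200, 200))
cov = g @ g.T / 200
lam_max = np.linalg.eigvalsh(cov)[-1]
bound = wolkowicz_styan_upper(cov)
print("largest eigenvalue:", lam_max, " bound:", bound)
```

In the second case the bound holds but exceeds the largest eigenvalue by a wide margin, which illustrates why the bound becomes loose in large dimensions and why a separate assumption on the eigenvalues is used instead.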
Theorem 5.2.
Under Assumptions 1 to 4, for any , if , the excess risk of the downstream task induced by is upper bounded, with probability at least , by
Remark 5.3.
When , if , we have .
The proof of Theorem 5.2 follows a similar idea to that of (Lee et al., 2021, Theorem ); a subtle yet important difference is that we consider approximation errors due to the violations of the exact matching while they consider approximation errors due to choices of the function class . When , the dominating rate of is , which shows that SSL enjoys a similar sample complexity as shown in (Lee et al., 2021) even when conditional independence is violated. We have demonstrated in Lemma 4.2 how the approximation error depends on the approximation level . We also provide the bound with respect to , stated below. The proof largely follows (Hsu et al., 2012, Theorem 16) and we only outline the main steps in Appendix H.
Corollary 5.4(Informal).
Under suitable assumptions, the excess risk of the downstream task induced by can be upper bounded by
with high probability, where .
For simplicity, the parameters that depend on the choice of are omitted. The bound requires even though , thus an approximation (8) with lower rank is expected in this more challenging setting. The term relies on the difference between and , as well as the choice of . When and , the dominating rate
is the same as that in Remark 5.3. This shows that low-rank structures enable SSL to share a similar excess risk upper bound and sample complexity in low- and high-dimensional settings.
6 Experiments
We propose a synthetic dataset and two computer vision tasks to examine the importance of the full rank condition on and the low-rank approximation quality. Recall that a necessary condition for the exact matching is that has full rank for every . For the synthetic dataset, we ensure that is of full rank and focus on the low-rank approximation. For image data, since is a latent matrix function, it is not straightforward to test whether has full rank in general. To this end, we design images of simple geometric shapes that can be seen as abstractions of real images and show how some geometric properties make the rank condition hold or fail. To further understand the importance of the low-rank approximation, we add background patterns to the MNIST dataset and show that certain patterns can lead to poor low-rank approximation. See more details of the experiments in Appendices I, J, and K. SSL approaches have achieved superior performance on large benchmark datasets, while the function class for downstream tasks is often much larger than linear models (e.g., MLPs), which is beyond the scope of our theoretical analysis. The implementation of our experiments is provided at https://github.com/dukang4655/reconstructive_ssl.
6.1 Synthetic Data
We use a synthetic dataset to verify our theoretical results when . First, we generate with , , and , where , where . See details of the model parameters in Appendix I. We compare SSL with two supervised learning (SL) procedures in two settings: I. Fix and vary the labeled sample size ; II. Fix and vary the low-rank approximation by considering . We consider two supervised learning procedures. SL1: Predicting by , and SL2: Predicting by . We use MLPs for the pretext task and the two supervised learning procedures. In Fig. 1, the performance of is roughly invariant with respect to since it does not use for the prediction, making it more robust than . This verifies that predictions using the parents of as predictors (which we call causal predictions) are more robust than non-causal predictions under a small sample size. The superior performance of SSL degrades as increases. When , we have according to the factorization in (7). In Fig. 1, when , a low-rank approximation (8) could lead to a large approximation error for . As a consequence, the advantage of SSL over SL1 is much smaller compared with the case with . This indicates that a good low-rank approximation is not only sufficient but also necessary for SSL. In the other setting when is fixed to but the sample size varies, as shown in Fig. 1, the performance of SSL improves slowly with the increasing sample size, since it already achieves performance that is close to the optimal (i.e., the performance of SL1 with a large ) under a small . SL1 starts to catch up with SSL when , while the accuracy of SL2 does not consistently improve as increases.
6.2 Computer Vision Tasks
6.2.1 Geometric Shapes (On the Rank Condition)
In computer vision applications, it is common that the dimension of the learned representation is much larger than the labeled sample size (i.e., ) for SSL. We design a simple task to help understand how the patterns in an image make SSL work or fail; the task is inspired by (Gidaris et al., 2018), where the pretext task is to predict the rotated angles of images. We consider as a random image of some objects, where the objects have random sizes and random locations. The goal is to classify the shape of the object . We created by randomly rotating by or degrees. Observe that the location and the size of an object are redundant features for predicting its shape and orientation. This stylized setting, even though much simpler than real-world images, is designed this way on purpose in order for the redundancy variable to have low-rank approximations. According to (4), the column of can be viewed as a feature vector for the rotation angle corresponding to the class of objects. Since we consider the classification of two classes (i.e., ), the condition requires that the two feature vectors should not be similar. To verify this necessary condition, we consider two pairs of objects:
Triangle vs. Tangent Circles (Fig. 2(a)). In this case, the identification of orientation is based on characteristics specific to the two shapes. For triangles, it is natural to focus on the edges and vertices, while those characteristics are not even defined for circles. Thus, we think the rank condition is approximately satisfied. We examine this observation using Grad-CAM (Selvaraju et al., 2017) that visualizes the contributing features that the model extracted from the image (see Fig. 4 from Appendix J). Similarly for the pair of objects below.
Triangle vs. Pentagon (Fig. 2(b)). In this case, the orientation of the two objects can be identified in similar ways, mainly based on the edges and vertices. Hence, the columns of are close to linearly dependent for different . As a consequence, the necessary condition for the exact matching is violated.
We compare SSL with SL under different labeled sample sizes . For Triangle vs. Tangent Circles, as shown in Fig. 2, SSL consistently outperforms SL for small sample sizes (i.e., ). The performance of SSL improves more slowly than that of SL for sufficiently large , since the prediction error of SSL will be dominated by the population error instead of the estimation error. Recall that our finite-sample bound on the excess risk in Corollary 5.4 converges to the sum of the approximation error and an error term depending on as . In contrast, SSL behaves very differently for Triangle vs. Pentagon. From Fig. 2(b), the accuracy of SSL improves more slowly as increases, potentially due to the large approximation error. Overall, SSL has almost no advantage over SL. This experiment shows that the rank condition is crucial for SSL.
6.2.2 Stylized MNIST (On the Low-Rank Approximation)
To examine how the redundancy affects the performance of SSL, we consider the same rotation prediction task for a stylized MNIST dataset illustrated in Fig. 3, where the density of the background pattern varies randomly. A key observation is that the dot pattern does not help identify the image orientation, so it is not encoded into the orientation variable as redundancy. On the contrary, one can tell the orientation of the image simply by the orientation of the dash pattern, thus the pretext task will extract features from the dash pattern as redundant information. Again, we use Grad-CAM to visualize our observation in Fig. 7 from Appendix K. Consequently, a dense dash pattern can lead to poor low-rank approximation. In Fig. 6 from Appendix K, the performance of SSL is almost invariant to the density of the dot pattern while the performance of SL drops as the density increases. In contrast, SSL is quite sensitive to a sparse dash pattern and the performance gets worse as the density increases (see Fig. 3). We have tested the dot vs. dash patterns for the geometric shape images, and similar results are observed as shown in Fig. 5 from Appendix J.
7 Discussion
Many important questions remain to be studied, and we list a few in this section. One natural next step is to study nonlinear function classes for the downstream task and characterize the corresponding sufficient and necessary conditions. Since our theoretical results can potentially provide guidance for developing SSL procedures, especially for designing pretext tasks, it would be worthwhile to design systematic and extensive experiments to better bridge the theories and practical designs. Besides the superior performance under limited labeled samples, the other major advantage of SSL is that the learned representation can be useful for a diverse class of downstream tasks; a theoretical understanding of its ability to generalize to new tasks or unseen environments (e.g., by exploiting invariance; Du and Xiang, 2023b) is of great importance.
References
Arora et al. (2019)
Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi.
A theoretical analysis of contrastive unsupervised representation learning.
arXiv preprint arXiv:1902.09229, 2019.
Candes and Tao (2007)
Emmanuel Candes and Terence Tao.
The Dantzig selector: Statistical estimation when p is much larger than n.
2007.
Devlin et al. (2018)
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
BERT: Pre-training of deep bidirectional transformers for language understanding.
arXiv preprint arXiv:1810.04805, 2018.
Du and Xiang (2022)
Kang Du and Yu Xiang.
An invariant matching property for distribution generalization under intervened response.
In 2022 30th European Signal Processing Conference (EUSIPCO), pages 1387–1391. IEEE, 2022.
Du and Xiang (2023a)
Kang Du and Yu Xiang.
Generalized invariant matching property via lasso.
In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023a.
Du and Xiang (2023b)
Kang Du and Yu Xiang.
Learning invariant representations under general interventions on the response.
IEEE Journal on Selected Areas in Information Theory, 2023b.
Dutta and Li (2017)
Aritra Dutta and Xin Li.
On a problem of weighted low-rank approximation of matrices.
SIAM Journal on Matrix Analysis and Applications, 38(2):530–553, 2017.
Efron (1986)
Bradley Efron.
How biased is the apparent error rate of a prediction rule?
Journal of the American Statistical Association, 81(394):461–470, 1986.
Gidaris et al. (2018)
Spyros Gidaris, Praveer Singh, and Nikos Komodakis.
Unsupervised representation learning by predicting image rotations.
arXiv preprint arXiv:1803.07728, 2018.
Gui et al. (2023)
Jie Gui, Tuo Chen, Qiong Cao, Zhenan Sun, Hao Luo, and Dacheng Tao.
A survey of self-supervised learning from multiple perspectives: Algorithms, theory, applications and future trends.
arXiv preprint arXiv:2301.05712, 2023.
Hastie et al. (2009)
Trevor Hastie, Robert Tibshirani, and Jerome H Friedman.
The elements of statistical learning: data mining, inference, and prediction, volume 2.
Springer, 2009.
Hsu et al. (2012)
Daniel Hsu, Sham M Kakade, and Tong Zhang.
Random design analysis of ridge regression.
In Conference on Learning Theory, pages 9.1–9.24. JMLR Workshop and Conference Proceedings, 2012.
Lee et al. (2021)
Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo.
Predicting what you already know helps: Provable self-supervised learning.
Advances in Neural Information Processing Systems, 34:309–323, 2021.
Ozbulak et al. (2023)
Utku Ozbulak, Hyun Jung Lee, Beril Boga, Esla Timothy Anzaku, Homin Park, Arnout Van Messem, Wesley De Neve, and Joris Vankerschaver.
Know your self-supervised learning: A survey on image-based generative and discriminative training.
arXiv preprint arXiv:2305.13689, 2023.
Pathak et al. (2016)
Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros.
Context encoders: Feature learning by inpainting.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
Pearl (2009)
Judea Pearl.
Causality.
Cambridge University Press, 2009.
Radford et al. (2018)
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al.
Improving language understanding by generative pre-training.
2018.
Saunshi et al. (2020)
Nikunj Saunshi, Sadhika Malladi, and Sanjeev Arora.
A mathematical exploration of why language models help solve downstream tasks.
arXiv preprint arXiv:2010.03648, 2020.
Selvaraju et al. (2017)
Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra.
Grad-CAM: Visual explanations from deep networks via gradient-based localization.
In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
Srebro and Jaakkola (2003)
Nathan Srebro and Tommi Jaakkola.
Weighted low-rank approximations.
In Proceedings of the 20th international conference on machine learning (ICML-03), pages 720–727, 2003.
Teng et al. (2022)
Jiaye Teng, Weiran Huang, and Haowei He.
Can pretext-based self-supervised learning be boosted by downstream data? A theoretical analysis.
In International Conference on Artificial Intelligence and Statistics, pages 4198–4216. PMLR, 2022.
Tibshirani (1996)
Robert Tibshirani.
Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.
Tosh et al. (2021)
Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu.
Contrastive learning, multi-view redundancy, and linear models.
In Algorithmic Learning Theory, pages 1179–1206. PMLR, 2021.
Vincent et al. (2010)
Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou.
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.
Journal of machine learning research, 11(12), 2010.
Wei et al. (2021)
Colin Wei, Sang Michael Xie, and Tengyu Ma.
Why do pretrained language models help in downstream tasks? An analysis of head and prompt tuning.
Advances in Neural Information Processing Systems, 34:16158–16170, 2021.
Wolkowicz and Styan (1980)
Henry Wolkowicz and George PH Styan.
Bounds for eigenvalues using traces.
Linear algebra and its applications, 29:471–506, 1980.
Zhang et al. (2016)
Richard Zhang, Phillip Isola, and Alexei A Efros.
Colorful image colorization.
In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 649–666. Springer, 2016.
Zhang (2005)
Tong Zhang.
Learning bounds for kernel regression using effective data dimensionality.
Neural computation, 17(9):2077–2098, 2005.
Appendix A Conditional Mean Independence
Recall that when is a constant function, we have and we require rather than .
Proposition A.1.
Model (4) holds with being a constant function if and only if .
Proof A.2.
First, assume that holds with . It follows immediately that
and , where we use the fact that . Thus we have . Now we prove the other direction. Assume that , then
where has columns ’s, which implies model (A.1) in the form of .
For simplicity of notation, we prove the lemma without considering the orthogonal term, since this term can be directly added to the final expression of . Recall that the dimension of is by . According to Proposition 3.1, it is sufficient to show that is equivalent to (6). The “if” part is immediate since will lead to , . In the following, we prove the other direction. When the exact matching holds, recall that the full rank of implies that has full row rank since . The QR decomposition of gives , where is an invertible lower-triangular matrix and is an orthonormal matrix. Then, implies with . Due to the zero columns in , the first rows of have to be , while the other rows, denoted by , can be arbitrary. Therefore, we obtain
where is the product of and some elementary matrices introduced to transform to an identity matrix .
Appendix D Low-Rank Approximation of Smooth Encoding Functions
In this section, we study how the smoothness of the encoding function enables the low-rank approximation. Specifically, we will construct the approximation in (8) explicitly with polynomial functions. For simplicity, we present the idea for second-order approximation, while the higher-order cases can be derived in a similar manner.
Let be a twice continuously differentiable matrix function. The Taylor expansion of its column at is given by
(12)
where the row of is the derivative of evaluated at and is the Hessian matrix of evaluated at . We represent in a matrix form in (12) by introducing and the coefficient matrix consisting of the (scaled) first- and second-order derivatives. This allows us to approximate by
Let the maximum rank of the matrices be , then there exists and such that
(13)
which is enabled by the decomposition for each . With the continuity of , we can fix so that by the mean value theorem. The high-order remainder term can be ignored when the third derivatives of are all zero. If this is not the case, we can include higher-order terms in until the remainder term is small enough. If the maximum rank is not small, one can still consider the low-rank approximation of , and there will be an additional approximation error term in (13).
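The construction can be illustrated numerically with a minimal sketch, under assumptions that are ours rather than part of the derivation: a scalar-valued quadratic test function (so that all third derivatives vanish and the remainder term is exactly zero), a Hessian of known rank, and the hypothetical helper names quadratic_features and taylor_coefficients. The sketch checks that the second-order expansion is linear in the stacked features of the displacement, and that a rank-r factorization of the Hessian reduces the quadratic part to r squared projections.

```python
import numpy as np

def quadratic_features(delta):
    """Stack [1, delta, vec(delta delta^T)] into a single feature vector."""
    return np.concatenate(([1.0], delta, np.outer(delta, delta).ravel()))

def taylor_coefficients(value, grad, hess):
    """Coefficients aligned with quadratic_features: [f(x0), grad, vec(H)/2]."""
    return np.concatenate(([value], grad, 0.5 * hess.ravel()))

rng = np.random.default_rng(0)
d, r = 5, 2

# Hypothetical test function f(x) = 0.5 x^T H x + b^T x with a rank-r Hessian.
U = rng.standard_normal((d, r))
s = np.array([1.5, -0.7])
H = U @ np.diag(s) @ U.T
b = rng.standard_normal(d)

def f(x):
    return 0.5 * x @ H @ x + b @ x

x0 = rng.standard_normal(d)          # expansion point
x = x0 + rng.standard_normal(d)      # evaluation point
delta = x - x0
grad0 = H @ x0 + b                   # gradient of f at x0 (H is symmetric)

# Second-order expansion written as a linear function of the features.
approx_full = taylor_coefficients(f(x0), grad0, H) @ quadratic_features(delta)

# Same expansion using the factorization H = sum_k s_k u_k u_k^T:
# the quadratic part needs only r squared projections instead of d^2 terms.
proj = U.T @ delta
approx_low_rank = f(x0) + grad0 @ delta + 0.5 * np.sum(s * proj ** 2)

f_x = f(x)
```

Because the test function is quadratic, both approximations coincide with the function value; for a general smooth function they would differ from it by the remainder term.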
Appendix E SSL under Gaussian Distribution
Even though our formulation focuses on the classification setting, the extension of our results to the Gaussian case is straightforward. Assume are jointly Gaussian, where is a scalar Gaussian variable. Let for any random vectors and . Then, and . If and has full rank, it is straightforward to see that an exact matching holds with . Concretely, there exist and such that
(14)
where .
Since the encoding function is not well-defined when is continuous, we formulate the low-rank approximation by
which aims to find a weighted low-rank approximation for .
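As a point of reference, the following minimal sketch treats only the unweighted special case, which is an assumption and not the weighted problem formulated above. In that case the best rank-k approximation in Frobenius norm is given by the truncated singular value decomposition (the Eckart–Young theorem), and the approximation error equals the norm of the discarded singular values. The matrix A and the helper truncated_svd are hypothetical.

```python
import numpy as np

def truncated_svd(A, k):
    """Best rank-k approximation of A in Frobenius norm (unweighted case)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]     # keep the k leading singular triplets
    return A_k, s

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 10))          # hypothetical matrix to approximate
k = 2
A_k, s = truncated_svd(A, k)

err = np.linalg.norm(A - A_k, ord="fro")
tail = np.sqrt(np.sum(s[k:] ** 2))        # norm of the discarded singular values
```

With general weights the truncated SVD is not guaranteed to be optimal, so this sketch should be read only as a baseline for the weighted formulation.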
where the first inequality follows from the sub-multiplicativity of the matrix norm and the triangle inequality, and the last two inequalities are due to the triangle inequality and the fact that . When , note that has rank at most , so there exists at least one solution for the equation . The expression of is the standard form of ridge-type estimators.
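For readers less familiar with ridge-type estimators, the following minimal sketch illustrates two standard facts on toy data. It assumes the unnormalized closed form (X^T X + lam I)^{-1} X^T Y, whose scaling may differ from the estimator analyzed here; all names and sizes are hypothetical. First, with fewer samples than features the Gram matrix is rank deficient, so the regularization is what makes the solution unique. Second, the same estimator can be computed from an n-by-n system, which is cheaper when the feature dimension is much larger than the number of labeled samples.

```python
import numpy as np

def ridge_primal(X, Y, lam):
    """Solve the d-by-d system (X^T X + lam I) W = X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def ridge_dual(X, Y, lam):
    """Equivalent form W = X^T (X X^T + lam I)^{-1} Y using an n-by-n system."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), Y)

rng = np.random.default_rng(0)
n, d, k = 10, 30, 3                  # fewer samples than features (toy sizes)
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, k))
lam = 0.5

W_primal = ridge_primal(X, Y, lam)
W_dual = ridge_dual(X, Y, lam)
gram_rank = np.linalg.matrix_rank(X.T @ X)   # at most n, hence singular here
```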
Lemma G.1 (Concentration on the covariance matrix (Lee et al., 2021)).
Let have i.i.d. rows, where each row is -sub-Gaussian with covariance , and let have rank and be independent of . For any , if , then with probability at least , we have
Let be a projection matrix and let be a matrix with i.i.d. rows, where each row of is mean zero (conditioning on being rank ) -sub-Gaussian. For any , with probability at least ,
To prove for symmetric matrices and , one can simply prove . This holds immediately for zero eigenvalues ’s if . Therefore, by Assumption 4, we will focus on the case when has only positive eigenvalues. First,
In the following, we will use the fact that for any symmetric matrix .
Using Assumption 3,
which implies,
for every and . This immediately leads to .
Finally, recall the fact that . By Assumption 3, we have
First, recall that and the shorthand . By the triangle inequality, we have
Denote .
Recall that . Then, we can write , with satisfying by the tower property. The definition of implies
By rearranging the terms, we get
We bound the first two inner products in the following, and the other two follow similarly. First,
(15)
where the inequality is due to Assumption 2 and the covariance concentration in Lemma G.1, and the last inequality is simply due to the covariance concentration.
The replacement of by is by Lemma G.3. Let denote the projection matrix defined with respect to ; we have
(16)
where the last bound is due to Lemma G.2 and the replacement of by follows from the covariance concentration and Lemma G.3. Since we make no assumptions on the rank of , it has at most rank . Similarly, we get
Under Assumptions 3 and 4, for , there exists such that
Remark H.2.
It is known that (Hsu et al., 2012).
When , we have , where the equality holds when .
Proof H.3.
Let and denote the eigenvalues of and , respectively. Recall that Assumptions 3 and 4 imply as shown in the proof of Lemma G.3, then we have
where for and since . Finally, we get
where .
Corollary 5.4. Under (Hsu et al., 2012, Conditions 2 and 4) and the assumptions in Lemma H.1, the excess risk of the downstream task can be upper bounded by
with high probability, where .
We only outline the main steps and refer the readers to (Hsu et al., 2012, Theorem 16) for details. First, by the triangle inequality, we have . The bound on the second term can be obtained as discussed below.
The last term is simply due to by Lemma H.1, where for . Observe that the ridge estimator with as the target variable is equivalent to the row of the ridge estimator with as the target variable. Thus we can provide a bound on for each according to (Hsu et al., 2012, Theorem 16). Summing up the inequalities gives ,
where is upper bounded in (Hsu et al., 2012, Theorem 16).
Appendix I Synthetic Data: Details of the Data Generation
Let . Note that , where is the Gamma function. The label is determined by as follows: when ; when ; when . Then, let where and are matrices with i.i.d. entries from and . The element of is given by . The sample sizes of the pretext training data and testing data are and , respectively. The MLPs used for the pretext task and two SL procedures all have two fully connected hidden layers with ReLU activation. The batch size is , the number of epochs is , and the learning rate is .
Appendix J Further Details of the Computer Vision Task
The sample sizes of the pretext task and testing are fixed to be and , respectively. The edge of the triangle is sampled from , the radius of the circles is sampled from , and the pentagon is drawn within a circle with radius sampled from . The sizes are chosen to ensure that the average areas of the objects are similar. The pretext task and SL both use convolutional neural networks (CNNs) consisting of two convolutional layers and two fully connected layers with ReLU activation. The learned representation is obtained from the second convolutional layer of the CNN, which has a dimension . Since the rotated image has the same label as , we also use the rotated images as additional labeled data for the downstream task and SL. In the pretext task, the batch size is , the number of epochs for training is , and the learning rate is . For SL, the batch size is modified to be since the sample size is much smaller. For ridge regression, we choose the shrinkage parameter from numbers evenly spaced on a log scale over with -fold cross-validation.
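The selection of the shrinkage parameter can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in the experiments: the grid np.logspace(-3, 3, 13), the choice of 5 folds, the one-hot squared-loss targets, the toy data, and the names ridge_fit and select_lambda are all hypothetical placeholders.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def select_lambda(X, Y, lams, n_folds, seed=0):
    """Return the grid value with the smallest mean validation squared error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), n_folds)
    cv_err = np.zeros(len(lams))
    for a, lam in enumerate(lams):
        for i, val in enumerate(folds):
            # Train on every fold except the i-th, validate on the i-th.
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            W = ridge_fit(X[train], Y[train], lam)
            cv_err[a] += np.mean((X[val] @ W - Y[val]) ** 2) / n_folds
    return lams[int(np.argmin(cv_err))], cv_err

# Toy stand-in for learned features: more dimensions than labeled samples.
rng = np.random.default_rng(1)
n, d = 40, 100
X = rng.standard_normal((n, d))
labels = (X[:, 0] > 0).astype(int)
Y = np.eye(2)[labels]                        # one-hot targets for two classes
lams = np.logspace(-3, 3, 13)                # placeholder log-scale grid
best_lam, cv_err = select_lambda(X, Y, lams, n_folds=5)
```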
In Section 6.2, we explain that the unsatisfactory performance of SSL for Triangle vs. Pentagon is potentially due to the similar characteristics (i.e., the edges and vertices) that determine the orientation of the object. From a theoretical perspective, the full rank condition on (which is a necessary condition for the exact matching) is violated since the columns of are approximately linearly dependent. This does not happen for Triangle vs. Tangent Circles since the orientation of the circles is determined by curved edges. To further support our observation, we use Grad-CAM (Selvaraju et al., 2017) to visualize the contributing features that the pretext model used for rotation prediction. To highlight the major components, we only show pixels of the heatmap with top-20% intensity in Fig. 4.
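The top-20% thresholding of a heatmap can be sketched as below. The heatmap here is a random stand-in rather than an actual Grad-CAM output, and the helper name top_fraction_mask is hypothetical; only the thresholding step is illustrated.

```python
import numpy as np

def top_fraction_mask(heatmap, fraction=0.2):
    """Boolean mask of the pixels whose intensity lies in the top `fraction`."""
    threshold = np.quantile(heatmap, 1.0 - fraction)
    return heatmap >= threshold

rng = np.random.default_rng(0)
heatmap = rng.random((32, 32))            # stand-in for a Grad-CAM heatmap
mask = top_fraction_mask(heatmap, fraction=0.2)
masked = np.where(mask, heatmap, 0.0)     # zero out everything below the cut
```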
Similarly to the stylized MNIST dataset, we add dot and dash patterns to the image background. The spacing in the patterns is sampled from .
Appendix K Experiments on the MNIST Dataset
The sample sizes for the pretext task and testing are and , respectively. We compare the performance of SSL and SL under different labeled sample sizes . The spacing in the sparse and dense patterns is sampled from and , respectively. We randomly shift each pattern by to avoid the position of the pattern being correlated with the image orientation. We use the same CNN configuration as in the geometric shape task.
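The random shift of a background pattern can be sketched as follows. This is a minimal illustration under assumptions of ours: a circular shift is used so that the pattern wraps around the image border, and the image size, the spacing, the maximum shift, and the helper names dot_pattern and random_shift are hypothetical placeholders rather than the values used in the experiments.

```python
import numpy as np

def dot_pattern(size, spacing):
    """Binary image with a dot every `spacing` pixels along both axes."""
    pattern = np.zeros((size, size))
    pattern[::spacing, ::spacing] = 1.0
    return pattern

def random_shift(pattern, max_shift, rng):
    """Circularly shift the pattern by a random offset along each axis."""
    dy, dx = rng.integers(0, max_shift + 1, size=2)
    return np.roll(pattern, shift=(int(dy), int(dx)), axis=(0, 1))

rng = np.random.default_rng(0)
pattern = dot_pattern(size=28, spacing=4)        # placeholder size and spacing
shifted = random_shift(pattern, max_shift=3, rng=rng)
```

A circular shift keeps the number of pattern pixels unchanged, so only the position of the pattern varies across samples.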