-
An Alternate View on Optimal Filtering in an RKHS
Authors:
Benjamin Colburn,
Jose C. Principe,
Luis G. Sanchez Giraldo
Abstract:
Kernel Adaptive Filtering (KAF) algorithms are mathematically principled methods that search for a function in a Reproducing Kernel Hilbert Space (RKHS). While they work well for tasks such as time series prediction and system identification, they are plagued by a linear relationship between the number of training samples and model size, hampering their use on the very large data sets common in today's data-saturated world. Previous methods try to solve this issue by sparsification. We describe a novel view of optimal filtering which may provide a route towards solutions in an RKHS which do not necessarily have this linear growth in model size. We do this by defining an RKHS in which the time structure of a stochastic process is still present. Using correntropy [11], an extension of the idea of a covariance function, we create a time-based functional which describes some potentially nonlinear desired mapping function. This form of solution may provide a fruitful line of research for creating more efficient representations of functionals in an RKHS, while theoretically providing computational complexity in the test set similar to that of the Wiener solution.
Submitted 19 December, 2023;
originally announced December 2023.
-
FroSSL: Frobenius Norm Minimization for Efficient Multiview Self-Supervised Learning
Authors:
Oscar Skean,
Aayush Dhakal,
Nathan Jacobs,
Luis Gonzalo Sanchez Giraldo
Abstract:
Self-supervised learning (SSL) is a popular paradigm for representation learning. Recent multiview methods can be classified as sample-contrastive, dimension-contrastive, or asymmetric network-based, with each family having its own approach to avoiding informational collapse. While these families converge to solutions of similar quality, it can be empirically shown that some methods are epoch-inefficient and require longer training to reach a target performance. Two main approaches to improving efficiency are covariance eigenvalue regularization and using more views. However, these two approaches are difficult to combine due to the computational complexity of computing eigenvalues. We present the objective function FroSSL which reconciles both approaches while avoiding eigendecomposition entirely. FroSSL works by minimizing covariance Frobenius norms to avoid collapse and minimizing mean-squared error to achieve augmentation invariance. We show that FroSSL reaches competitive accuracies more quickly than any other SSL method and provide theoretical and empirical support that this faster convergence is due to how FroSSL affects the eigenvalues of the embedding covariance matrices. We also show that FroSSL learns competitive representations on linear probe evaluation when used to train a ResNet18 on several datasets, including STL-10, Tiny Imagenet, and Imagenet-100.
Submitted 19 March, 2024; v1 submitted 4 October, 2023;
originally announced October 2023.
-
DiME: Maximizing Mutual Information by a Difference of Matrix-Based Entropies
Authors:
Oscar Skean,
Jhoan Keider Hoyos Osorio,
Austin J. Brockmeier,
Luis Gonzalo Sanchez Giraldo
Abstract:
We introduce an information-theoretic quantity with similar properties to mutual information that can be estimated from data without making explicit assumptions on the underlying distribution. This quantity is based on a recently proposed matrix-based entropy that uses the eigenvalues of a normalized Gram matrix to compute an estimate of the eigenvalues of an uncentered covariance operator in a reproducing kernel Hilbert space. We show that a difference of matrix-based entropies (DiME) is well suited for problems involving the maximization of mutual information between random variables. While many methods for such tasks can lead to trivial solutions, DiME naturally penalizes such outcomes. We compare DiME to several baseline estimators of mutual information on a toy Gaussian dataset. We provide examples of use cases for DiME, such as latent factor disentanglement and a multiview representation learning problem where DiME is used to learn a shared representation among views with high mutual information.
Submitted 27 July, 2023; v1 submitted 19 January, 2023;
originally announced January 2023.
-
The Representation Jensen-Rényi Divergence
Authors:
Jhoan Keider Hoyos Osorio,
Oscar Skean,
Austin J. Brockmeier,
Luis Gonzalo Sanchez Giraldo
Abstract:
We introduce a divergence measure between data distributions based on operators in reproducing kernel Hilbert spaces defined by kernels. The empirical estimator of the divergence is computed using the eigenvalues of positive definite Gram matrices that are obtained by evaluating the kernel over pairs of data points. The new measure shares similar properties to Jensen-Shannon divergence. Convergence of the proposed estimators follows from concentration results based on the difference between the ordered spectrum of the Gram matrices and the integral operators associated with the population quantities. The proposed measure of divergence avoids the estimation of the probability distribution underlying the data. Numerical experiments involving comparing distributions and applications to sampling unbalanced data for classification show that the proposed divergence can achieve state of the art results.
Submitted 1 June, 2022; v1 submitted 2 December, 2021;
originally announced December 2021.
-
A Stable Combinatorial Particle Swarm Optimization for Scalable Feature Selection in Gene Expression Data
Authors:
Hassen Dhrif,
Luis G. Sanchez Giraldo,
Miroslav Kubat,
Stefan Wuchty
Abstract:
Evolutionary computation (EC) algorithms, such as discrete and multi-objective versions of particle swarm optimization (PSO), have been applied to solve the feature selection (FS) problem, tackling the combinatorial explosion of search spaces that are peppered with local minima. Furthermore, high-dimensional FS problems, such as finding a small set of biomarkers to make a diagnostic call, pose an additional challenge, as such methods' ability to pick out the most important features must remain unchanged in decision spaces of increasing dimension and in the presence of irrelevant features. We developed a combinatorial PSO algorithm, called COMB-PSO, that scales up to high-dimensional gene expression data while still selecting the smallest subsets of genes that allow reliable classification of samples. In particular, COMB-PSO enhances the encoding, speed of convergence, control of divergence and diversity of the conventional PSO algorithm, balancing exploration and exploitation of the search space. Applying our approach to real gene expression data of different cancers, COMB-PSO finds gene sets of smallest size that allow a reliable classification of the underlying disease classes.
Submitted 24 January, 2019;
originally announced January 2019.
-
Multivariate Extension of Matrix-based Renyi's α-order Entropy Functional
Authors:
Shujian Yu,
Luis Gonzalo Sanchez Giraldo,
Robert Jenssen,
Jose C. Principe
Abstract:
The matrix-based Renyi's α-order entropy functional was recently introduced using the normalized eigenspectrum of a Hermitian matrix of the projected data in a reproducing kernel Hilbert space (RKHS). However, the current theory in the matrix-based Renyi's α-order entropy functional only defines the entropy of a single variable or mutual information between two random variables. In information theory and machine learning communities, one is also frequently interested in multivariate information quantities, such as the multivariate joint entropy and different interactive quantities among multiple variables. In this paper, we first define the matrix-based Renyi's α-order joint entropy among multiple variables. We then show how this definition can ease the estimation of various information quantities that measure the interactions among multiple variables, such as interactive information and total correlation. We finally present an application to feature selection to show how our definition provides a simple yet powerful way to estimate a widely-acknowledged intractable quantity from data. A real example on hyperspectral image (HSI) band selection is also provided.
Submitted 31 July, 2019; v1 submitted 23 August, 2018;
originally announced August 2018.
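
As a rough illustration of the joint entropy described in this abstract, the sketch below assumes the commonly cited form of the matrix-based functional, S_α(A) = log2(tr(A^α)) / (1 − α) for a trace-normalized Gram matrix A, and a joint entropy obtained from the normalized Hadamard (element-wise) product of the per-variable Gram matrices. The function names (`renyi_entropy`, `joint_entropy`, `total_correlation`) are hypothetical and are not taken from the authors' code; Gram matrices are assumed to have a constant positive diagonal, as a Gaussian kernel would give.

```python
import numpy as np

def renyi_entropy(K, alpha=2.0):
    """Matrix-based Renyi alpha-order entropy in bits (assumes alpha != 1, K PSD)."""
    A = K / np.trace(K)                               # normalize to unit trace
    eig = np.clip(np.linalg.eigvalsh(A), 0.0, None)   # drop tiny negative round-off
    return np.log2(np.sum(eig ** alpha)) / (1.0 - alpha)

def joint_entropy(grams, alpha=2.0):
    """Joint entropy from the Hadamard product of per-variable Gram matrices."""
    H = np.ones_like(grams[0], dtype=float)
    for K in grams:
        H = H * K                                     # element-wise product
    return renyi_entropy(H, alpha)                    # trace normalization happens inside

def total_correlation(grams, alpha=2.0):
    """Sum of marginal entropies minus the joint entropy."""
    return sum(renyi_entropy(K, alpha) for K in grams) - joint_entropy(grams, alpha)
```

With this form, a variable whose Gram matrix is all ones (every sample identical) contributes nothing to the joint entropy, which is one way to sanity-check an implementation.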
-
Correspondence of Deep Neural Networks and the Brain for Visual Textures
Authors:
Md Nasir Uddin Laskar,
Luis G Sanchez Giraldo,
Odelia Schwartz
Abstract:
Deep convolutional neural networks (CNNs) trained on objects and scenes have shown intriguing ability to predict some response properties of visual cortical neurons. However, the factors and computations that give rise to such ability, and the role of intermediate processing stages in explaining changes that develop across areas of the cortical hierarchy, are poorly understood. We focused on the sensitivity to textures as a paradigmatic example, since recent neurophysiology experiments provide rich data pointing to texture sensitivity in secondary but not primary visual cortex. We developed a quantitative approach for selecting a subset of the neural unit population from the CNN that best describes the brain neural recordings. We found that the first two layers of the CNN showed qualitative and quantitative correspondence to the cortical data across a number of metrics. This compatibility was reduced for the architecture alone rather than the learned weights, for some other related hierarchical models, and only mildly in the absence of a nonlinear computation akin to local divisive normalization. Our results show that the CNN class of model is effective for capturing changes that develop across early areas of cortex, and has the potential to facilitate understanding of the computations that give rise to hierarchical processing in the brain.
Submitted 7 June, 2018;
originally announced June 2018.
-
Integrating Flexible Normalization into Mid-Level Representations of Deep Convolutional Neural Networks
Authors:
Luis Gonzalo Sanchez Giraldo,
Odelia Schwartz
Abstract:
Deep convolutional neural networks (CNNs) are becoming increasingly popular models to predict neural responses in visual cortex. However, contextual effects, which are prevalent in neural processing and in perception, are not explicitly handled by current CNNs, including those used for neural prediction. In primary visual cortex, neural responses are modulated by stimuli spatially surrounding the classical receptive field in rich ways. These effects have been modeled with divisive normalization approaches, including flexible models, where spatial normalization is recruited only to the degree responses from center and surround locations are deemed statistically dependent. We propose a flexible normalization model applied to mid-level representations of deep CNNs as a tractable way to study contextual normalization mechanisms in mid-level cortical areas. This approach captures non-trivial spatial dependencies among mid-level features in CNNs, such as those present in textures and other visual stimuli, that arise from tiling high order features, geometrically. We expect that the proposed approach can make predictions about when spatial normalization might be recruited in mid-level cortical areas. We also expect this approach to be useful as part of the CNN toolkit, therefore going beyond more restrictive fixed forms of normalization.
Submitted 24 December, 2018; v1 submitted 5 June, 2018;
originally announced June 2018.
-
Rate-Distortion Auto-Encoders
Authors:
Luis G. Sanchez Giraldo,
Jose C. Principe
Abstract:
A rekindled interest in auto-encoder algorithms has been spurred by recent work on deep learning. Current efforts have been directed towards effective training of auto-encoder architectures with a large number of coding units. Here, we propose a learning algorithm for auto-encoders based on a rate-distortion objective that minimizes the mutual information between the inputs and the outputs of the auto-encoder subject to a fidelity constraint. The goal is to learn a representation that is minimally committed to the input data, but that is rich enough to reconstruct the inputs up to a certain level of distortion. Minimizing the mutual information acts as a regularization term, whereas the fidelity constraint can be understood as a risk functional in the conventional statistical learning setting. The proposed algorithm uses a recently introduced measure of entropy based on infinitely divisible matrices that avoids the plug-in estimation of densities. Experiments using over-complete bases show that the rate-distortion auto-encoders can learn a regularized input-output mapping in an implicit manner.
Submitted 16 April, 2014; v1 submitted 27 December, 2013;
originally announced December 2013.
-
Information Theoretic Learning with Infinitely Divisible Kernels
Authors:
Luis G. Sanchez Giraldo,
Jose C. Principe
Abstract:
In this paper, we develop a framework for information theoretic learning based on infinitely divisible matrices. We formulate an entropy-like functional on positive definite matrices based on Renyi's axiomatic definition of entropy and examine some key properties of this functional that lead to the concept of infinite divisibility. The proposed formulation avoids the plug-in estimation of density and brings along the representation power of reproducing kernel Hilbert spaces. As an application example, we derive a supervised metric learning algorithm using a matrix-based analogue to conditional entropy, achieving results comparable with the state of the art.
Submitted 4 June, 2013; v1 submitted 15 January, 2013;
originally announced January 2013.
-
Measures of Entropy from Data Using Infinitely Divisible Kernels
Authors:
Luis G. Sanchez Giraldo,
Murali Rao,
Jose C. Principe
Abstract:
Information theory provides principled ways to analyze different inference and learning problems such as hypothesis testing, clustering, dimensionality reduction, and classification, among others. However, the use of information theoretic quantities as test statistics, that is, as quantities obtained from empirical data, poses a challenging estimation problem that often leads to strong simplifications such as Gaussian models, or the use of plug-in density estimators that are restricted to certain representations of the data. In this paper, a framework to non-parametrically obtain measures of entropy directly from data using operators in reproducing kernel Hilbert spaces defined by infinitely divisible kernels is presented. The entropy functionals, which bear resemblance to quantum entropies, are defined on positive definite matrices and satisfy similar axioms to those of Renyi's definition of entropy. Convergence of the proposed estimators follows from concentration results on the difference between the ordered spectrum of the Gram matrices and the integral operators associated with the population quantities. In this way, capitalizing on both the axiomatic definition of entropy and on the representation power of positive definite kernels, the proposed measure of entropy avoids the estimation of the probability distribution underlying the data. Moreover, estimators of kernel-based conditional entropy and mutual information are also defined. Numerical experiments on independence tests compare favourably with the state of the art.
Submitted 1 September, 2014; v1 submitted 11 November, 2012;
originally announced November 2012.
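
The entropy functional on positive definite matrices mentioned in this abstract can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: it assumes the widely used form S_α(A) = log2(tr(A^α)) / (1 − α), a Gaussian kernel with a hypothetical bandwidth parameter `sigma`, and trace normalization of the Gram matrix. The names `gaussian_gram` and `matrix_renyi_entropy` are invented for this example.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gram matrix of a Gaussian kernel over the rows of x (assumed kernel choice)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)  # squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

def matrix_renyi_entropy(K, alpha=2.0):
    """Entropy in bits from the eigenvalues of the trace-normalized Gram matrix (alpha != 1)."""
    A = K / np.trace(K)
    eig = np.clip(np.linalg.eigvalsh(A), 0.0, None)  # guard against round-off negatives
    return np.log2(np.sum(eig ** alpha)) / (1.0 - alpha)
```

Two limiting cases give a quick check under these assumptions: n identical samples yield an entropy of 0, and n samples that are far apart relative to `sigma` yield log2(n), with no density estimate computed at any point.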