-
Understanding Encoder-Decoder Structures in Machine Learning Using Information Measures
Authors:
Jorge F. Silva,
Victor Faraggi,
Camilo Ramirez,
Alvaro Egana,
Eduardo Pavez
Abstract:
We present new results to model and understand the role of encoder-decoder design in machine learning (ML) from an information-theoretic angle. We use two main information concepts, information sufficiency (IS) and mutual information loss (MIL), to represent predictive structures in machine learning. Our first main result provides a functional expression that characterizes the class of probabilistic models consistent with an IS encoder-decoder latent predictive structure. This result formally justifies the encoder-decoder forward stages many modern ML architectures adopt to learn latent (compressed) representations for classification. To illustrate IS as a realistic and relevant model assumption, we revisit some known ML concepts and present some interesting new examples: invariant, robust, sparse, and digital models. Furthermore, our IS characterization allows us to tackle the fundamental question of how much performance (predictive expressiveness) could be lost, using the cross-entropy risk, when a given encoder-decoder architecture is adopted in a learning setting. Here, our second main result shows that a mutual information loss quantifies the lack of expressiveness attributed to the choice of a (biased) encoder-decoder ML design. Finally, we address the problem of universal cross-entropy learning with an encoder-decoder design, where necessary and sufficient conditions are established to meet this requirement. In all these results, Shannon's information measures offer new interpretations and explanations for representation learning.
Submitted 30 May, 2024;
originally announced May 2024.
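To make the mutual information loss mentioned in the abstract above concrete, the following is a minimal sketch for finite alphabets. It assumes a deterministic encoder U = encoder[x] and reads the loss as I(X;Y) - I(U;Y); this is one natural reading and not necessarily the paper's exact definition. The function names (`mutual_information`, `encode_joint`, `mutual_information_loss`) are hypothetical.

```python
import numpy as np


def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint pmf given as a 2-D array."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal of the row variable
    py = joint.sum(axis=0, keepdims=True)  # marginal of the column variable
    mask = joint > 0  # skip zero cells, where p * log(p) is taken as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))


def encode_joint(joint_xy, encoder, num_codes):
    """Joint pmf of (U, Y), where U = encoder[x] is a deterministic map of X."""
    joint_xy = np.asarray(joint_xy, dtype=float)
    joint_uy = np.zeros((num_codes, joint_xy.shape[1]))
    for x, u in enumerate(encoder):
        joint_uy[u] += joint_xy[x]  # merge all x that share the same code u
    return joint_uy


def mutual_information_loss(joint_xy, encoder, num_codes):
    """Assumed definition: I(X;Y) - I(U;Y), non-negative by data processing."""
    joint_uy = encode_joint(joint_xy, encoder, num_codes)
    return mutual_information(joint_xy) - mutual_information(joint_uy)
```

With a joint pmf in which symbols 0, 1 and symbols 2, 3 share the same conditional distribution of Y, the encoder `[0, 0, 1, 1]` gives a loss of zero up to rounding (it is information sufficient in this toy sense), while the encoder `[0, 1, 0, 1]` discards all information about Y.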
-
Gaussian process deconvolution
Authors:
Felipe Tobar,
Arnaud Robert,
Jorge F. Silva
Abstract:
Let us consider the deconvolution problem, that is, to recover a latent source $x(\cdot)$ from the observations $\mathbf{y} = [y_1,\ldots,y_N]$ of a convolution process $y = x\star h + \eta$, where $\eta$ is an additive noise, the observations in $\mathbf{y}$ might have missing parts with respect to $y$, and the filter $h$ could be unknown. We propose a novel strategy to address this task when $x$ is a continuous-time signal: we adopt a Gaussian process (GP) prior on the source $x$, which allows for closed-form Bayesian nonparametric deconvolution. We first analyse the direct model to establish the conditions under which the model is well defined. Then, we turn to the inverse problem, where we study i) some necessary conditions under which Bayesian deconvolution is feasible, and ii) to what extent the filter $h$ can be learnt from data or approximated for the blind deconvolution case. The proposed approach, termed Gaussian process deconvolution (GPDC), is compared to other deconvolution methods conceptually, via illustrative examples, and using real-world datasets.
Submitted 8 May, 2023; v1 submitted 8 May, 2023;
originally announced May 2023.
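The closed-form posterior referred to in the abstract above can be illustrated with a discretized sketch. This is not the authors' GPDC implementation: it assumes a uniform time grid, a known filter, a squared-exponential prior covariance, and white Gaussian noise, and it approximates the convolution integral by a Riemann sum. All function names and parameters are hypothetical.

```python
import numpy as np


def rbf_kernel(t, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance matrix on the grid t (assumed prior on x)."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)


def convolution_matrix(t, filter_fn):
    """Riemann-sum discretization of (x * h)(t_i) = integral of h(t_i - s) x(s) ds."""
    dt = t[1] - t[0]  # assumes a uniform grid
    return filter_fn(t[:, None] - t[None, :]) * dt


def gp_deconvolve(y, obs_idx, K, H, noise_var):
    """Gaussian posterior of x given y = (H x)[obs_idx] + noise, with x ~ N(0, K)."""
    H_obs = H[obs_idx]  # keep only rows for observed samples (handles missing data)
    S = H_obs @ K @ H_obs.T + noise_var * np.eye(len(obs_idx))  # covariance of y
    G = K @ H_obs.T  # cross-covariance between x and the observed y
    mean = G @ np.linalg.solve(S, y)
    cov = K - G @ np.linalg.solve(S, G.T)
    return mean, cov
```

Passing only a subset of indices in `obs_idx` mimics observations with missing parts; the posterior is still defined on the full grid, with larger variance where data are absent.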
-
Studying the Interplay between Information Loss and Operation Loss in Representations for Classification
Authors:
Jorge F. Silva,
Felipe Tobar,
Mario Vicuña,
Felipe Cordova
Abstract:
Information-theoretic measures have been widely adopted in the design of features for learning and decision problems. Inspired by this, we look at the relationship between i) a weak form of information loss in the Shannon sense and ii) the operation loss in the minimum probability of error (MPE) sense when considering a family of lossy continuous representations (features) of a continuous observation. We present several results that shed light on this interplay. Our first result offers a lower bound on a weak form of information loss as a function of its respective operation loss when adopting a discrete lossy representation (quantization) instead of the original raw observation. From this, our main result shows that a specific form of vanishing information loss (a weak notion of asymptotic informational sufficiency) implies a vanishing MPE loss (or asymptotic operational sufficiency) when considering a general family of lossy continuous representations. Our theoretical findings support the observation that the selection of feature representations that attempt to capture informational sufficiency is appropriate for learning, but this selection is a rather conservative design principle if the intended goal is achieving MPE in classification. Supporting this last point, and under some structural conditions, we show that it is possible to adopt an alternative notion of informational sufficiency (strictly weaker than pure sufficiency in the mutual information sense) to achieve operational sufficiency in learning.
Submitted 30 December, 2021;
originally announced December 2021.
-
Data-Driven Representations for Testing Independence: Modeling, Analysis and Connection with Mutual Information Estimation
Authors:
Mauricio E. Gonzalez,
Jorge F. Silva,
Miguel Videla,
Marcos E. Orchard
Abstract:
This work addresses testing the independence of two continuous and finite-dimensional random variables from the design of a data-driven partition. The empirical log-likelihood statistic is adopted to approximate the sufficient statistics of an oracle test against independence (that knows the two hypotheses). It is shown that approximating the sufficient statistics of the oracle test offers a learning criterion for designing a data-driven partition that connects with the problem of mutual information estimation. Applying these ideas in the context of a data-dependent tree-structured partition (TSP), we derive conditions on the TSP's parameters to achieve a strongly consistent distribution-free test of independence over the family of probabilities equipped with a density. Complementing this result, we present finite-length results that show our TSP scheme's capacity to detect the scenario of independence structurally with the data-driven partition as well as new sampling complexity bounds for this detection. Finally, some experimental analyses provide evidence regarding our scheme's advantage for testing independence compared with some strategies that do not use data-driven representations.
Submitted 26 October, 2021;
originally announced October 2021.
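The link between a data-driven partition, the empirical log-likelihood statistic, and mutual information estimation described in the abstract above can be sketched for two scalar variables. The sketch swaps in a simpler partition than the paper's tree-structured partition: a product of per-coordinate quantile bins, which is data-driven but not tree-structured. The function names and the default of 8 bins are assumptions.

```python
import numpy as np


def quantile_bin_edges(samples, num_bins):
    """Interior bin edges placed at empirical quantiles (a data-driven partition)."""
    qs = np.linspace(0.0, 1.0, num_bins + 1)[1:-1]
    return np.quantile(samples, qs)


def plugin_mutual_information(x, y, num_bins=8):
    """Plug-in mutual information estimate (nats) on a product quantile partition.

    This equals the empirical log-likelihood ratio of the binned joint
    distribution against the product of its binned marginals, divided by n.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ix = np.searchsorted(quantile_bin_edges(x, num_bins), x, side="right")
    iy = np.searchsorted(quantile_bin_edges(y, num_bins), y, side="right")
    counts = np.zeros((num_bins, num_bins))
    np.add.at(counts, (ix, iy), 1.0)  # histogram over partition cells
    joint = counts / counts.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0  # empty cells contribute nothing
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))
```

A test against independence would compare this statistic with a threshold that shrinks with the sample size; the choice of threshold is where the paper's consistency conditions would enter, and it is not reproduced here.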
-
Compressibility Analysis of Asymptotically Mean Stationary Processes
Authors:
Jorge F. Silva
Abstract:
This work provides new results for the analysis of random sequences in terms of $\ell_p$-compressibility. The results characterize the degree to which a random sequence can be approximated by its best $k$-sparse version under different rates of significant coefficients (compressibility analysis). In particular, the notion of strong $\ell_p$-characterization is introduced to denote a random sequence that has a well-defined asymptotic limit (sample-wise) of its best $k$-term approximation error when a fixed rate of significant coefficients is considered (fixed-rate analysis). The main theorem of this work shows that the rich family of asymptotically mean stationary (AMS) processes has a strong $\ell_p$-characterization. Furthermore, we present results that characterize and analyze the $\ell_p$-approximation error function for this family of processes. Adding ergodicity in the analysis of AMS processes, we introduce a theorem demonstrating that the approximation error function is constant and determined in closed form by the stationary mean of the process. Our results and analyses contribute to the theory and understanding of discrete-time sparse processes and, on the technical side, confirm how instrumental the point-wise ergodic theorem is to determine the compressibility expression of discrete-time processes even when stationarity and ergodicity assumptions are relaxed.
Submitted 8 July, 2021;
originally announced July 2021.
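The best $k$-term approximation error at a fixed rate, as described in the abstract above, can be computed for a finite sequence as follows. This is a minimal sketch: the normalization by the full $\ell_p$ norm and the rounding rule k = floor(rate * n) are assumptions and may differ from the paper's definitions. The function names are hypothetical.

```python
import numpy as np


def best_k_term_error(x, k, p=2.0):
    """l_p norm of what remains after keeping the k largest-magnitude entries of x."""
    mags = np.sort(np.abs(np.asarray(x, dtype=float)))  # ascending magnitudes
    tail = mags[: max(len(mags) - k, 0)]  # the n - k smallest entries are discarded
    return float(np.sum(tail ** p) ** (1.0 / p))


def relative_error_at_rate(x, rate, p=2.0):
    """Relative best k-term error with k = floor(rate * n); assumes x is not all zero."""
    x = np.asarray(x, dtype=float)
    k = int(np.floor(rate * len(x)))
    norm = float(np.sum(np.abs(x) ** p) ** (1.0 / p))
    return best_k_term_error(x, k, p) / norm
```

Evaluating `relative_error_at_rate` on growing prefixes of one realization, at a fixed rate, gives an empirical view of the sample-wise limit that the strong $\ell_p$-characterization is about.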