Search | arXiv e-print repository

doi 10.1016/j.physrep.2022.03.001

Quantifying Relevance in Learning and Inference

Abstract: Learning is a distinctive feature of intelligent behaviour. High-throughput experimental data and Big Data promise to open new windows on complex systems such as cells, the brain or our societies. Yet, the puzzling success of Artificial Intelligence and Machine Learning shows that we still have a poor conceptual understanding of learning. These applications push statistical inference into uncharted territories where data is high-dimensional and scarce, and prior information on "true" models is scant if not totally absent. Here we review recent progress on understanding learning, based on the notion of "relevance". The relevance, as we define it here, quantifies the amount of information that a dataset or the internal representation of a learning machine contains on the generative model of the data. This allows us to define maximally informative samples, on one hand, and optimal learning machines on the other. These are ideal limits of samples and of machines, that contain the maximal amount of information about the unknown generative process, at a given resolution (or level of compression). Both ideal limits exhibit critical features in the statistical sense: Maximally informative samples are characterised by a power-law frequency distribution (statistical criticality) and optimal learning machines by an anomalously large susceptibility. The trade-off between resolution (i.e. compression) and relevance distinguishes the regime of noisy representations from that of lossy compression. These are separated by a special point characterised by Zipf's law statistics. This identifies samples obeying Zipf's law as the most compressed loss-less representations that are optimal in the sense of maximal relevance. Criticality in optimal learning machines manifests in an exponential degeneracy of energy levels, that leads to unusual thermodynamic properties.

Submitted 1 February, 2022; originally announced February 2022.

Comments: review article, 63 pages, 14 figures

arXiv:2103.15917 [pdf, other]

Restricted Boltzmann Machines as Models of Interacting Variables

Authors: Nicola Bulso, Yasser Roudi

Abstract: We study the type of distributions that Restricted Boltzmann Machines (RBMs) with different activation functions can express by investigating the effect of the activation function of the hidden nodes on the marginal distribution they impose on observed binary nodes. We report an exact expression for these marginals in the form of a model of interacting binary variables with the explicit form of the interactions depending on the hidden node activation function. We study the properties of these interactions in detail and evaluate how the accuracy with which the RBM approximates distributions over binary variables depends on the hidden node activation function and on the number of hidden nodes. When the inferred RBM parameters are weak, an intuitive pattern is found for the expression of the interaction terms which reduces substantially the differences across activation functions. We show that the weak parameter approximation is a good approximation for different RBMs trained on the MNIST dataset. Interestingly, in these cases, the mapping reveals that the inferred models are essentially low order interaction models.

Submitted 29 March, 2021; originally announced March 2021.

Comments: Supplemental material is available as ancillary file and can be downloaded from a link on the right

arXiv:1903.00386 [pdf, other]

On the complexity of logistic regression models

Authors: Nicola Bulso, Matteo Marsili, Yasser Roudi

Abstract: We investigate the complexity of logistic regression models which is defined by counting the number of indistinguishable distributions that the model can represent (Balasubramanian, 1997). We find that the complexity of logistic models with binary inputs does not only depend on the number of parameters but also on the distribution of inputs in a non-trivial way which standard treatments of complexity do not address. In particular, we observe that correlations among inputs induce effective dependencies among parameters thus constraining the model and, consequently, reducing its complexity. We derive simple relations for the upper and lower bounds of the complexity. Furthermore, we show analytically that, defining the model parameters on a finite support rather than the entire axis, decreases the complexity in a manner that critically depends on the size of the domain. Based on our findings, we propose a novel model selection criterion which takes into account the entropy of the input distribution. We test our proposal on the problem of selecting the input variables of a logistic regression model in a Bayesian Model Selection framework. In our numerical tests, we find that, while the reconstruction errors of standard model selection approaches (AIC, BIC, $\ell_1$ regularization) strongly depend on the sparsity of the ground truth, the reconstruction error of our method is always close to the minimum in all conditions of sparsity, data size and strength of input correlations. Finally, we observe that, when considering categorical instead of binary inputs, in a simple and mathematically tractable case, the contribution of the alphabet size to the complexity is very small compared to that of parameter space dimension. We further explore the issue by analysing the dataset of the "13 keys to the White House" which is a method for forecasting the outcomes of US presidential elections.

Submitted 1 March, 2019; originally announced March 2019.

Comments: 29 pages, 6 figures, The supplementary material is an ancillary file and can be downloaded from a link on the right

arXiv:1809.00652 [pdf, other]

doi 10.3390/e20100755

Minimum Description Length codes are critical

Authors: Ryan John Cubero, Matteo Marsili, Yasser Roudi

Abstract: In the Minimum Description Length (MDL) principle, learning from the data is equivalent to an optimal coding problem. We show that the codes that achieve optimal compression in MDL are critical in a very precise sense. First, when they are taken as generative models of samples, they generate samples with broad empirical distributions and with a high value of the relevance, defined as the entropy of the empirical frequencies. These results are derived for different statistical models (Dirichlet model, independent and pairwise dependent spin models, and restricted Boltzmann machines). Second, MDL codes sit precisely at a second order phase transition point where the symmetry between the sampled outcomes is spontaneously broken. The order parameter controlling the phase transition is the coding cost of the samples. The phase transition is a manifestation of the optimality of MDL codes, and it arises because codes that achieve a higher compression do not exist. These results suggest a clear interpretation of the widespread occurrence of statistical criticality as a characterization of samples which are maximally informative on the underlying generative process.

Submitted 2 October, 2018; v1 submitted 3 September, 2018; originally announced September 2018.

Comments: 23 pages, 5 figures; Corrected the author name, revised Section 2.2 (Large Deviations of the Universal Codes Exhibit Phase Transitions), corrected Eq. (89)

Journal ref: Entropy 2018, 20(10)

arXiv:1607.08379 [pdf, other]

doi 10.1088/1751-8113/49/43/434003

Variational perturbation and extended Plefka approaches to dynamics on random networks: the case of the kinetic Ising model

Authors: Ludovica Bachschmid-Romano, Claudia Battistin, Manfred Opper, Yasser Roudi

Abstract: We describe and analyze some novel approaches for studying the dynamics of Ising spin glass models. We first briefly consider the variational approach based on minimizing the Kullback-Leibler divergence between independent trajectories and the real ones and note that this approach only coincides with the mean field equations from the saddle point approximation to the generating functional when the dynamics is defined through a logistic link function, which is the case for the kinetic Ising model with parallel update. We then spend the rest of the paper developing two ways of going beyond the saddle point approximation to the generating functional. In the first one, we develop a variational perturbative approximation to the generating functional by expanding the action around a quadratic function of the local fields and conjugate local fields whose parameters are optimized. We derive analytical expressions for the optimal parameters and show that when the optimization is suitably restricted, we recover the mean field equations that are exact for the fully asymmetric random couplings (Mézard and Sakellariou, 2011). However, without this restriction the results are different. We also describe an extended Plefka expansion in which in addition to the magnetization, we also fix the correlation and response functions. Finally, we numerically study the performance of these approximations for Sherrington-Kirkpatrick type couplings for various coupling strengths, degrees of coupling symmetry and external fields. We show that the dynamical equations derived from the extended Plefka expansion outperform the others in all regimes, although it is computationally more demanding. The unconstrained variational approach does not perform well in the small coupling regime, while it approaches dynamical TAP equations of (Roudi and Hertz, 2011) for strong couplings.

Submitted 28 July, 2016; originally announced July 2016.

arXiv:1603.00952 [pdf, other]

doi 10.1088/1742-5468/2016/09/093404

Sparse model selection in the highly under-sampled regime

Authors: Nicola Bulso, Matteo Marsili, Yasser Roudi

Abstract: We propose a method for recovering the structure of a sparse undirected graphical model when very few samples are available. The method decides about the presence or absence of bonds between pairs of variables by considering one pair at a time and using a closed form formula, analytically derived by calculating the posterior probability for every possible model explaining a two body system using Jeffreys prior. The approach does not rely on the optimisation of any cost functions and consequently is much faster than existing algorithms. Despite this time and computational advantage, numerical results show that for several sparse topologies the algorithm is comparable to the best existing algorithms, and is more accurate in the presence of hidden variables. We apply this approach to the analysis of US stock market data and to neural data, in order to show its efficiency in recovering robust statistical dependencies in real data with non stationary correlations in time and space.

Submitted 2 January, 2017; v1 submitted 2 March, 2016; originally announced March 2016.

Comments: 54 pages, 26 figures

Journal ref: J. Stat. Mech. (2016) 093404

arXiv:1506.00354 [pdf, other]

doi 10.1016/j.conb.2015.07.006

Learning with hidden variables

Authors: Yasser Roudi, Graham Taylor

Abstract: Learning and inferring features that generate sensory input is a task continuously performed by cortex. In recent years, novel algorithms and learning rules have been proposed that allow neural network models to learn such features from natural images, written text, audio signals, etc. These networks usually involve deep architectures with many layers of hidden neurons. Here we review recent advancements in this area emphasizing, amongst other things, the processing of dynamical inputs by networks with hidden nodes and the role of single neuron models. These points and the questions they raise can provide conceptual advancements in understanding of learning in the cortex and the relationship between machine learning approaches to learning with hidden nodes and those in cortical circuits.

Submitted 24 July, 2015; v1 submitted 1 June, 2015; originally announced June 2015.

Comments: revised version accepted in Current Opinion in Neurobiology

Journal ref: Current Opinion in Neurobiology (2015), 35: 110-118

arXiv:1211.3671 [pdf, ps, other]

L$_1$ Regularization for Reconstruction of a non-equilibrium Ising Model

Authors: Hong-Li Zeng, John Hertz, Yasser Roudi

Abstract: The couplings in a sparse asymmetric, asynchronous Ising network are reconstructed using an exact learning algorithm. L$_1$ regularization is used to remove the spurious weak connections that would otherwise be found by simply minimizing the minus likelihood of a finite data set. In order to see how L$_1$ regularization works in detail, we perform the calculation in several ways including (1) by iterative minimization of a cost function equal to minus the log likelihood of the data plus an L$_1$ penalty term, and (2) an approximate scheme based on a quadratic expansion of the cost function around its minimum. In these schemes, we track how connections are pruned as the strength of the L$_1$ penalty is increased from zero to large values. The performance of the methods for various coupling strengths is quantified using ROC curves.

Submitted 15 November, 2012; originally announced November 2012.

Showing 1–8 of 8 results for author: Roudi, Y