-
Restoring balance: principled under/oversampling of data for optimal classification
Authors:
Emanuele Loffredo,
Mauro Pastore,
Simona Cocco,
Rémi Monasson
Abstract:
Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered. We show that mixed strategies involving under and oversampling of data lead to performance improvement. Through numerical experiments, we show the relevance of our theoretical predictions on real datasets, on deeper architectures and with sampling strategies based on unsupervised probabilistic models.
Submitted 15 May, 2024;
originally announced May 2024.
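The mixed under/oversampling idea mentioned in the abstract above can be illustrated with a minimal sketch. It assumes plain random resampling (dropping majority examples, duplicating minority examples) toward a common per-class target; the function name `resample_mixed` and the parameter `target_per_class` are hypothetical, and this is not the specific strategy or the optimal ratio derived in the paper.

```python
import random
from collections import Counter


def resample_mixed(samples, labels, target_per_class, seed=0):
    """Randomly under- or oversample every class to `target_per_class` examples.

    Assumption: simple random resampling. Classes larger than the target are
    undersampled without replacement; smaller classes are oversampled by
    appending randomly chosen duplicates.
    """
    rng = random.Random(seed)

    # Group the examples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    new_samples, new_labels = [], []
    for y, xs in by_class.items():
        if len(xs) >= target_per_class:
            # Undersampling: keep a random subset of the abundant class.
            chosen = rng.sample(xs, target_per_class)
        else:
            # Oversampling: keep everything and add random duplicates.
            chosen = xs + rng.choices(xs, k=target_per_class - len(xs))
        new_samples.extend(chosen)
        new_labels.extend([y] * target_per_class)
    return new_samples, new_labels


# Toy imbalanced dataset: 90 majority (label 0) and 10 minority (label 1) points.
samples = list(range(100))
labels = [0] * 90 + [1] * 10
balanced_x, balanced_y = resample_mixed(samples, labels, target_per_class=30)
print(Counter(balanced_y))
```

With a target of 30, the majority class is cut from 90 to 30 examples and the minority class is grown from 10 to 30, so both sampling directions are used at once. How the target should depend on the data statistics is exactly the question the paper addresses analytically; the value 30 here is arbitrary.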
-
Unlearning regularization for Boltzmann Machines
Authors:
Enrico Ventura,
Simona Cocco,
Rémi Monasson,
Francesco Zamponi
Abstract:
Boltzmann Machines (BMs) are graphical models with interconnected binary units, employed for the unsupervised modeling of data distributions. When trained on real data, BMs show the tendency to behave like critical systems, displaying a high susceptibility of the model under a small rescaling of the inferred parameters. This behaviour is not convenient for the purpose of generating data, because it slows down the sampling process and induces the model to overfit the training data. In this study, we introduce a regularization method for BMs to improve the robustness of the model under rescaling of the parameters. The new technique shares formal similarities with the unlearning algorithm, an iterative procedure used to improve memory associativity in Hopfield-like neural networks. We test our unlearning regularization on synthetic data generated by two simple models, the Curie-Weiss ferromagnetic model and the Sherrington-Kirkpatrick spin glass model. We show that it outperforms $L_p$-norm schemes and discuss the role of parameter initialization. Finally, the method is applied to learn the activity of real neuronal cells, confirming its efficacy at shifting the inferred model away from criticality and emerging as a powerful candidate for practical scientific applications.
Submitted 15 May, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Information content in continuous attractor neural networks is preserved in the presence of moderate disordered background connectivity
Authors:
Tobias Kühn,
Rémi Monasson
Abstract:
Continuous attractor neural networks (CANN) form an appealing conceptual model for the storage of information in the brain. However, a drawback of CANN is that they require finely tuned interactions. Here we study the effect of quenched noise in the interactions on the coding of positional information within CANN. Using the replica method, we compute the Fisher information for a network with position-dependent input and recurrent connections composed of a short-range (in space) and a disordered component. We find that the loss in positional information is small for not too large disorder strength, indicating that CANN have a regime in which the advantageous effects of local connectivity on information storage outweigh the detrimental ones. Furthermore, a substantial part of this information can be extracted with a simple linear readout.
Submitted 3 January, 2024; v1 submitted 26 April, 2023;
originally announced April 2023.
-
Transition paths in Potts-like energy landscapes: general properties and application to protein sequence models
Authors:
Eugenio Mauri,
Simona Cocco,
Rémi Monasson
Abstract:
We study transition paths in energy landscapes over multi-categorical Potts configurations using the mean-field approach introduced by Mauri et al., Phys. Rev. Lett. 130, 158402 (2023). Paths interpolate between two fixed configurations or are anchored at one extremity only. We characterize the properties of 'good' transition paths realizing a trade-off between exploring low-energy regions in the landscape and being not too long, such as their entropy or the probability of escape from a region of the landscape. We unveil the existence of a phase transition separating a regime in which paths are stretched in between their anchors, from another regime, where paths can explore the energy landscape more globally to minimize the energy. This phase transition is first illustrated and studied in detail on a mathematically tractable Hopfield-Potts toy model, then studied in energy landscapes inferred from protein-sequence data.
Submitted 26 June, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
Evolutionary Dynamics of a Lattice Dimer: a Toy Model for Stability vs. Affinity Trade-offs in Proteins
Authors:
Emanuele Loffredo,
Elisabetta Vesconi,
Rostam Razban,
Orit Peleg,
Eugene Shakhnovich,
Simona Cocco,
Rémi Monasson
Abstract:
Understanding how a stressor applied on a biological system shapes its evolution is key to achieving targeted evolutionary control. Here we present a toy model of two interacting lattice proteins to quantify the response to the selective pressure defined by the binding energy. We generate sequence data of proteins and study how the sequence and structural properties of dimers are affected by the applied selective pressure, both during the evolutionary process and in the stationary regime. In particular we show that internal contacts of native structures lose strength, while inter-structure contacts are strengthened due to the folding-binding competition. We discuss how dimerization is achieved through enhanced mutability on the interacting faces, and how the designability of each native structure changes upon introduction of the stressor.
Submitted 5 December, 2023; v1 submitted 22 March, 2023;
originally announced March 2023.
-
Computational protein design with evolutionary-based and physics-inspired modeling: current and future synergies
Authors:
Cyril Malbranke,
David Bikard,
Simona Cocco,
Rémi Monasson,
Jérôme Tubiana
Abstract:
Computational protein design facilitates discovery of novel proteins with prescribed structure and functionality. Exciting designs were recently reported using novel data-driven methodologies that can be roughly divided into two categories: evolutionary-based and physics-inspired approaches. The former infer characteristic sequence features shared by sets of evolutionary-related proteins, such as conserved or coevolving positions, and recombine them to generate candidates with similar structure and function. The latter estimate key biochemical properties such as structure free energy, conformational entropy or binding affinities using machine learning surrogates, and optimize them to yield improved designs. Here, we review recent progress along both tracks, discuss their strengths and weaknesses, and highlight opportunities for synergistic approaches.
Submitted 7 February, 2023; v1 submitted 29 August, 2022;
originally announced August 2022.
-
Disentangling representations in Restricted Boltzmann Machines without adversaries
Authors:
Jorge Fernandez-de-Cossio-Diaz,
Simona Cocco,
Remi Monasson
Abstract:
A goal of unsupervised machine learning is to build representations of complex high-dimensional data, with simple relations to their properties. Such disentangled representations make it easier to interpret the significant latent factors of variation in the data, as well as to generate new data with desirable features. Methods for disentangling representations often rely on an adversarial scheme, in which representations are tuned to prevent discriminators from being able to reconstruct information about the data properties (labels). Unfortunately, adversarial training is generally difficult to implement in practice. Here we propose a simple, effective way of disentangling representations without any need to train adversarial discriminators, and apply our approach to Restricted Boltzmann Machines (RBM), one of the simplest representation-based generative models. Our approach relies on the introduction of adequate constraints on the weights during training, which allows us to concentrate information about labels on a small subset of latent variables. The effectiveness of the approach is illustrated with four examples: the CelebA dataset of facial images, the two-dimensional Ising model, the MNIST dataset of handwritten digits, and the taxonomy of protein families. In addition, we show how our framework allows for analytically computing the cost, in terms of log-likelihood of the data, associated with the disentanglement of their representations.
Submitted 8 March, 2023; v1 submitted 23 June, 2022;
originally announced June 2022.
-
Mutational paths with sequence-based models of proteins: from sampling to mean-field characterisation
Authors:
Eugenio Mauri,
Simona Cocco,
Rémi Monasson
Abstract:
Identifying and characterizing mutational paths is an important issue in evolutionary biology and in bioengineering. We here introduce a generic description of mutational paths in terms of the goodness of sequences and of the mutational dynamics (how sequences change) along the path. We first propose an algorithm to sample mutational paths, which we benchmark on exactly solvable models of proteins in silico, and apply to data-driven models of natural proteins learned from sequence data with Restricted Boltzmann Machines. We then use mean-field theory to characterize the properties of mutational paths for different mutational dynamics of interest, and show how it can be used to extend Kimura's estimate of evolutionary distances to sequence-based epistatic models of selection.
Submitted 27 March, 2023; v1 submitted 22 April, 2022;
originally announced April 2022.
-
Optimal regularizations for data generation with probabilistic graphical models
Authors:
Arnaud Fanthomme,
F Rizzato,
S Cocco,
R Monasson
Abstract:
Understanding the role of regularization is a central question in Statistical Inference. Empirically, well-chosen regularization schemes often dramatically improve the quality of the inferred models by avoiding overfitting of the training data. We consider here the particular case of $L_2$ and $L_1$ regularizations in the Maximum A Posteriori (MAP) inference of generative pairwise graphical models. Based on analytical calculations on Gaussian multivariate distributions and numerical experiments on Gaussian and Potts models we study the likelihoods of the training, test, and 'generated data' (with the inferred models) sets as functions of the regularization strengths. We show in particular that, at its maximum, the test likelihood and the 'generated' likelihood, which quantifies the quality of the generated samples, have remarkably close values. The optimal value for the regularization strength is found to be approximately equal to the inverse sum of the squared couplings incoming on sites on the underlying network of interactions. Our results seem largely independent of the structure of the true underlying interactions that generated the data, of the regularization scheme considered, and are valid when small fluctuations of the posterior distribution around the MAP estimator are taken into account. Connections with empirical works on protein models learned from homologous sequences are discussed.
Submitted 2 December, 2021;
originally announced December 2021.
-
Barriers and Dynamical Paths in Alternating Gibbs Sampling of Restricted Boltzmann Machines
Authors:
Clément Roussel,
Simona Cocco,
Rémi Monasson
Abstract:
Restricted Boltzmann Machines (RBM) are bi-layer neural networks used for the unsupervised learning of model distributions from data. The bipartite architecture of RBM naturally defines an elegant sampling procedure, called Alternating Gibbs Sampling (AGS), where the configurations of the latent-variable layer are sampled conditional to the data-variable layer, and vice versa. We study here the performance of AGS on several analytically tractable models borrowed from statistical mechanics. We show that standard AGS is not more efficient than classical Metropolis-Hastings (MH) sampling of the effective energy landscape defined on the data layer. However, RBM can identify meaningful representations of training data in their latent space. Furthermore, using these representations and combining Gibbs sampling with the MH algorithm in the latent space can enhance the sampling performance of the RBM when the hidden units encode weakly dependent features of the data. We illustrate our findings on three datasets: Bars and Stripes and MNIST, well known in machine learning, and the so-called Lattice Proteins, introduced in theoretical biology to study the sequence-to-structure mapping in proteins.
Submitted 21 October, 2021; v1 submitted 13 July, 2021;
originally announced July 2021.
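The Alternating Gibbs Sampling procedure described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: binary 0/1 units and the standard RBM energy $E(v,h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{ij} v_i W_{ij} h_j$. The function names and the toy parameters `W`, `b`, `c` are hypothetical and are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_hidden(v, W, c, rng):
    """Sample every hidden unit given the visible layer.

    For a binary RBM, P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i W[i][j]).
    """
    return [
        1 if rng.random() < sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v)))) else 0
        for j in range(len(c))
    ]


def sample_visible(h, W, b, rng):
    """Sample every visible unit given the hidden layer.

    For a binary RBM, P(v_i = 1 | h) = sigmoid(b_i + sum_j W[i][j] h_j).
    """
    return [
        1 if rng.random() < sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h)))) else 0
        for i in range(len(b))
    ]


def alternating_gibbs(v0, W, b, c, n_steps, seed=0):
    """Run n_steps of alternating Gibbs sampling starting from visible state v0."""
    rng = random.Random(seed)
    v = list(v0)
    h = sample_hidden(v, W, c, rng)
    for _ in range(n_steps):
        v = sample_visible(h, W, b, rng)  # visible layer given hidden layer
        h = sample_hidden(v, W, c, rng)   # hidden layer given visible layer
    return v, h


# Toy RBM with 3 visible and 2 hidden units (arbitrary illustrative parameters).
W = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
b = [0.0, 0.1, -0.1]
c = [0.2, -0.2]
v, h = alternating_gibbs([1, 0, 1], W, b, c, n_steps=10)
print(v, h)
```

Because the graph is bipartite, all units within a layer are conditionally independent given the other layer, which is why each layer can be resampled in a single pass.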
-
Low-Dimensional Manifolds Support Multiplexed Integrations in Recurrent Neural Networks
Authors:
Arnaud Fanthomme,
Rémi Monasson
Abstract:
We study the learning dynamics and the representations emerging in Recurrent Neural Networks trained to integrate one or multiple temporal signals. Combining analytical and numerical investigations, we characterize the conditions under which a RNN with n neurons learns to integrate D(n) scalar signals of arbitrary duration. We show, both for linear and ReLU neurons, that its internal state lives close to a D-dimensional manifold, whose shape is related to the activation function. Each neuron therefore carries, to various degrees, information about the value of all integrals. We discuss the deep analogy between our results and the concept of mixed selectivity forged by computational neuroscientists to interpret cortical recordings.
Submitted 20 November, 2020;
originally announced November 2020.
-
Survival probability and size of lineages in antibody affinity maturation
Authors:
Marco Molari,
Rémi Monasson,
Simona Cocco
Abstract:
Affinity Maturation (AM) is the process through which the immune system is able to develop potent antibodies against new pathogens it encounters, and is at the base of the efficacy of vaccines. At its core, AM is analogous to a Darwinian evolutionary process, where B-cells mutate and are selected on the basis of their affinity for an Antigen (Ag), and Ag availability tunes the selective pressure. When this selective pressure is high, the number of B-cells might quickly decrease and the population might risk extinction in what is known as a population bottleneck. Here we study the probability for a B-cell lineage to survive this bottleneck scenario as a function of the progenitor affinity for the Ag. Using recursive relations and probability generating functions, we derive expressions for the average extinction time and progeny size for lineages that go extinct. We then extend our results to the full population, both in the absence and presence of competition for T-cell help, and quantify the population survival probability as a function of Ag concentration and initial population size. Our study suggests the population bottleneck phenomenology might represent a limit case in the space of biologically plausible maturation scenarios, whose characterization could help guide the process of vaccine development.
Submitted 10 May, 2021; v1 submitted 22 October, 2020;
originally announced October 2020.
-
Gaussian Closure Scheme in the Quasi-Linkage Equilibrium Regime of Evolving Genome Populations
Authors:
Eugenio Mauri,
Simona Cocco,
Rémi Monasson
Abstract:
Describing the evolution of a population of genomes evolving in a complex fitness landscape is generally very hard. We here introduce an approximate Gaussian closure scheme to characterize analytically the statistics of a genomic population in the so-called Quasi-Linkage Equilibrium (QLE) regime, applicable to generic values of the rates of mutation or recombination and fitness functions. The Gaussian approximation is illustrated on a short-range fitness landscape with two far away and competing maxima. It unveils the existence of a phase transition from a broad to a polarized distribution of genomes as the strength of epistatic couplings is increased, characterized by slow coarsening dynamics of competing allele domains. Results of the closure scheme are corroborated by numerical simulations.
Submitted 13 October, 2020;
originally announced October 2020.
-
Inferring epistasis from genomic data with comparable mutation and outcrossing rate
Authors:
Hong-Li Zeng,
Eugenio Mauri,
Vito Dichio,
Simona Cocco,
Remi Monasson,
Erik Aurell
Abstract:
We consider a population evolving due to mutation, selection and recombination, where selection includes single-locus terms (additive fitness) and two-loci terms (pairwise epistatic fitness). We further consider the problem of inferring fitness in the evolutionary dynamics from one or several snapshots of the distribution of genotypes in the population. In the recent literature this has been done by applying the Quasi-Linkage Equilibrium (QLE) regime first obtained by Kimura in the limit of high recombination. Here we show that the approach also works in the interesting regime where the effects of mutations are comparable to or larger than recombination. This leads to a modified main epistatic fitness inference formula where the rates of mutation and recombination occur together. We also derive this formula using a previously developed Gaussian closure that formally remains valid when recombination is absent. The findings are validated through numerical simulations.
Submitted 4 May, 2021; v1 submitted 30 June, 2020;
originally announced June 2020.
-
On the Spectrum of Multi-Space Euclidean Random Matrices
Authors:
Aldo Battista,
Remi Monasson
Abstract:
We consider the additive superimposition of an extensive number of independent Euclidean Random Matrices in the high-density regime. The resolvent is computed with techniques from free probability theory, as well as with the replica method of statistical physics of disordered systems. Results for the spectrum and eigenmodes are shown for a few applications relevant to computational neuroscience, and are corroborated by numerical simulations.
Submitted 15 May, 2020; v1 submitted 17 March, 2020;
originally announced March 2020.
-
'Place-cell' emergence and learning of invariant data with restricted Boltzmann machines: breaking and dynamical restoration of continuous symmetries in the weight space
Authors:
Moshir Harsh,
Jérôme Tubiana,
Simona Cocco,
Remi Monasson
Abstract:
Distributions of data or sensory stimuli often enjoy underlying invariances. How and to what extent those symmetries are captured by unsupervised learning methods is a relevant question in machine learning and in computational neuroscience. We study here, through a combination of numerical and analytical tools, the learning dynamics of Restricted Boltzmann Machines (RBM), a neural network paradigm for representation learning. As learning proceeds from a random configuration of the network weights, we show the existence of, and characterize a symmetry-breaking phenomenon, in which the latent variables acquire receptive fields focusing on limited parts of the invariant manifold supporting the data. The symmetry is restored at large learning times through the diffusion of the receptive field over the invariant manifold; hence, the RBM effectively spans a continuous attractor in the space of network weights. This symmetry-breaking phenomenon takes place only if the amount of data available for training exceeds some critical value, depending on the network size and the intensity of symmetry-induced correlations in the data; below this 'retarded-learning' threshold, the network weights are essentially noisy and overfit the data.
Submitted 30 December, 2019;
originally announced December 2019.
-
Capacity-resolution trade-off in the optimal learning of multiple low-dimensional manifolds by attractor neural networks
Authors:
Aldo Battista,
Rémi Monasson
Abstract:
Recurrent neural networks (RNN) are powerful tools to explain how attractors may emerge from noisy, high-dimensional dynamics. We study here how to learn the $\sim N^2$ pairwise interactions in a RNN with $N$ neurons to embed $L$ manifolds of dimension $D \ll N$. We show that the capacity, i.e. the maximal ratio $L/N$, decreases as $|\log \epsilon|^{-D}$, where $\epsilon$ is the error on the position encoded by the neural activity along each manifold. Hence, RNN are flexible memory devices capable of storing a large number of manifolds at high spatial resolution. Our results rely on a combination of analytical tools from statistical mechanics and random matrix theory, extending Gardner's classical theory of learning to the case of patterns with strong spatial correlations.
Submitted 13 January, 2020; v1 submitted 14 October, 2019;
originally announced October 2019.
-
Inference of compressed Potts graphical models
Authors:
Francesca Rizzato,
Alice Coucke,
Eleonora de Leonardis,
J. P. Barton,
Jérôme Tubiana,
Remi Monasson,
Simona Cocco
Abstract:
We consider the problem of inferring a graphical Potts model on a population of variables, with a non-uniform number of Potts colors (symbols) across variables. This inverse Potts problem generally involves the inference of a large number of parameters, often larger than the number of available data, and, hence, requires the introduction of regularization. We study here a double regularization scheme, in which the number of colors available to each variable is reduced, and interaction networks are made sparse. To achieve this color compression scheme, only Potts states with large empirical frequency (exceeding some threshold) are explicitly modeled on each site, while the others are grouped into a single state. We benchmark the performance of this mixed regularization approach with two inference algorithms, the Adaptive Cluster Expansion (ACE) and the PseudoLikelihood Maximization (PLM), on synthetic data obtained by sampling disordered Potts models on Erdős-Rényi random graphs. We show in particular that color compression does not affect the quality of reconstruction of the parameters corresponding to high-frequency symbols, while drastically reducing the number of the other parameters and thus the computational time. Our procedure is also applied to multi-sequence alignments of protein families, with similar results.
Submitted 3 January, 2020; v1 submitted 30 July, 2019;
originally announced July 2019.
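The color compression step described in the abstract above (keep, on each site, only the symbols whose empirical frequency exceeds a threshold and merge the rest into a single state) can be sketched as below. The function name `compress_colors`, the placeholder symbol `OTHER`, and the toy alignment are hypothetical; the sketch also assumes that `"*"` does not occur in the original alphabet. It covers only the preprocessing of the data, not the ACE or PLM inference.

```python
from collections import Counter

OTHER = "*"  # assumed placeholder for the merged low-frequency state


def compress_colors(alignment, threshold):
    """Site-wise color compression of a list of equal-length sequences.

    On each site, symbols with empirical frequency strictly above `threshold`
    are kept as explicit Potts states; all other symbols are replaced by OTHER.
    Returns the compressed sequences and the set of kept symbols per site.
    """
    n_seq = len(alignment)
    n_sites = len(alignment[0])

    kept = []
    for i in range(n_sites):
        counts = Counter(seq[i] for seq in alignment)
        kept.append({s for s, n in counts.items() if n / n_seq > threshold})

    compressed = [
        "".join(s if s in kept[i] else OTHER for i, s in enumerate(seq))
        for seq in alignment
    ]
    return compressed, kept


# Toy alignment with 4 sequences and 2 sites.
alignment = ["AC", "AC", "AD", "BE"]
compressed, kept = compress_colors(alignment, threshold=0.3)
print(compressed)  # ['AC', 'AC', 'A*', '**']
print(kept)
```

In the toy example, site 0 keeps only `A` (frequency 0.75) and site 1 keeps only `C` (frequency 0.5), so each site is reduced to two states: one explicit symbol plus the merged state.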
-
Learning Compositional Representations of Interacting Systems with Restricted Boltzmann Machines: Comparative Study of Lattice Proteins
Authors:
Jérôme Tubiana,
Simona Cocco,
Rémi Monasson
Abstract:
A Restricted Boltzmann Machine (RBM) is an unsupervised machine-learning bipartite graphical model that jointly learns a probability distribution over data and extracts their relevant statistical features. As such, RBM were recently proposed for characterizing the patterns of coevolution between amino acids in protein sequences and for designing new sequences. Here, we study how the nature of the features learned by RBM changes with its defining parameters, such as the dimensionality of the representations (size of the hidden layer) and the sparsity of the features. We show that for adequate values of these parameters, RBM operate in a so-called compositional phase in which visible configurations sampled from the RBM are obtained by recombining these features. We then compare the performance of RBM with other standard representation learning algorithms, including Principal or Independent Component Analysis, autoencoders (AE), variational auto-encoders (VAE), and their sparse variants. We show that RBM, due to the stochastic mapping between data configurations and representations, better capture the underlying interactions in the system and are significantly more robust with respect to sample size than deterministic methods such as PCA or ICA. In addition, this stochastic mapping is not prescribed a priori as in VAE, but learned from data, which allows RBM to show good performance even with shallow architectures. All numerical results are illustrated on synthetic lattice-protein data, that share similar statistical features with real protein sequences, and for which ground-truth interactions are known.
Submitted 18 February, 2019;
originally announced February 2019.
-
Learning protein constitutive motifs from sequence data
Authors:
Jérôme Tubiana,
Simona Cocco,
Rémi Monasson
Abstract:
Statistical analysis of evolutionary-related protein sequences provides insights about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to twenty protein families, and present detailed results for two short protein domains, Kunitz and WW, one long chaperone protein, Hsp70, and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (such as residue-residue tertiary contacts, extended secondary motifs ($α$-helix and $β$-sheet) and intrinsically disordered regions), to function (such as activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and turning up or down the different modes at will. Our work therefore shows that RBM are a versatile and practical tool to unveil and exploit the genotype-phenotype relationship for protein families.
Submitted 27 February, 2019; v1 submitted 23 March, 2018;
originally announced March 2018.
-
Adaptation of olfactory receptor abundances for efficient coding
Authors:
Tiberiu Tesileanu,
Simona Cocco,
Remi Monasson,
Vijay Balasubramanian
Abstract:
Olfactory receptor usage is highly heterogeneous, with some receptor types being orders of magnitude more abundant than others. We propose an explanation for this striking fact: the receptor distribution is tuned to maximally represent information about the olfactory environment in a regime of efficient coding that is sensitive to the global context of correlated sensor responses. This model predicts that in mammals, where olfactory sensory neurons are replaced regularly, receptor abundances should continuously adapt to odor statistics. Experimentally, increased exposure to odorants leads variously, but reproducibly, to increased, decreased, or unchanged abundances of different activated receptors. We demonstrate that this diversity of effects is required for efficient coding when sensors are broadly correlated, and provide an algorithm for predicting which olfactory receptors should increase or decrease in abundance following specific environmental changes. Finally, we give simple dynamical rules for neural birth and death processes that might underlie this adaptation.
Submitted 22 January, 2019; v1 submitted 28 January, 2018;
originally announced January 2018.
-
Statistical Physics and Representations in Real and Artificial Neural Networks
Authors:
Simona Cocco,
Rémi Monasson,
Lorenzo Posani,
Sophie Rosay,
Jérôme Tubiana
Abstract:
This document presents the material of two lectures on statistical physics and neural representations, delivered by one of us (R.M.) at the Fundamental Problems in Statistical Physics XIV summer school in July 2017. In a first part, we consider the neural representations of space (maps) in the hippocampus. We introduce an extension of the Hopfield model, able to store multiple spatial maps as continuous, finite-dimensional attractors. The phase diagram and dynamical properties of the model are analyzed. We then show how spatial representations can be dynamically decoded using an effective Ising model capturing the correlation structure in the neural data, and compare applications to data obtained from hippocampal multi-electrode recordings and by (sub)sampling our attractor model. In a second part, we focus on the problem of learning data representations in machine learning, in particular with artificial neural networks. We start by introducing data representations through some illustrations. We then analyze two important algorithms, Principal Component Analysis and Restricted Boltzmann Machines, with tools from statistical physics.
Submitted 7 September, 2017;
originally announced September 2017.
-
Innovation rather than improvement: a solvable high-dimensional model highlights the limitations of scalar fitness
Authors:
Mikhail Tikhonov,
Remi Monasson
Abstract:
Much of our understanding of ecological and evolutionary mechanisms derives from analysis of low-dimensional models: with few interacting species, or few axes defining "fitness". It is not always clear to what extent the intuition derived from low-dimensional models applies to the complex, high-dimensional reality. For instance, most naturally occurring microbial communities are strikingly diverse, harboring a large number of coexisting species, each of which contributes to shaping the environment of others. Understanding the eco-evolutionary interplay in these systems is an important challenge, and an exciting new domain for statistical physics. Recent work identified a promising new platform for investigating highly diverse ecosystems, based on the classic resource competition model of MacArthur. Here, we describe how the same analytical framework can be used to study evolutionary questions. Our analysis illustrates how, at high dimension, the intuition promoted by a one-dimensional (scalar) notion of fitness can become misleading. Specifically, while the low-dimensional picture emphasizes organism cost or efficiency, we exhibit a regime where cost becomes irrelevant for survival, and link this observation to generic properties of high-dimensional geometry.
Submitted 11 December, 2017; v1 submitted 17 August, 2017;
originally announced August 2017.
-
Inverse Statistical Physics of Protein Sequences: A Key Issues Review
Authors:
Simona Cocco,
Christoph Feinauer,
Matteo Figliuzzi,
Remi Monasson,
Martin Weigt
Abstract:
In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at an unprecedented pace. This provides large sets of so-called homologous, i.e., evolutionarily related, protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview of some biologically important questions, and how statistical-mechanics-inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.
Submitted 3 March, 2017;
originally announced March 2017.
-
Inference of principal components of noisy correlation matrices with prior information
Authors:
Rémi Monasson
Abstract:
The problem of inferring the top component of a noisy sample covariance matrix with prior information about the distribution of its entries is considered, in the framework of the spiked covariance model. Using the replica method of statistical physics, the computation of the overlap between the top components of the sample and population covariance matrices is formulated as an explicit optimization problem for any kind of entry-wise prior information. The approach is illustrated on the case of top components including large entries, and the corresponding phase diagram is shown. The calculation predicts that the maximal sampling noise level at which the recovery of the top population component remains possible is higher than its counterpart in the spiked covariance model with no prior information.
Submitted 19 December, 2016;
originally announced December 2016.
-
Emergence of Compositional Representations in Restricted Boltzmann Machines
Authors:
Jérôme Tubiana,
Rémi Monasson
Abstract:
Extracting automatically the complex set of features composing real high-dimensional data is crucial for achieving high performance in machine-learning tasks. Restricted Boltzmann Machines (RBM) are empirically known to be efficient for this purpose, and to be able to generate distributed and graded representations of the data. We characterize the structural conditions (sparsity of the weights, low effective temperature, nonlinearities in the activation functions of hidden units, and adaptation of fields maintaining the activity in the visible layer) allowing RBM to operate in such a compositional phase. Evidence is provided by the replica analysis of an adequate statistical ensemble of random RBMs and by RBM trained on the handwritten digits dataset MNIST.
Submitted 2 March, 2017; v1 submitted 21 November, 2016;
originally announced November 2016.
-
Benchmarking inverse statistical approaches for protein structure and design with exactly solvable models
Authors:
Hugo Jacquin,
Amy Gilson,
Eugene Shakhnovich,
Simona Cocco,
Rémi Monasson
Abstract:
Inverse statistical approaches to determine protein structure and function from Multiple Sequence Alignments (MSA) are emerging as powerful tools in computational biology. However, the underlying assumptions of the relationship between the inferred effective Potts Hamiltonian and real protein structure and energetics remain untested so far. Here we use a lattice protein (LP) model to benchmark those inverse statistical approaches. We build MSA of highly stable sequences in target LP structures, and infer the effective pairwise Potts Hamiltonians from those MSA. We find that inferred Potts Hamiltonians reproduce many important aspects of 'true' LP structures and energetics. Careful analysis reveals that effective pairwise couplings in inferred Potts Hamiltonians depend not only on the energetics of the native structure but also on competing folds; in particular, the coupling values reflect both positive design (stabilization of native conformation) and negative design (destabilization of competing folds). In addition to providing detailed structural information, the inferred Potts models used as protein Hamiltonian for design of new sequences are able to generate with high probability completely new sequences with the desired folds, which is not possible using independent-site models. Those are remarkable results as the effective LP Hamiltonians used to generate MSA are not simple pairwise models due to the competition between the folds. Our findings elucidate the reasons for the success of inverse approaches to the modelling of proteins from sequence data, and their limitations.
Submitted 15 November, 2016;
originally announced November 2016.
-
A collective phase in resource competition in a highly diverse ecosystem
Authors:
Mikhail Tikhonov,
Remi Monasson
Abstract:
Organisms shape their own environment, which in turn affects their survival. This feedback becomes especially important for communities containing a large number of species; however, few existing approaches allow studying this regime, except in simulations. Here, we use methods of statistical physics to analytically solve a classic ecological model of resource competition introduced by MacArthur in 1969. We show that the non-intuitive phenomenology of highly diverse ecosystems includes a phase where the environment constructed by the community becomes fully decoupled from the outside world.
Submitted 5 September, 2016;
originally announced September 2016.
-
On the entropy of protein families
Authors:
John Barton,
Arup Chakraborty,
Simona Cocco,
Hugo Jacquin,
Rémi Monasson
Abstract:
Proteins are essential components of living systems, capable of performing a huge variety of tasks at the molecular level, such as recognition, signalling, copy, transport, etc. The protein sequences realizing a given function may largely vary across organisms, giving rise to a protein family. Here, we estimate the entropy of those families based on different approaches, including Hidden Markov Models used for protein databases and inferred statistical models reproducing the low-order (1- and 2-point) statistics of multi-sequence alignments. We also compute the entropic cost, that is, the loss in entropy resulting from a constraint acting on the protein, such as the fixation of one particular amino-acid on a specific site, and relate this notion to the escape probability of the HIV virus. The case of lattice proteins, for which the entropy can be computed exactly, allows us to provide another illustration of the concept of cost, due to the competition of different folds. The relevance of the entropy in relation to directed evolution experiments is stressed.
Submitted 26 December, 2015;
originally announced December 2015.
-
Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction
Authors:
Eleonora De Leonardis,
Benjamin Lutz,
Sebastian Ratz,
Simona Cocco,
Remi Monasson,
Alexander Schug,
Martin Weigt
Abstract:
Despite the biological importance of non-coding RNA, their structural characterization remains challenging. Making use of the rapidly growing sequence databases, we analyze nucleotide coevolution across homologous sequences via Direct-Coupling Analysis to detect nucleotide-nucleotide contacts. For a representative set of riboswitches, we show that the results of Direct-Coupling Analysis in combination with a generalized Nussinov algorithm systematically improve the results of RNA secondary structure prediction beyond traditional covariance approaches based on mutual information. Even more importantly, we show that the results of Direct-Coupling Analysis are enriched in tertiary structure contacts. By integrating these predictions into molecular modeling tools, systematically improved tertiary structure predictions can be obtained, as compared to using secondary structure information alone.
Submitted 12 October, 2015;
originally announced October 2015.
-
Transitions between spatial attractors in place-cell networks
Authors:
R Monasson,
S Rosay
Abstract:
The spontaneous transitions between D-dimensional spatial maps in an attractor neural network are studied. Two scenarios for the transition from one map to another are found, depending on the level of noise: (1) through a mixed state, partly localized in both maps around positions where the maps are most similar; (2) through a weakly localized state in one of the two maps, followed by a condensation in the arrival map. Our predictions are confirmed by numerical simulations, and qualitatively compared to recent recordings of hippocampal place cells during quick-environment-changing experiments in rats.
Submitted 21 July, 2015;
originally announced July 2015.
-
Learning probabilities from random observables in high dimensions: the maximum entropy distribution and others
Authors:
Tomoyuki Obuchi,
Simona Cocco,
Rémi Monasson
Abstract:
We consider the problem of learning a target probability distribution over a set of $N$ binary variables from the knowledge of the expectation values (with this target distribution) of $M$ observables, drawn uniformly at random. The space of all probability distributions compatible with these $M$ expectation values within some fixed accuracy, called version space, is studied. We introduce a biased measure over the version space, which gives a boost increasing exponentially with the entropy of the distributions and with an arbitrary inverse `temperature' $Γ$. The choice of $Γ$ allows us to interpolate smoothly between the unbiased measure over all distributions in the version space ($Γ=0$) and the pointwise measure concentrated at the maximum entropy distribution ($Γ\to \infty$). Using the replica method we compute the volume of the version space and other quantities of interest, such as the distance $R$ between the target distribution and the center-of-mass distribution over the version space, as functions of $α=(\log M)/N$ and $Γ$ for large $N$. Phase transitions at critical values of $α$ are found, corresponding to qualitative improvements in the learning of the target distribution and to the decrease of the distance $R$. However, for fixed $α$, the distance $R$ does not vary with $Γ$, which means that the maximum entropy distribution is not closer to the target distribution than any other distribution compatible with the observable values. Our results are confirmed by Monte Carlo sampling of the version space for small system sizes ($N\le 10$).
Submitted 21 July, 2015; v1 submitted 10 March, 2015;
originally announced March 2015.
-
Estimating the principal components of correlation matrices from all their empirical eigenvectors
Authors:
Rémi Monasson,
Dario Villamaina
Abstract:
We consider the problem of estimating the principal components of a population correlation matrix from a limited number of measurement data. Using a combination of random matrix and information-theoretic tools, we show that all the eigenmodes of the sample correlation matrices are informative, and not only the top ones. We show how this information can be exploited when prior information about the principal component, such as whether it is localized or not, is available by mapping the estimation problem onto the search for the ground state of a spin-glass-like effective Hamiltonian encoding the prior. Results are illustrated numerically on the spiked covariance model.
Submitted 30 November, 2015; v1 submitted 1 March, 2015;
originally announced March 2015.
-
Stochastic Ratchet Mechanisms for Replacement of Proteins Bound to DNA
Authors:
Simona Cocco,
John F. Marko,
Remi Monasson
Abstract:
Experiments indicate that unbinding rates of proteins from DNA can depend on the concentration of proteins in nearby solution. Here we present a theory of multi-step replacement of DNA-bound proteins by solution-phase proteins. For four different kinetic scenarios we calculate the dependence of protein unbinding and replacement rates on solution protein concentration. We find (1) strong effects of progressive 'rezipping' of the solution-phase protein onto DNA sites liberated by 'unzipping' of the originally bound protein; (2) that a model in which solution-phase proteins bind non-specifically to DNA can describe experiments on exchanges between the non-specific DNA-binding proteins Fis-Fis and Fis-HU; (3) that a binding specific model describes experiments on the exchange of CueR proteins on specific binding sites.
Submitted 26 May, 2014;
originally announced May 2014.
-
Large Pseudo-Counts and $L_2$-Norm Penalties Are Necessary for the Mean-Field Inference of Ising and Potts Models
Authors:
J. P. Barton,
S. Cocco,
E. De Leonardis,
R. Monasson
Abstract:
Mean field (MF) approximation offers a simple, fast way to infer direct interactions between elements in a network of correlated variables, a common, computationally challenging problem with practical applications in fields ranging from physics and biology to the social sciences. However, MF methods achieve their best performance with strong regularization, well beyond Bayesian expectations, an empirical fact that is poorly understood. In this work, we study the influence of pseudo-count and $L_2$-norm regularization schemes on the quality of inferred Ising or Potts interaction networks from correlation data within the MF approximation. We argue, based on the analysis of small systems, that the optimal value of the regularization strength remains finite even if the sampling noise tends to zero, in order to correct for systematic biases introduced by the MF approximation. Our claim is corroborated by extensive numerical studies of diverse model systems and by the analytical study of the $m$-component spin model, for large but finite $m$. Additionally we find that pseudo-count regularization is robust against sampling noise, and often outperforms $L_2$-norm regularization, particularly when the underlying network of interactions is strongly heterogeneous. Much better performances are generally obtained for the Ising model than for the Potts model, for which only couplings incoming onto medium-frequency symbols are reliably inferred.
Submitted 1 May, 2014;
originally announced May 2014.
-
Crosstalk and transitions between multiple spatial maps in an attractor neural network model of the hippocampus: Collective motion of the activity (II)
Authors:
Rémi Monasson,
Sophie Rosay
Abstract:
The dynamics of a neural model for hippocampal place cells storing spatial maps is studied. In the absence of external input, depending on the number of cells and on the values of control parameters (number of environments stored, level of neural noise, average level of activity, connectivity of place cells), a 'clump' of spatially-localized activity can diffuse, or remain pinned due to crosstalk between the environments. In the single-environment case, the macroscopic coefficient of diffusion of the clump and its effective mobility are calculated analytically from first principles, and corroborated by numerical simulations. In the multi-environment case the heights and the widths of the pinning barriers are analytically characterized with the replica method; diffusion within one map is then in competition with transitions between different maps. Possible mechanisms enhancing mobility are proposed and tested.
Submitted 13 February, 2014; v1 submitted 11 October, 2013;
originally announced October 2013.
-
Cross-talk and transitions between multiple spatial maps in an attractor neural network model of the hippocampus: phase diagram (I)
Authors:
Rémi Monasson,
Sophie Rosay
Abstract:
We study the stable phases of an attractor neural network model, with binary units, for hippocampal place cells encoding 1D or 2D spatial maps or environments. Using statistical mechanics tools we show that, below critical values for the noise in the neural response and for the number of environments, the network activity is spatially localized in one environment. We calculate the number of stored environments. For high noise and loads the network activity extends over space, either uniformly or with spatial heterogeneities due to the cross-talk between the maps, and memory of environments is lost. Analytical predictions are corroborated by numerical simulations.
Submitted 4 April, 2013;
originally announced April 2013.
-
From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction
Authors:
Simona Cocco,
Remi Monasson,
Martin Weigt
Abstract:
Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous proteins to extract functional and structural information. Among those are principal component analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis (DCA), a global inference method based on the maximum entropy principle, which aims at predicting residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we introduce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts model allows us to identify relevant 'patterns' of residues from the knowledge of the eigenmodes and eigenvalues of the residue-residue correlation matrix. We show how the computation of such statistical patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of parameters than DCA. This dimensional reduction allows us to avoid overfitting and to extract contact information from multiple-sequence alignments of reduced size. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, are important to recover structural information: the corresponding patterns are highly localized, that is, they are concentrated in few sites, which we find to be in close contact in the three-dimensional protein fold.
Submitted 27 August, 2013; v1 submitted 13 December, 2012;
originally announced December 2012.
-
Adaptive cluster expansion for the inverse Ising problem: convergence, algorithm and tests
Authors:
Simona Cocco,
Rémi Monasson
Abstract:
We present a procedure to solve the inverse Ising problem, that is to find the interactions between a set of binary variables from the measure of their equilibrium correlations. The method consists in constructing and selecting specific clusters of variables, based on their contributions to the cross-entropy of the Ising model. Small contributions are discarded to avoid overfitting and to make the computation tractable. The properties of the cluster expansion and its performances on synthetic data are studied. To make the implementation easier we give the pseudo-code of the algorithm.
Submitted 25 October, 2011;
originally announced October 2011.
-
High-Dimensional Inference with the generalized Hopfield Model: Principal Component Analysis and Corrections
Authors:
Simona Cocco,
Remi Monasson,
Vitor Sessak
Abstract:
We consider the problem of inferring the interactions between a set of N binary variables from the knowledge of their frequencies and pairwise correlations. The inference framework is based on the Hopfield model, a special case of the Ising model where the interaction matrix is defined through a set of patterns in the variable space, and is of rank much smaller than N. We show that Maximum Likelihood inference is deeply related to Principal Component Analysis when the amplitude of the pattern components, xi, is negligible compared to N^1/2. Using techniques from statistical mechanics, we calculate the corrections to the patterns to the first order in xi/N^1/2. We stress that it is important to generalize the Hopfield model and include both attractive and repulsive patterns, to correctly infer networks with sparse and strong interactions. We present a simple geometrical criterion to decide how many attractive and repulsive patterns should be considered as a function of the sampling noise. We moreover discuss how many sampled configurations are required for a good inference, as a function of the system size, N, and of the amplitude, xi. The inference approach is illustrated on synthetic and biological data.
Submitted 19 April, 2011;
originally announced April 2011.
-
Fast Inference of Interactions in Assemblies of Stochastic Integrate-and-Fire Neurons from Spike Recordings
Authors:
Remi Monasson,
Simona Cocco
Abstract:
We present two Bayesian procedures to infer the interactions and external currents in an assembly of stochastic integrate-and-fire neurons from the recording of their spiking activity. The first procedure is based on the exact calculation of the most likely time courses of the neuron membrane potentials conditioned by the recorded spikes, and is exact for a vanishing noise variance and for an instantaneous synaptic integration. The second procedure takes into account the presence of fluctuations around the most likely time courses of the potentials, and can deal with moderate noise levels. The running time of both procedures is proportional to the number S of spikes multiplied by the squared number N of neurons. The algorithms are validated on synthetic data generated by networks with known couplings and currents. We also reanalyze previously published recordings of the activity of the salamander retina (including from 32 to 40 neurons, and from 65,000 to 170,000 spikes). We study the dependence of the inferred interactions on the membrane leaking time; the differences and similarities with the classical cross-correlation analysis are discussed.
Submitted 25 February, 2011;
originally announced February 2011.
-
Adaptive Cluster Expansion for Inferring Boltzmann Machines with Noisy Data
Authors:
Simona Cocco,
Rémi Monasson
Abstract:
We introduce a procedure to infer the interactions among a set of binary variables, based on their sampled frequencies and pairwise correlations. The algorithm builds the clusters of variables contributing most to the entropy of the inferred Ising model, and rejects the small contributions due to the sampling noise. Our procedure successfully recovers benchmark Ising models even at criticality and in the low temperature phase, and is applied to neurobiological data.
Submitted 16 February, 2011;
originally announced February 2011.
-
Theory of spike timing based neural classifiers
Authors:
Ran Rubin,
Remi Monasson,
Haim Sompolinsky
Abstract:
We study the computational capacity of a model neuron, the Tempotron, which classifies sequences of spikes by linear-threshold operations. We use statistical mechanics and extreme value theory to derive the capacity of the system in random classification tasks. In contrast to its static analog, the Perceptron, the Tempotron's solutions space consists of a large number of small clusters of weight vectors. The capacity of the system per synapse is finite in the large size limit and weakly diverges with the stimulus duration relative to the membrane and synaptic time constants.
Submitted 26 October, 2010;
originally announced October 2010.
-
On the trajectories and performance of Infotaxis, an information-based greedy search algorithm
Authors:
Carlo Barbieri,
Simona Cocco,
Rémi Monasson
Abstract:
We present a continuous-space version of Infotaxis, a search algorithm where a searcher greedily moves to maximize the gain in information about the position of the target to be found. Using a combination of analytical and numerical tools we study the nature of the trajectories in two and three dimensions. The probability that the search is successful and the running time of the search are estimated. A possible extension to non-greedy search is suggested.
Submitted 16 March, 2011; v1 submitted 13 October, 2010;
originally announced October 2010.
-
Dynamical modelling of molecular constructions and setups for DNA unzipping
Authors:
Carlo Barbieri,
Simona Cocco,
Remi Monasson,
Francesco Zamponi
Abstract:
We present a dynamical model of DNA mechanical unzipping under the action of a force. The model includes the motion of the fork in the sequence-dependent landscape, the trap(s) acting on the bead(s), and the polymeric components of the molecular construction (unzipped single strands of DNA, and linkers). Different setups are considered to test the model, and the outcome of the simulations is compared to simpler dynamical models existing in the literature where polymers are assumed to be at equilibrium.
Submitted 5 December, 2008;
originally announced December 2008.
-
Small-correlation expansions for the inverse Ising problem
Authors:
Vitor Sessak,
Rémi Monasson
Abstract:
We present a systematic small-correlation expansion to solve the inverse Ising problem: find a set of couplings and fields corresponding to a given set of correlations and magnetizations. Couplings are calculated up to the third order in the correlations for generic magnetizations, and to the seventh order in the case of zero magnetizations; in addition we show how to sum some useful classes of diagrams exactly. The resulting expansion outperforms existing algorithms on the Sherrington-Kirkpatrick spin-glass model.
Submitted 21 November, 2008;
originally announced November 2008.
-
A review of the Statistical Mechanics approach to Random Optimization Problems
Authors:
Fabrizio Altarelli,
Remi Monasson,
Guilhem Semerjian,
Francesco Zamponi
Abstract:
We review the connection between statistical mechanics and the analysis of random optimization problems, with particular emphasis on the random k-SAT problem. We discuss and characterize the different phase transitions that are met in these problems, starting from basic concepts. We also discuss how statistical mechanics methods can be used to investigate the behavior of local search and decimation based algorithms.
Submitted 13 February, 2008;
originally announced February 2008.
-
Relationship between clustering and algorithmic phase transitions in the random k-XORSAT model and its NP-complete extensions
Authors:
Fabrizio Altarelli,
Remi Monasson,
Francesco Zamponi
Abstract:
We study the performances of stochastic heuristic search algorithms on Uniquely Extendible Constraint Satisfaction Problems with random inputs. We show that, for any heuristic preserving the Poissonian nature of the underlying instance, the (heuristic-dependent) largest ratio $α_a$ of constraints per variables for which a search algorithm is likely to find solutions is smaller than the critical ratio $α_d$ above which solutions are clustered and highly correlated. In addition we show that the clustering ratio can be reached when the number k of variables per constraints goes to infinity by the so-called Generalized Unit Clause heuristic.
Submitted 18 October, 2007; v1 submitted 4 September, 2007;
originally announced September 2007.
-
Inferring DNA sequences from mechanical unzipping data: the large-bandwidth case
Authors:
Valentina Baldazzi,
Serena Bradde,
Simona Cocco,
Enzo Marinari,
Remi Monasson
Abstract:
The complementary strands of DNA molecules can be separated when stretched apart by a force; the unzipping signal is correlated to the base content of the sequence but is affected by thermal and instrumental noise. We consider here the ideal case where opening events are known to a very good time resolution (very large bandwidth), and study how the sequence can be reconstructed from the unzipping data. Our approach relies on the use of statistical Bayesian inference and of the Viterbi decoding algorithm. Performances are studied numerically on Monte Carlo generated data, and analytically. We show how multiple unzippings of the same molecule may be exploited to improve the quality of the prediction, and calculate analytically the number of required unzippings as a function of the bandwidth, the sequence content, and the elasticity parameters of the unzipped strands.
Submitted 19 April, 2007;
originally announced April 2007.
-
Reconstructing a Random Potential from its Random Walks
Authors:
Simona Cocco,
Remi Monasson
Abstract:
The problem of how many trajectories of a random walker in a potential are needed to reconstruct the values of this potential is studied. We show that this problem can be solved by calculating the probability of survival of an abstract random walker in a partially absorbing potential. The approach is illustrated on the discrete Sinai (random force) model with a drift. We determine the parameter (temperature, duration of each trajectory, ...) values making reconstruction as fast as possible.
Submitted 19 April, 2007;
originally announced April 2007.