-
Implicitly Guided Design with PropEn: Match your Data to Follow the Gradient
Authors:
Nataša Tagasovska,
Vladimir Gligorijević,
Kyunghyun Cho,
Andreas Loukas
Abstract:
Across scientific domains, generating new models or optimizing existing ones while meeting specific criteria is crucial. Traditional machine learning frameworks for guided design use a generative model and a surrogate model (discriminator), requiring large datasets. However, real-world scientific applications often have limited data and complex landscapes, making data-hungry models inefficient or impractical. We propose a new framework, PropEn, inspired by "matching", which enables implicit guidance without training a discriminator. By matching each sample with a similar one that has a better property value, we create a larger training dataset that inherently indicates the direction of improvement. Matching, combined with an encoder-decoder architecture, forms a domain-agnostic generative framework for property enhancement. We show that training with a matched dataset approximates the gradient of the property of interest while remaining within the data distribution, allowing efficient design optimization. Extensive evaluations in toy problems and scientific applications, such as therapeutic protein design and airfoil optimization, demonstrate PropEn's advantages over common baselines. Notably, the protein design results are validated with wet lab experiments, confirming the competitiveness and effectiveness of our approach.
Submitted 28 May, 2024;
originally announced May 2024.
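The matching step described in the PropEn abstract (pairing each sample with a similar one that has a better property value) might look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the function name `match_pairs`, the Euclidean distance, the single `max_dist` threshold, and the "higher is better" convention are all hypothetical choices made here.

```python
import math

def match_pairs(xs, ys, max_dist):
    """Return index pairs (i, j) where sample j is close to sample i
    and has a strictly better (here: higher) property value.

    xs: list of feature tuples; ys: list of property values.
    max_dist: assumed similarity threshold on Euclidean distance.
    """
    pairs = []
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            if i != j and yj > yi and math.dist(xi, xj) <= max_dist:
                pairs.append((i, j))
    return pairs

# Toy data: four 1-D designs whose property grows with x.
xs = [(0.0,), (0.1,), (0.2,), (0.3,)]
ys = [1.0, 2.0, 3.0, 4.0]
pairs = match_pairs(xs, ys, max_dist=0.25)
print(pairs)
```

In this toy case four samples yield five matched pairs, which illustrates how matching can enlarge the training set. Each pair `(i, j)` would then serve as an (input, target) example for an encoder-decoder model.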
-
AbDiffuser: Full-Atom Generation of in vitro Functioning Antibodies
Authors:
Karolis Martinkus,
Jan Ludwiczak,
Kyunghyun Cho,
Wei-Ching Liang,
Julien Lafrance-Vanasse,
Isidro Hotzel,
Arvind Rajpal,
Yan Wu,
Richard Bonneau,
Vladimir Gligorijevic,
Andreas Loukas
Abstract:
We introduce AbDiffuser, an equivariant and physics-informed diffusion model for the joint generation of antibody 3D structures and sequences. AbDiffuser is built on top of a new representation of protein structure, relies on a novel architecture for aligned proteins, and utilizes strong diffusion priors to improve the denoising process. Our approach improves protein diffusion by taking advantage of domain knowledge and physics-based constraints; handles sequence-length changes; and reduces memory complexity by an order of magnitude, enabling backbone and side chain generation. We validate AbDiffuser in silico and in vitro. Numerical experiments showcase the ability of AbDiffuser to generate antibodies that closely track the sequence and structural properties of a reference set. Laboratory experiments confirm that all 16 HER2 antibodies discovered were expressed at high levels and that 57.1% of the selected designs were tight binders.
Submitted 6 March, 2024; v1 submitted 28 July, 2023;
originally announced August 2023.
-
Protein Discovery with Discrete Walk-Jump Sampling
Authors:
Nathan C. Frey,
Daniel Berenberg,
Karina Zadorozhny,
Joseph Kleinhenz,
Julien Lafrance-Vanasse,
Isidro Hotzel,
Yan Wu,
Stephen Ra,
Richard Bonneau,
Kyunghyun Cho,
Andreas Loukas,
Vladimir Gligorijevic,
Saeed Saremi
Abstract:
We resolve difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising. Our Discrete Walk-Jump Sampling formalism combines the contrastive divergence training of an energy-based model and improved sample quality of a score-based model, while simplifying training and sampling by requiring only a single noise level. We evaluate the robustness of our approach on generative modeling of antibody proteins and introduce the distributional conformity score to benchmark protein generative models. By optimizing and sampling from our models for the proposed distributional conformity score, 97-100% of generated samples are successfully expressed and purified and 70% of functional designs show equal or improved binding affinity compared to known functional antibodies on the first attempt in a single round of laboratory experiments. We also report the first demonstration of long-run fast-mixing MCMC chains where diverse antibody protein classes are visited in a single MCMC chain.
Submitted 15 March, 2024; v1 submitted 8 June, 2023;
originally announced June 2023.
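The walk-jump idea in the abstract above (Langevin sampling on a smoothed density at a single noise level, then one-step denoising back toward the data) can be illustrated on a continuous toy problem. The sketch below is only an assumption-laden illustration: it uses a 1-D Gaussian with an analytic score in place of the learned energy function, plain unadjusted Langevin updates, and hypothetical constants `MU`, `S`, and `SIGMA`. It does not handle discrete sequences as the paper does.

```python
import math
import random

MU, S, SIGMA = 2.0, 0.5, 1.0   # toy data mean/std and the single noise level (assumed)
VAR_Y = S**2 + SIGMA**2        # variance of the smoothed density p(y)

def score(y):
    """Gradient of log p(y) for the smoothed Gaussian toy density."""
    return -(y - MU) / VAR_Y

def walk(y, steps, step_size, rng):
    """'Walk': unadjusted Langevin updates on the smoothed density."""
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        y = y + step_size * score(y) + math.sqrt(2 * step_size) * noise
    return y

def jump(y):
    """'Jump': one-step denoising, x_hat = y + sigma^2 * score(y)."""
    return y + SIGMA**2 * score(y)

rng = random.Random(0)
y = walk(y=10.0, steps=500, step_size=0.05, rng=rng)
x_hat = jump(y)
print(y, x_hat)
```

In this toy setting the jump contracts the noisy sample toward the data mean by the factor `S**2 / VAR_Y`. In the setting the abstract describes, the score would instead come from a trained model.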
-
A Pareto-optimal compositional energy-based model for sampling and optimization of protein sequences
Authors:
Nataša Tagasovska,
Nathan C. Frey,
Andreas Loukas,
Isidro Hötzel,
Julien Lafrance-Vanasse,
Ryan Lewis Kelly,
Yan Wu,
Arvind Rajpal,
Richard Bonneau,
Kyunghyun Cho,
Stephen Ra,
Vladimir Gligorijević
Abstract:
Deep generative models have emerged as a popular machine learning-based approach for inverse design problems in the life sciences. However, these problems often require sampling new designs that satisfy multiple properties of interest in addition to learning the data distribution. This multi-objective optimization becomes more challenging when properties are independent or orthogonal to each other. In this work, we propose a Pareto-compositional energy-based model (pcEBM), a framework that uses multiple gradient descent for sampling new designs that adhere to various constraints in optimizing distinct properties. We demonstrate its ability to learn non-convex Pareto fronts and generate sequences that simultaneously satisfy multiple desired properties across a series of real-world antibody design tasks.
Submitted 19 October, 2022;
originally announced October 2022.
-
PropertyDAG: Multi-objective Bayesian optimization of partially ordered, mixed-variable properties for biological sequence design
Authors:
Ji Won Park,
Samuel Stanton,
Saeed Saremi,
Andrew Watkins,
Henri Dwyer,
Vladimir Gligorijevic,
Richard Bonneau,
Stephen Ra,
Kyunghyun Cho
Abstract:
Bayesian optimization offers a sample-efficient framework for navigating the exploration-exploitation trade-off in the vast design space of biological sequences. Whereas it is possible to optimize the various properties of interest jointly using a multi-objective acquisition function, such as the expected hypervolume improvement (EHVI), this approach does not account for objectives with a hierarchical dependency structure. We consider a common use case where some regions of the Pareto frontier are prioritized over others according to a specified partial ordering in the objectives. For instance, when designing antibodies, we would like to maximize the binding affinity to a target antigen only if it can be expressed in live cell culture, which models the experimental dependency in which affinity can only be measured for antibodies that can be expressed and thus produced in viable quantities. In general, we may want to confer a partial ordering to the properties such that each property is optimized conditioned on its parent properties satisfying some feasibility condition. To this end, we present PropertyDAG, a framework that operates on top of the traditional multi-objective BO to impose this desired ordering on the objectives, e.g. expression → affinity. We demonstrate its performance over multiple simulated active learning iterations on a penicillin production task, a toy numerical problem, and a real-world antibody design task.
Submitted 8 October, 2022;
originally announced October 2022.
-
Multi-segment preserving sampling for deep manifold sampler
Authors:
Daniel Berenberg,
Jae Hyeon Lee,
Simon Kelow,
Ji Won Park,
Andrew Watkins,
Vladimir Gligorijević,
Richard Bonneau,
Stephen Ra,
Kyunghyun Cho
Abstract:
Deep generative modeling for biological sequences presents a unique challenge in reconciling the bias-variance trade-off between explicit biological insight and model flexibility. The deep manifold sampler was recently proposed as a means to iteratively sample variable-length protein sequences by exploiting the gradients from a function predictor. We introduce an alternative approach to this guided sampling procedure, multi-segment preserving sampling, that enables the direct inclusion of domain-specific knowledge by designating preserved and non-preserved segments along the input sequence, thereby restricting variation to only select regions. We present its effectiveness in the context of antibody design by training two models: a deep manifold sampler and a GPT-2 language model on nearly six million heavy chain sequences annotated with the IGHV1-18 gene. During sampling, we restrict variation to only the complementarity-determining region 3 (CDR3) of the input. We obtain log probability scores from a GPT-2 model for each sampled CDR3 and demonstrate that multi-segment preserving sampling generates reasonable designs while maintaining the desired, preserved regions.
Submitted 9 May, 2022;
originally announced May 2022.
-
Non-Negative Matrix Factorizations for Multiplex Network Analysis
Authors:
Vladimir Gligorijevic,
Yannis Panagakis,
Stefanos Zafeiriou
Abstract:
Networks have been a general tool for representing, analyzing, and modeling relational data arising in several domains. One of the most important aspects of network analysis is community detection or network clustering. Until recently, the major focus has been on discovering community structure in single (i.e., monoplex) networks. However, with the advent of relational data with multiple modalities, multiplex networks, i.e., networks composed of multiple layers representing different aspects of relations, have emerged. Consequently, community detection in multiplex networks, i.e., detecting clusters of nodes shared by all layers, has become a new challenge. In this paper, we propose Network Fusion for Composite Community Extraction (NF-CCE), a new class of algorithms, based on four different non-negative matrix factorization models, capable of extracting composite communities in multiplex networks. Each algorithm works in two steps: first, it finds a non-negative, low-dimensional feature representation of each network layer; then, it fuses the feature representations of the layers into a common non-negative, low-dimensional feature representation via collective factorization. The composite clusters are extracted from the common feature representation. We demonstrate the superior performance of our algorithms over the state-of-the-art methods on various types of multiplex networks, including biological, social, economic, citation, phone communication, and brain multiplex networks.
Submitted 25 January, 2017; v1 submitted 30 November, 2016;
originally announced December 2016.
-
The Dynamics of Emotional Chats with Bots: Experiment and Agent-Based Simulations
Authors:
Bosiljka Tadic,
Vladimir Gligorijevic,
Marcin Skowron,
Milovan Suvakov
Abstract:
Quantitative research of emotions in psychology and machine-learning methods for extracting emotion components from text messages open an avenue for physical science to explore the nature of stochastic processes in which emotions play a role, e.g., in human dynamics online. Here, we investigate the occurrence of collective behavior of users that is induced by chats with emotional Bots. The Bots, designed in an experimental environment, are considered. Furthermore, using the agent-based modeling approach, the activity of these experimental Bots is simulated within a social network of interacting emotional agents. Quantitative analysis of time series carrying emotional messages by agents suggests temporal correlations and persistent fluctuations with clustering according to emotion similarity. (All data used in this study are fully anonymized.)
Submitted 16 April, 2014;
originally announced April 2014.
-
Master thesis: Growth and Self-Organization Processes in Directed Social Network
Authors:
Vladimir Gligorijevic
Abstract:
A large dataset collected from the Ubuntu chat channel is studied as a complex dynamical system with emergent collective behaviour of users. With the appropriate network mappings we examined the rich topological structure of the Ubuntu network. The structure of this network is determined by computing different topological measures. The directed, weighted network, which is a suitable representation of the dataset from the Ubuntu chat channel, is characterized by power-law dependencies of various quantities, hierarchical organization, and disassortative mixing patterns. Beyond the topological features, the emergent collective state is further quantified by analysis of time series of user activities driven by emotions. Analysis of the time series reveals self-organized dynamics with long-range temporal correlations in user actions.
Submitted 14 December, 2013; v1 submitted 16 March, 2013;
originally announced March 2013.
-
Structure and stability of online chat networks built on emotion-carrying links
Authors:
Vladimir Gligorijevic,
Marcin Skowron,
Bosiljka Tadic
Abstract:
High-resolution data of online chats are studied as a physical system in the laboratory in order to quantify collective behavior of users. Our analysis reveals strong regularities characteristic of natural systems with additional features. In particular, we find self-organized dynamics with long-range correlations in user actions and persistent associations among users that have the properties of a social network. Furthermore, the evolution of the graph and its architecture with specific k-core structure are shown to be related to the type and the emotional arousal of exchanged messages. Partitioning of the graph by deletion of the links which carry high-arousal messages exhibits critical fluctuations at the percolation threshold.
Submitted 21 September, 2012;
originally announced September 2012.
-
How the online social networks are used: Dialogs-based structure of MySpace
Authors:
Milovan Suvakov,
Marija Mitrovic,
Vladimir Gligorijevic,
Bosiljka Tadic
Abstract:
Quantitative study of collective dynamics in online social networks is a new challenge based on the abundance of empirical data. Conclusions, however, may depend on factors such as users' psychological profiles and their reasons for using online contacts. In this paper we have compiled and analyzed two datasets from MySpace. The data contain networked dialogs occurring within a specified time depth, high temporal resolution, and texts of messages, in which the emotion valence is assessed by using the SentiStrength classifier. Performing a comprehensive analysis, we obtain three groups of results: The dynamic topology of the dialogs-based networks has a characteristic structure with Zipf's distribution of communities, low link reciprocity, and disassortative correlations. Overlaps supporting the "weak-ties" hypothesis are found to follow the laws recently conjectured for online games. Long-range temporal correlations and persistent fluctuations occur in the time series of messages carrying positive (negative) emotion. Patterns of user communications have dominant positive emotion (attractiveness) and a strong impact of circadian cycles and interactivity times longer than one day. Taken together, these results give a new insight into the functioning of online social networks and unveil the importance of the amount of information and emotion that is communicated along the social links. (All data used in this study are fully anonymized.)
Submitted 28 June, 2012;
originally announced June 2012.