Implicitly Guided Design with PropEn: Match your Data to Follow the Gradient

Authors: Nataša Tagasovska, Vladimir Gligorijević, Kyunghyun Cho, Andreas Loukas

Abstract: Across scientific domains, generating new models or optimizing existing ones while meeting specific criteria is crucial. Traditional machine learning frameworks for guided design use a generative model and a surrogate model (discriminator), requiring large datasets. However, real-world scientific applications often have limited data and complex landscapes, making data-hungry models inefficient or impractical. We propose a new framework, PropEn, inspired by "matching", which enables implicit guidance without training a discriminator. By matching each sample with a similar one that has a better property value, we create a larger training dataset that inherently indicates the direction of improvement. Matching, combined with an encoder-decoder architecture, forms a domain-agnostic generative framework for property enhancement. We show that training with a matched dataset approximates the gradient of the property of interest while remaining within the data distribution, allowing efficient design optimization. Extensive evaluations in toy problems and scientific applications, such as therapeutic protein design and airfoil optimization, demonstrate PropEn's advantages over common baselines. Notably, the protein design results are validated with wet lab experiments, confirming the competitiveness and effectiveness of our approach.

Submitted 28 May, 2024; originally announced May 2024.
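
The matching step described in the PropEn abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `match_pairs`, the thresholds `dx_max` and `dy_min`, the Euclidean distance, and the toy property are all assumptions made for illustration. It pairs each sample with every nearby sample whose property value is strictly better.

```python
import numpy as np

def match_pairs(x, y, dx_max, dy_min=0.0):
    """Return index pairs (i, j) where x[j] is close to x[i] and y[j] is better.

    dx_max and dy_min are hypothetical thresholds, not values from the paper.
    """
    # Pairwise Euclidean distances between all samples.
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    # gain[i, j] = y[j] - y[i]; positive means j improves on i.
    gain = y[None, :] - y[:, None]
    # Strict inequality on gain also excludes self-pairs when dy_min >= 0.
    mask = (dists <= dx_max) & (gain > dy_min)
    return np.argwhere(mask)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))      # toy designs
y = -np.sum(x**2, axis=1)         # toy property, largest at the origin
pairs = match_pairs(x, y, dx_max=0.5)

# Matched dataset: each input is paired with a similar, better target.
inputs, targets = x[pairs[:, 0]], x[pairs[:, 1]]
```

Under these assumptions, `inputs` and `targets` would serve as the source and reconstruction target of an encoder-decoder model, so the model learns to map a design to a nearby improved one.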

arXiv:2308.05027 [pdf, other]

AbDiffuser: Full-Atom Generation of in vitro Functioning Antibodies

Authors: Karolis Martinkus, Jan Ludwiczak, Kyunghyun Cho, Wei-Ching Liang, Julien Lafrance-Vanasse, Isidro Hotzel, Arvind Rajpal, Yan Wu, Richard Bonneau, Vladimir Gligorijevic, Andreas Loukas

Abstract: We introduce AbDiffuser, an equivariant and physics-informed diffusion model for the joint generation of antibody 3D structures and sequences. AbDiffuser is built on top of a new representation of protein structure, relies on a novel architecture for aligned proteins, and utilizes strong diffusion priors to improve the denoising process. Our approach improves protein diffusion by taking advantage of domain knowledge and physics-based constraints; handles sequence-length changes; and reduces memory complexity by an order of magnitude, enabling backbone and side chain generation. We validate AbDiffuser in silico and in vitro. Numerical experiments showcase the ability of AbDiffuser to generate antibodies that closely track the sequence and structural properties of a reference set. Laboratory experiments confirm that all 16 HER2 antibodies discovered were expressed at high levels and that 57.1% of the selected designs were tight binders.

Submitted 6 March, 2024; v1 submitted 28 July, 2023; originally announced August 2023.

Comments: NeurIPS 2023

arXiv:2306.12360 [pdf, other]

Protein Discovery with Discrete Walk-Jump Sampling

Authors: Nathan C. Frey, Daniel Berenberg, Karina Zadorozhny, Joseph Kleinhenz, Julien Lafrance-Vanasse, Isidro Hotzel, Yan Wu, Stephen Ra, Richard Bonneau, Kyunghyun Cho, Andreas Loukas, Vladimir Gligorijevic, Saeed Saremi

Abstract: We resolve difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising. Our Discrete Walk-Jump Sampling formalism combines the contrastive divergence training of an energy-based model and improved sample quality of a score-based model, while simplifying training and sampling by requiring only a single noise level. We evaluate the robustness of our approach on generative modeling of antibody proteins and introduce the distributional conformity score to benchmark protein generative models. By optimizing and sampling from our models for the proposed distributional conformity score, 97-100% of generated samples are successfully expressed and purified and 70% of functional designs show equal or improved binding affinity compared to known functional antibodies on the first attempt in a single round of laboratory experiments. We also report the first demonstration of long-run fast-mixing MCMC chains where diverse antibody protein classes are visited in a single MCMC chain.

Submitted 15 March, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: ICLR 2024 oral presentation, top 1.2% of submissions; {ICLR 2023 Physics for Machine Learning, NeurIPS 2023 GenBio, MLCB 2023} Spotlight
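
The walk-then-jump idea in the abstract above (Langevin MCMC on a smoothed density, then one-step denoising at a single noise level) can be sketched on a toy problem. This is only an illustration under strong assumptions: the paper learns an energy function over discrete protein sequences, whereas this sketch swaps in a one-dimensional two-point dataset whose smoothed score is known in closed form, and uses plain overdamped Langevin updates. `SIGMA`, `MODES`, and all function names are hypothetical.

```python
import numpy as np

SIGMA = 0.5                       # single noise level (assumed value)
MODES = np.array([-2.0, 2.0])     # toy "clean data": two equally weighted points

def smoothed_score(y):
    """Score d/dy log p_sigma(y) of the Gaussian-smoothed two-point density."""
    # Posterior weight of each clean point given the noisy value y.
    logw = -(y - MODES) ** 2 / (2 * SIGMA**2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    posterior_mean = np.sum(w * MODES)
    # For Gaussian smoothing, score(y) = (E[x | y] - y) / sigma^2.
    return (posterior_mean - y) / SIGMA**2

def walk(y0, n_steps, step, rng):
    """Overdamped Langevin MCMC on the smoothed density (the 'walk')."""
    y = y0
    for _ in range(n_steps):
        y = y + step * smoothed_score(y) + np.sqrt(2 * step) * rng.normal()
    return y

def jump(y):
    """One-step denoising back toward the clean data (the 'jump')."""
    return y + SIGMA**2 * smoothed_score(y)

rng = np.random.default_rng(0)
y_noisy = walk(y0=0.3, n_steps=500, step=0.01, rng=rng)
x_hat = jump(y_noisy)
```

In this toy setting the jump equals the posterior mean of the clean point given the noisy sample, so `x_hat` always lies between the two modes and sits close to one of them whenever the chain is near a mode.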

arXiv:2210.10838 [pdf, other]

A Pareto-optimal compositional energy-based model for sampling and optimization of protein sequences

Authors: Nataša Tagasovska, Nathan C. Frey, Andreas Loukas, Isidro Hötzel, Julien Lafrance-Vanasse, Ryan Lewis Kelly, Yan Wu, Arvind Rajpal, Richard Bonneau, Kyunghyun Cho, Stephen Ra, Vladimir Gligorijević

Abstract: Deep generative models have emerged as a popular machine learning-based approach for inverse design problems in the life sciences. However, these problems often require sampling new designs that satisfy multiple properties of interest in addition to learning the data distribution. This multi-objective optimization becomes more challenging when properties are independent or orthogonal to each other. In this work, we propose a Pareto-compositional energy-based model (pcEBM), a framework that uses multiple gradient descent for sampling new designs that adhere to various constraints in optimizing distinct properties. We demonstrate its ability to learn non-convex Pareto fronts and generate sequences that simultaneously satisfy multiple desired properties across a series of real-world antibody design tasks.

Submitted 19 October, 2022; originally announced October 2022.

arXiv:2210.04096 [pdf, other]

PropertyDAG: Multi-objective Bayesian optimization of partially ordered, mixed-variable properties for biological sequence design

Authors: Ji Won Park, Samuel Stanton, Saeed Saremi, Andrew Watkins, Henri Dwyer, Vladimir Gligorijevic, Richard Bonneau, Stephen Ra, Kyunghyun Cho

Abstract: Bayesian optimization offers a sample-efficient framework for navigating the exploration-exploitation trade-off in the vast design space of biological sequences. Whereas it is possible to optimize the various properties of interest jointly using a multi-objective acquisition function, such as the expected hypervolume improvement (EHVI), this approach does not account for objectives with a hierarchical dependency structure. We consider a common use case where some regions of the Pareto frontier are prioritized over others according to a specified partial ordering in the objectives. For instance, when designing antibodies, we would like to maximize the binding affinity to a target antigen only if it can be expressed in live cell culture -- modeling the experimental dependency in which affinity can only be measured for antibodies that can be expressed and thus produced in viable quantities. In general, we may want to confer a partial ordering to the properties such that each property is optimized conditioned on its parent properties satisfying some feasibility condition. To this end, we present PropertyDAG, a framework that operates on top of the traditional multi-objective BO to impose this desired ordering on the objectives, e.g. expression → affinity. We demonstrate its performance over multiple simulated active learning iterations on a penicillin production task, a toy numerical problem, and a real-world antibody design task.

Submitted 8 October, 2022; originally announced October 2022.

Comments: 9 pages, 7 figures. Submitted to NeurIPS 2022 AI4Science Workshop

arXiv:2205.04259 [pdf, other]

Multi-segment preserving sampling for deep manifold sampler

Authors: Daniel Berenberg, Jae Hyeon Lee, Simon Kelow, Ji Won Park, Andrew Watkins, Vladimir Gligorijević, Richard Bonneau, Stephen Ra, Kyunghyun Cho

Abstract: Deep generative modeling for biological sequences presents a unique challenge in reconciling the bias-variance trade-off between explicit biological insight and model flexibility. The deep manifold sampler was recently proposed as a means to iteratively sample variable-length protein sequences by exploiting the gradients from a function predictor. We introduce an alternative approach to this guided sampling procedure, multi-segment preserving sampling, that enables the direct inclusion of domain-specific knowledge by designating preserved and non-preserved segments along the input sequence, thereby restricting variation to only select regions. We present its effectiveness in the context of antibody design by training two models: a deep manifold sampler and a GPT-2 language model on nearly six million heavy chain sequences annotated with the IGHV1-18 gene. During sampling, we restrict variation to only the complementarity-determining region 3 (CDR3) of the input. We obtain log probability scores from a GPT-2 model for each sampled CDR3 and demonstrate that multi-segment preserving sampling generates reasonable designs while maintaining the desired, preserved regions.

Submitted 9 May, 2022; originally announced May 2022.

arXiv:1612.00750 [pdf, other]

Non-Negative Matrix Factorizations for Multiplex Network Analysis

Authors: Vladimir Gligorijevic, Yannis Panagakis, Stefanos Zafeiriou

Abstract: Networks have been a general tool for representing, analyzing, and modeling relational data arising in several domains. One of the most important aspects of network analysis is community detection or network clustering. Until recently, the major focus has been on discovering community structure in single (i.e., monoplex) networks. However, with the advent of relational data with multiple modalities, multiplex networks, i.e., networks composed of multiple layers representing different aspects of relations, have emerged. Consequently, community detection in multiplex networks, i.e., detecting clusters of nodes shared by all layers, has become a new challenge. In this paper, we propose Network Fusion for Composite Community Extraction (NF-CCE), a new class of algorithms, based on four different non-negative matrix factorization models, capable of extracting composite communities in multiplex networks. Each algorithm works in two steps: first, it finds a non-negative, low-dimensional feature representation of each network layer; then, it fuses the feature representation of layers into a common non-negative, low-dimensional feature representation via collective factorization. The composite clusters are extracted from the common feature representation. We demonstrate the superior performance of our algorithms over the state-of-the-art methods on various types of multiplex networks, including biological, social, economic, citation, phone communication, and brain multiplex networks.

Submitted 25 January, 2017; v1 submitted 30 November, 2016; originally announced December 2016.

Comments: 12 pages, 4 figures, 3 tables
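
The first step named in the abstract above, finding a non-negative low-dimensional representation of a single network layer, can be sketched with generic non-negative matrix factorization. This is not one of the four NF-CCE models and omits the fusion step entirely; it uses the standard Lee-Seung multiplicative updates for the Frobenius objective as a stand-in. The function name `nmf`, the toy adjacency matrix, and all parameter values are assumptions.

```python
import numpy as np

def nmf(a, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix a ~ w @ h with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = a.shape
    w = rng.random((n, rank)) + eps
    h = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so w and h stay >= 0.
        h *= (w.T @ a) / (w.T @ w @ h + eps)
        w *= (a @ h.T) / (w @ h @ h.T + eps)
    return w, h

# Toy single-layer network: two fully connected groups of three nodes each
# (self-loops included so each block is all ones).
a = np.zeros((6, 6))
a[:3, :3] = 1.0
a[3:, 3:] = 1.0

w, h = nmf(a, rank=2)
# A simple cluster assignment: the dominant factor of each node.
labels = np.argmax(w, axis=1)
error = np.linalg.norm(a - w @ h)
```

Reading cluster labels from the dominant factor per node is one common convention for NMF-based clustering, chosen here for brevity.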

arXiv:1404.4191 [pdf, ps, other]

The Dynamics of Emotional Chats with Bots: Experiment and Agent-Based Simulations

Authors: Bosiljka Tadic, Vladimir Gligorijevic, Marcin Skowron, Milovan Suvakov

Abstract: Quantitative research of emotions in psychology and machine-learning methods for extracting emotion components from text messages open an avenue for physical science to explore the nature of stochastic processes in which emotions play a role, e.g., in human dynamics online. Here, we investigate the occurrence of collective behavior of users that is induced by chats with emotional Bots. The Bots, designed in an experimental environment, are considered. Furthermore, using the agent-based modeling approach, the activity of these experimental Bots is simulated within a social network of interacting emotional agents. Quantitative analysis of time series carrying emotional messages by agents suggests temporal correlations and persistent fluctuations with clustering according to emotion similarity. All data used in this study are fully anonymized.

Submitted 16 April, 2014; originally announced April 2014.

Comments: 16 pages, 11 figures

Journal ref: ScienceJet 2014, 3: 50

arXiv:1303.3990

Master thesis: Growth and Self-Organization Processes in Directed Social Network

Authors: Vladimir Gligorijevic

Abstract: A large dataset collected from the Ubuntu chat channel is studied as a complex dynamical system with emergent collective behaviour of users. With the appropriate network mappings we examined the rich topological structure of the Ubuntu network. The structure of this network is determined by computing different topological measures. The directed, weighted network, which is a suitable representation of the dataset from the Ubuntu chat channel, is characterized by power-law dependencies of various quantities, hierarchical organization and disassortative mixing patterns. Beyond the topological features, the emergent collective state is further quantified by analysis of time series of user activities driven by emotions. Analysis of time series reveals self-organized dynamics with long-range temporal correlations in user actions.

Submitted 14 December, 2013; v1 submitted 16 March, 2013; originally announced March 2013.

Comments: This paper has been withdrawn due to its incompleteness

arXiv:1209.4760 [pdf, ps, other]

doi 10.1016/j.physa.2012.10.003

Structure and stability of online chat networks built on emotion-carrying links

Authors: Vladimir Gligorijevic, Marcin Skowron, Bosiljka Tadic

Abstract: High-resolution data of online chats are studied as a physical system in the laboratory in order to quantify collective behavior of users. Our analysis reveals strong regularities characteristic of natural systems with additional features. In particular, we find self-organized dynamics with long-range correlations in user actions and persistent associations among users that have the properties of a social network. Furthermore, the evolution of the graph and its architecture with specific k-core structure are shown to be related to the type and the emotion arousal of exchanged messages. Partitioning of the graph by deletion of the links which carry high arousal messages exhibits critical fluctuations at the percolation threshold.

Submitted 21 September, 2012; originally announced September 2012.

Comments: 10 pages, 5 figures

arXiv:1206.6588 [pdf, ps, other]

How the online social networks are used: Dialogs-based structure of MySpace

Authors: Milovan Suvakov, Marija Mitrovic, Vladimir Gligorijevic, Bosiljka Tadic

Abstract: Quantitative study of collective dynamics in online social networks is a new challenge based on the abundance of empirical data. Conclusions, however, may depend on factors such as users' psychological profiles and their reasons to use the online contacts. In this paper we have compiled and analyzed two datasets from MySpace. The data contain networked dialogs occurring within a specified time depth, high temporal resolution, and texts of messages, in which the emotion valence is assessed by using the SentiStrength classifier. Performing a comprehensive analysis we obtain three groups of results: Dynamic topology of the dialogs-based networks has a characteristic structure with Zipf's distribution of communities, low link reciprocity, and disassortative correlations. Overlaps supporting the "weak-ties" hypothesis are found to follow the laws recently conjectured for online games. Long-range temporal correlations and persistent fluctuations occur in the time series of messages carrying positive (negative) emotion. Patterns of user communications have dominant positive emotion (attractiveness) and strong impact of circadian cycles and interactivity times longer than one day. Taken together, these results give a new insight into the functioning of online social networks and unveil the importance of the amount of information and emotion that is communicated along the social links. (All data used in this study are fully anonymized.)

Submitted 28 June, 2012; originally announced June 2012.

Comments: 18 pages, 12 figures (resized to 50KB)

Showing 1–11 of 11 results for author: Gligorijević, V