-
Protein Discovery with Discrete Walk-Jump Sampling
Authors:
Nathan C. Frey,
Daniel Berenberg,
Karina Zadorozhny,
Joseph Kleinhenz,
Julien Lafrance-Vanasse,
Isidro Hotzel,
Yan Wu,
Stephen Ra,
Richard Bonneau,
Kyunghyun Cho,
Andreas Loukas,
Vladimir Gligorijevic,
Saeed Saremi
Abstract:
We resolve difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising. Our Discrete Walk-Jump Sampling formalism combines the contrastive divergence training of an energy-based model and imp…
▽ More
We resolve difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising. Our Discrete Walk-Jump Sampling formalism combines the contrastive divergence training of an energy-based model and improved sample quality of a score-based model, while simplifying training and sampling by requiring only a single noise level. We evaluate the robustness of our approach on generative modeling of antibody proteins and introduce the distributional conformity score to benchmark protein generative models. By optimizing and sampling from our models for the proposed distributional conformity score, 97-100% of generated samples are successfully expressed and purified and 70% of functional designs show equal or improved binding affinity compared to known functional antibodies on the first attempt in a single round of laboratory experiments. We also report the first demonstration of long-run fast-mixing MCMC chains where diverse antibody protein classes are visited in a single MCMC chain.
△ Less
Submitted 15 March, 2024; v1 submitted 8 June, 2023;
originally announced June 2023.
-
Protein Design with Guided Discrete Diffusion
Authors:
Nate Gruver,
Samuel Stanton,
Nathan C. Frey,
Tim G. J. Rudner,
Isidro Hotzel,
Julien Lafrance-Vanasse,
Arvind Rajpal,
Kyunghyun Cho,
Andrew Gordon Wilson
Abstract:
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling. The generative model samples plausible sequences while the discriminative model guides a search for sequences with high fitness. Given its broad success in conditional sampling, classifier-guided diffusion modeling is a promising foundation for protein design, leading many to…
▽ More
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling. The generative model samples plausible sequences while the discriminative model guides a search for sequences with high fitness. Given its broad success in conditional sampling, classifier-guided diffusion modeling is a promising foundation for protein design, leading many to develop guided diffusion models for structure with inverse folding to recover sequences. In this work, we propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models that follows gradients in the hidden states of the denoising network. NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods, including scarce data and challenging inverse design. Moreover, we use NOS to generalize LaMBO, a Bayesian optimization procedure for sequence design that facilitates multiple objectives and edit-based constraints. The resulting method, LaMBO-2, enables discrete diffusions and stronger performance with limited edits through a novel application of saliency maps. We apply LaMBO-2 to a real-world protein design task, optimizing antibodies for higher expression yield and binding affinity to several therapeutic targets under locality and developability constraints, attaining a 99% expression rate and 40% binding rate in exploratory in vitro experiments.
△ Less
Submitted 12 December, 2023; v1 submitted 31 May, 2023;
originally announced May 2023.
-
SupSiam: Non-contrastive Auxiliary Loss for Learning from Molecular Conformers
Authors:
Michael Maser,
Ji Won Park,
Joshua Yao-Yu Lin,
Jae Hyeon Lee,
Nathan C. Frey,
Andrew Watkins
Abstract:
We investigate Siamese networks for learning related embeddings for augmented samples of molecular conformers. We find that a non-contrastive (positive-pair only) auxiliary task aids in supervised training of Euclidean neural networks (E3NNs) and increases manifold smoothness (MS) around point-cloud geometries. We demonstrate this property for multiple drug-activity prediction tasks while maintain…
▽ More
We investigate Siamese networks for learning related embeddings for augmented samples of molecular conformers. We find that a non-contrastive (positive-pair only) auxiliary task aids in supervised training of Euclidean neural networks (E3NNs) and increases manifold smoothness (MS) around point-cloud geometries. We demonstrate this property for multiple drug-activity prediction tasks while maintaining relevant performance metrics, and propose an extension of MS to probabilistic and regression settings. We provide an analysis of representation collapse, finding substantial effects of task-weighting, latent dimension, and regularization. We expect the presented protocol to aid in the development of reliable E3NNs from molecular conformers, even for small-data drug discovery programs.
△ Less
Submitted 15 February, 2023;
originally announced February 2023.
-
A Green(er) World for A.I
Authors:
Dan Zhao,
Nathan C. Frey,
Joseph McDonald,
Matthew Hubbell,
David Bestor,
Michael Jones,
Andrew Prout,
Vijay Gadepally,
Siddharth Samsi
Abstract:
As research and practice in artificial intelligence (A.I.) grow in leaps and bounds, the resources necessary to sustain and support their operations also grow at an increasing pace. While innovations and applications from A.I. have brought significant advances, from applications to vision and natural language to improvements to fields like medical imaging and materials engineering, their costs sho…
▽ More
As research and practice in artificial intelligence (A.I.) grow in leaps and bounds, the resources necessary to sustain and support their operations also grow at an increasing pace. While innovations and applications from A.I. have brought significant advances, from applications to vision and natural language to improvements to fields like medical imaging and materials engineering, their costs should not be neglected. As we embrace a world with ever-increasing amounts of data as well as research and development of A.I. applications, we are sure to face an ever-mounting energy footprint to sustain these computational budgets, data storage needs, and more. But, is this sustainable and, more importantly, what kind of setting is best positioned to nurture such sustainable A.I. in both research and practice? In this paper, we outline our outlook for Green A.I. -- a more sustainable, energy-efficient and energy-aware ecosystem for develo** A.I. across the research, computing, and practitioner communities alike -- and the steps required to arrive there. We present a bird's eye view of various areas for potential changes and improvements from the ground floor of AI's operational and hardware optimizations for datacenters/HPCs to the current incentive structures in the world of A.I. research and practice, and more. We hope these points will spur further discussion, and action, on some of these issues and their potential solutions.
△ Less
Submitted 27 January, 2023;
originally announced January 2023.
-
Graph Contrastive Learning for Materials
Authors:
Teddy Koker,
Keegan Quigley,
Will Spaeth,
Nathan C. Frey,
Lin Li
Abstract:
Recent work has shown the potential of graph neural networks to efficiently predict material properties, enabling high-throughput screening of materials. Training these models, however, often requires large quantities of labelled data, obtained via costly methods such as ab initio calculations or experimental evaluation. By leveraging a series of material-specific transformations, we introduce Cry…
▽ More
Recent work has shown the potential of graph neural networks to efficiently predict material properties, enabling high-throughput screening of materials. Training these models, however, often requires large quantities of labelled data, obtained via costly methods such as ab initio calculations or experimental evaluation. By leveraging a series of material-specific transformations, we introduce CrystalCLR, a framework for constrastive learning of representations with crystal graph neural networks. With the addition of a novel loss function, our framework is able to learn representations competitive with engineered fingerprinting methods. We also demonstrate that via model finetuning, contrastive pretraining can improve the performance of graph neural networks for prediction of material properties and significantly outperform traditional ML models that use engineered fingerprints. Lastly, we observe that CrystalCLR produces material representations that form clusters by compound class.
△ Less
Submitted 23 November, 2022;
originally announced November 2022.
-
A Pareto-optimal compositional energy-based model for sampling and optimization of protein sequences
Authors:
Nataša Tagasovska,
Nathan C. Frey,
Andreas Loukas,
Isidro Hötzel,
Julien Lafrance-Vanasse,
Ryan Lewis Kelly,
Yan Wu,
Arvind Rajpal,
Richard Bonneau,
Kyunghyun Cho,
Stephen Ra,
Vladimir Gligorijević
Abstract:
Deep generative models have emerged as a popular machine learning-based approach for inverse design problems in the life sciences. However, these problems often require sampling new designs that satisfy multiple properties of interest in addition to learning the data distribution. This multi-objective optimization becomes more challenging when properties are independent or orthogonal to each other…
▽ More
Deep generative models have emerged as a popular machine learning-based approach for inverse design problems in the life sciences. However, these problems often require sampling new designs that satisfy multiple properties of interest in addition to learning the data distribution. This multi-objective optimization becomes more challenging when properties are independent or orthogonal to each other. In this work, we propose a Pareto-compositional energy-based model (pcEBM), a framework that uses multiple gradient descent for sampling new designs that adhere to various constraints in optimizing distinct properties. We demonstrate its ability to learn non-convex Pareto fronts and generate sequences that simultaneously satisfy multiple desired properties across a series of real-world antibody design tasks.
△ Less
Submitted 19 October, 2022;
originally announced October 2022.
-
SELFIES and the future of molecular string representations
Authors:
Mario Krenn,
Qianxiang Ai,
Senja Barthel,
Nessa Carson,
Angelo Frei,
Nathan C. Frey,
Pascal Friederich,
Théophile Gaudin,
Alberto Alexander Gayle,
Kevin Maik Jablonka,
Rafael F. Lameiro,
Dominik Lemm,
Alston Lo,
Seyed Mohamad Moosavi,
José Manuel Nápoles-Duarte,
AkshatKumar Nigam,
Robert Pollice,
Kohulan Rajan,
Ulrich Schatzschneider,
Philippe Schwaller,
Marta Skreta,
Berend Smit,
Felix Strieth-Kalthoff,
Chong Sun,
Gary Tom
, et al. (6 additional authors not shown)
Abstract:
Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool…
▽ More
Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings -- most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100\% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
△ Less
Submitted 31 March, 2022;
originally announced April 2022.
-
Benchmarking Resource Usage for Efficient Distributed Deep Learning
Authors:
Nathan C. Frey,
Baolin Li,
Joseph McDonald,
Dan Zhao,
Michael Jones,
David Bestor,
Devesh Tiwari,
Vijay Gadepally,
Siddharth Samsi
Abstract:
Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. Neural architecture searches, hyperparameter sweeps, and rapid prototy** consume immense resources that can prevent resource-constrained researchers from experimenting with large models and carry considerable environmental impact. As such, it becomes essential to understand how…
▽ More
Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. Neural architecture searches, hyperparameter sweeps, and rapid prototy** consume immense resources that can prevent resource-constrained researchers from experimenting with large models and carry considerable environmental impact. As such, it becomes essential to understand how different deep neural networks (DNNs) and training leverage increasing compute and energy resources -- especially specialized computationally-intensive models across different domains and applications.
In this paper, we conduct over 3,400 experiments training an array of deep networks representing various domains/tasks -- natural language processing, computer vision, and chemistry -- on up to 424 graphics processing units (GPUs). During training, our experiments systematically vary compute resource characteristics and energy-saving mechanisms such as power utilization and GPU clock rate limits to capture and illustrate the different trade-offs and scaling behaviors each representative model exhibits under various resource and energy-constrained regimes. We fit power law models that describe how training time scales with available compute resources and energy constraints. We anticipate that these findings will help inform and guide high-performance computing providers in optimizing resource utilization, by selectively reducing energy consumption for different deep learning tasks/workflows with minimal impact on training.
△ Less
Submitted 28 January, 2022;
originally announced January 2022.
-
FastFlows: Flow-Based Models for Molecular Graph Generation
Authors:
Nathan C. Frey,
Vijay Gadepally,
Bharath Ramsundar
Abstract:
We propose a framework using normalizing-flow based models, SELF-Referencing Embedded Strings, and multi-objective optimization that efficiently generates small molecules. With an initial training set of only 100 small molecules, FastFlows generates thousands of chemically valid molecules in seconds. Because of the efficient sampling, substructure filters can be applied as desired to eliminate com…
▽ More
We propose a framework using normalizing-flow based models, SELF-Referencing Embedded Strings, and multi-objective optimization that efficiently generates small molecules. With an initial training set of only 100 small molecules, FastFlows generates thousands of chemically valid molecules in seconds. Because of the efficient sampling, substructure filters can be applied as desired to eliminate compounds with unreasonable moieties. Using easily computable and learned metrics for druglikeness, synthetic accessibility, and synthetic complexity, we perform a multi-objective optimization to demonstrate how FastFlows functions in a high-throughput virtual screening context. Our model is significantly simpler and easier to train than autoregressive molecular generative models, and enables fast generation and identification of druglike, synthesizable molecules.
△ Less
Submitted 28 January, 2022;
originally announced January 2022.
-
Bringing Atomistic Deep Learning to Prime Time
Authors:
Nathan C. Frey,
Siddharth Samsi,
Bharath Ramsundar,
Connor W. Coley,
Vijay Gadepally
Abstract:
Artificial intelligence has not yet revolutionized the design of materials and molecules. In this perspective, we identify four barriers preventing the integration of atomistic deep learning, molecular science, and high-performance computing. We outline focused research efforts to address the opportunities presented by these challenges.
Artificial intelligence has not yet revolutionized the design of materials and molecules. In this perspective, we identify four barriers preventing the integration of atomistic deep learning, molecular science, and high-performance computing. We outline focused research efforts to address the opportunities presented by these challenges.
△ Less
Submitted 9 December, 2021;
originally announced December 2021.
-
Scalable Geometric Deep Learning on Molecular Graphs
Authors:
Nathan C. Frey,
Siddharth Samsi,
Joseph McDonald,
Lin Li,
Connor W. Coley,
Vijay Gadepally
Abstract:
Deep learning in molecular and materials sciences is limited by the lack of integration between applied science, artificial intelligence, and high-performance computing. Bottlenecks with respect to the amount of training data, the size and complexity of model architectures, and the scale of the compute infrastructure are all key factors limiting the scaling of deep learning for molecules and mater…
▽ More
Deep learning in molecular and materials sciences is limited by the lack of integration between applied science, artificial intelligence, and high-performance computing. Bottlenecks with respect to the amount of training data, the size and complexity of model architectures, and the scale of the compute infrastructure are all key factors limiting the scaling of deep learning for molecules and materials. Here, we present $\textit{LitMatter}$, a lightweight framework for scaling molecular deep learning methods. We train four graph neural network architectures on over 400 GPUs and investigate the scaling behavior of these methods. Depending on the model architecture, training time speedups up to $60\times$ are seen. Empirical neural scaling relations quantify the model-dependent scaling and enable optimal compute resource allocation and the identification of scalable molecular geometric deep learning model implementations.
△ Less
Submitted 6 December, 2021;
originally announced December 2021.
-
The Pseudo Projection Operator: Applications of Deep Learning to Projection Based Filtering in Non-Trivial Frequency Regimes
Authors:
Matthew L. Weiss,
Nathan C. Frey,
Siddharth Samsi,
Randy C. Paffenroth,
Vijay Gadepally
Abstract:
Traditional frequency based projection filters, or projection operators (PO), separate signal and noise through a series of transformations which remove frequencies where noise is present. However, this technique relies on a priori knowledge of what frequencies contain signal and noise and that these frequencies do not overlap, which is difficult to achieve in practice. To address these issues, we…
▽ More
Traditional frequency based projection filters, or projection operators (PO), separate signal and noise through a series of transformations which remove frequencies where noise is present. However, this technique relies on a priori knowledge of what frequencies contain signal and noise and that these frequencies do not overlap, which is difficult to achieve in practice. To address these issues, we introduce a PO-neural network hybrid model, the Pseudo Projection Operator (PPO), which leverages a neural network to perform frequency selection. We compare the filtering capabilities of a PPO, PO, and denoising autoencoder (DAE) on the University of Rochester Multi-Modal Music Performance Dataset with a variety of added noise types. In the majority of experiments, the PPO outperforms both the PO and DAE. Based upon these results, we suggest future application of the PPO to filtering problems in the physical and biological sciences.
△ Less
Submitted 13 April, 2022; v1 submitted 13 November, 2021;
originally announced November 2021.
-
High-throughput search for magnetic and topological order in transition metal oxides
Authors:
Nathan C. Frey,
Matthew K. Horton,
Jason M. Munro,
Sinéad M. Griffin,
Kristin A. Persson,
Vivek B. Shenoy
Abstract:
The discovery of intrinsic magnetic topological order in $\rm MnBi_2Te_4$ has invigorated the search for materials with coexisting magnetic and topological phases. These multi-order quantum materials are expected to exhibit new topological phases that can be tuned with magnetic fields, but the search for such materials is stymied by difficulties in predicting magnetic structure and stability. Here…
▽ More
The discovery of intrinsic magnetic topological order in $\rm MnBi_2Te_4$ has invigorated the search for materials with coexisting magnetic and topological phases. These multi-order quantum materials are expected to exhibit new topological phases that can be tuned with magnetic fields, but the search for such materials is stymied by difficulties in predicting magnetic structure and stability. Here, we compute over 27,000 unique magnetic orderings for over 3,000 transition metal oxides in the Materials Project database to determine their magnetic ground states and estimate their effective exchange parameters and critical temperatures. We perform a high-throughput band topology analysis of centrosymmetric magnetic materials, calculate topological invariants, and identify 18 new candidate ferromagnetic topological semimetals, axion insulators, and antiferromagnetic topological insulators. To accelerate future efforts, machine learning classifiers are trained to predict both magnetic ground states and magnetic topological order without requiring first-principles calculations.
△ Less
Submitted 1 June, 2020;
originally announced June 2020.
-
Universal fluctuations in growth dynamics of economic systems
Authors:
Nathan C. Frey,
Sakib Matin,
H. Eugene Stanley,
Michael Salinger
Abstract:
The growth of business firms is an example of a system of complex interacting units that resembles complex interacting systems in nature such as earthquakes. Remarkably, work in econophysics has provided evidence that the statistical properties of the growth of business firms follow the same sorts of power laws that characterize physical systems near their critical points. Given how economies chan…
▽ More
The growth of business firms is an example of a system of complex interacting units that resembles complex interacting systems in nature such as earthquakes. Remarkably, work in econophysics has provided evidence that the statistical properties of the growth of business firms follow the same sorts of power laws that characterize physical systems near their critical points. Given how economies change over time, whether these statistical properties are persistent, robust, and universal like those of physical systems remains an open question. Here, we show that the scaling properties of firm growth previously demonstrated for publicly-traded U.S. manufacturing firms from 1974 to 1993 apply to the same sorts of firms from 1993 to 2015, to firms in other broad sectors (such as materials), and to firms in new sectors (such as Internet services). We measure virtually the same scaling exponent for manufacturing for the 1993 to 2015 period as for the 1974 to 1993 period and virtually the same scaling exponent for other sectors as for manufacturing. Furthermore, we show that fluctuations of the growth rate for new industries self-organize into a power law over relatively short time scales.
△ Less
Submitted 21 May, 2018; v1 submitted 5 December, 2017;
originally announced December 2017.