Search | arXiv e-print repository

Role of Structural and Conformational Diversity for Machine Learning Potentials

Authors: Nikhil Shenoy, Prudencio Tossou, Emmanuel Noutahi, Hadrien Mary, Dominique Beaini, Jiarui Ding

Abstract: In the field of Machine Learning Interatomic Potentials (MLIPs), understanding the intricate relationship between data biases, specifically conformational and structural diversity, and model generalization is critical in improving the quality of Quantum Mechanics (QM) data generation efforts. We investigate these dynamics through two distinct experiments: a fixed budget one, where the dataset size… ▽ More In the field of Machine Learning Interatomic Potentials (MLIPs), understanding the intricate relationship between data biases, specifically conformational and structural diversity, and model generalization is critical in improving the quality of Quantum Mechanics (QM) data generation efforts. We investigate these dynamics through two distinct experiments: a fixed budget one, where the dataset size remains constant, and a fixed molecular set one, which focuses on fixed structural diversity while varying conformational diversity. Our results reveal nuanced patterns in generalization metrics. Notably, for optimal structural and conformational generalization, a careful balance between structural and conformational diversity is required, but existing QM datasets do not meet that trade-off. Additionally, our results highlight the limitation of the MLIP models at generalizing beyond their training distribution, emphasizing the importance of defining applicability domain during model deployment. These findings provide valuable insights and guidelines for QM data generation efforts. △ Less

Submitted 30 October, 2023; originally announced November 2023.

Comments: Accepted at NeurIPS 2023 AI4D3 and AI4S workshops

arXiv:2310.10773 [pdf, other]

Gotta be SAFE: A New Framework for Molecular Design

Authors: Emmanuel Noutahi, Cristian Gabellini, Michael Craig, Jonathan S. C Lim, Prudencio Tossou

Abstract: Traditional molecular string representations, such as SMILES, often pose challenges for AI-driven molecular design due to their non-sequential depiction of molecular substructures. To address this issue, we introduce Sequential Attachment-based Fragment Embedding (SAFE), a novel line notation for chemical structures. SAFE reimagines SMILES strings as an unordered sequence of interconnected fragmen… ▽ More Traditional molecular string representations, such as SMILES, often pose challenges for AI-driven molecular design due to their non-sequential depiction of molecular substructures. To address this issue, we introduce Sequential Attachment-based Fragment Embedding (SAFE), a novel line notation for chemical structures. SAFE reimagines SMILES strings as an unordered sequence of interconnected fragment blocks while maintaining compatibility with existing SMILES parsers. It streamlines complex generative tasks, including scaffold decoration, fragment linking, polymer generation, and scaffold hop**, while facilitating autoregressive generation for fragment-constrained design, thereby eliminating the need for intricate decoding or graph-based models. We demonstrate the effectiveness of SAFE by training an 87-million-parameter GPT2-like model on a dataset containing 1.1 billion SAFE representations. Through targeted experimentation, we show that our SAFE-GPT model exhibits versatile and robust optimization performance. SAFE opens up new avenues for the rapid exploration of chemical space under various constraints, promising breakthroughs in AI-driven molecular design. △ Less

Submitted 10 December, 2023; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: Code, data and models available at: https://github.com/datamol-io/safe/

arXiv:2004.14308 [pdf, other]

doi 10.1021/acsomega.0c04153

Molecular Design in Synthetically Accessible Chemical Space via Deep Reinforcement Learning

Authors: Julien Horwood, Emmanuel Noutahi

Abstract: The fundamental goal of generative drug design is to propose optimized molecules that meet predefined activity, selectivity, and pharmacokinetic criteria. Despite recent progress, we argue that existing generative methods are limited in their ability to favourably shift the distributions of molecular properties during optimization. We instead propose a novel Reinforcement Learning framework for mo… ▽ More The fundamental goal of generative drug design is to propose optimized molecules that meet predefined activity, selectivity, and pharmacokinetic criteria. Despite recent progress, we argue that existing generative methods are limited in their ability to favourably shift the distributions of molecular properties during optimization. We instead propose a novel Reinforcement Learning framework for molecular design in which an agent learns to directly optimize through a space of synthetically-accessible drug-like molecules. This becomes possible by defining transitions in our Markov Decision Process as chemical reactions, and allows us to leverage synthetic routes as an inductive bias. We validate our method by demonstrating that it outperforms existing state-of the art approaches in the optimization of pharmacologically-relevant objectives, while results on multi-objective optimization tasks suggest increased scalability to realistic pharmaceutical design problems. △ Less

Submitted 16 October, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

arXiv:1905.11577 [pdf, other]

Towards Interpretable Sparse Graph Representation Learning with Laplacian Pooling

Authors: Emmanuel Noutahi, Dominique Beaini, Julien Horwood, Sébastien Giguère, Prudencio Tossou

Abstract: Recent work in graph neural networks (GNNs) has led to improvements in molecular activity and property prediction tasks. Unfortunately, GNNs often fail to capture the relative importance of interactions between molecular substructures, in part due to the absence of efficient intermediate pooling steps. To address these issues, we propose LaPool (Laplacian Pooling), a novel, data-driven, and interp… ▽ More Recent work in graph neural networks (GNNs) has led to improvements in molecular activity and property prediction tasks. Unfortunately, GNNs often fail to capture the relative importance of interactions between molecular substructures, in part due to the absence of efficient intermediate pooling steps. To address these issues, we propose LaPool (Laplacian Pooling), a novel, data-driven, and interpretable hierarchical graph pooling method that takes into account both node features and graph structure to improve molecular representation. We benchmark LaPool on molecular graph prediction and understanding tasks and show that it outperforms recent GNNs. Interestingly, LaPool also remains competitive on non-molecular tasks. Both quantitative and qualitative assessments are done to demonstrate LaPool's improved interpretability and highlight its potential benefits in drug design. Finally, we demonstrate LaPool's utility for the generation of valid and novel molecules by incorporating it into an adversarial autoencoder. △ Less

Submitted 2 April, 2020; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: 11 pages, with Appendices

Showing 1–4 of 4 results for author: Noutahi, E