Role of Structural and Conformational Diversity for Machine Learning Potentials

Authors: Nikhil Shenoy, Prudencio Tossou, Emmanuel Noutahi, Hadrien Mary, Dominique Beaini, Jiarui Ding

Abstract: In the field of Machine Learning Interatomic Potentials (MLIPs), understanding the intricate relationship between data biases, specifically conformational and structural diversity, and model generalization is critical in improving the quality of Quantum Mechanics (QM) data generation efforts. We investigate these dynamics through two distinct experiments: a fixed budget one, where the dataset size remains constant, and a fixed molecular set one, which focuses on fixed structural diversity while varying conformational diversity. Our results reveal nuanced patterns in generalization metrics. Notably, for optimal structural and conformational generalization, a careful balance between structural and conformational diversity is required, but existing QM datasets do not meet that trade-off. Additionally, our results highlight the limitation of the MLIP models at generalizing beyond their training distribution, emphasizing the importance of defining applicability domain during model deployment. These findings provide valuable insights and guidelines for QM data generation efforts.

Submitted 30 October, 2023; originally announced November 2023.

Comments: Accepted at NeurIPS 2023 AI4D3 and AI4S workshops

arXiv:2310.10773 [pdf, other]

Gotta be SAFE: A New Framework for Molecular Design

Authors: Emmanuel Noutahi, Cristian Gabellini, Michael Craig, Jonathan S. C Lim, Prudencio Tossou

Abstract: Traditional molecular string representations, such as SMILES, often pose challenges for AI-driven molecular design due to their non-sequential depiction of molecular substructures. To address this issue, we introduce Sequential Attachment-based Fragment Embedding (SAFE), a novel line notation for chemical structures. SAFE reimagines SMILES strings as an unordered sequence of interconnected fragment blocks while maintaining compatibility with existing SMILES parsers. It streamlines complex generative tasks, including scaffold decoration, fragment linking, polymer generation, and scaffold hopping, while facilitating autoregressive generation for fragment-constrained design, thereby eliminating the need for intricate decoding or graph-based models. We demonstrate the effectiveness of SAFE by training an 87-million-parameter GPT2-like model on a dataset containing 1.1 billion SAFE representations. Through targeted experimentation, we show that our SAFE-GPT model exhibits versatile and robust optimization performance. SAFE opens up new avenues for the rapid exploration of chemical space under various constraints, promising breakthroughs in AI-driven molecular design.

Submitted 10 December, 2023; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: Code, data and models available at: https://github.com/datamol-io/safe/

arXiv:2310.04292 [pdf, other]

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Authors: Dominique Beaini, Shenyang Huang, Joao Alex Cunha, Zhiyi Li, Gabriela Moisescu-Pareja, Oleksandr Dymov, Samuel Maddrell-Mander, Callum McLean, Frederik Wenkel, Luis Müller, Jama Hussein Mohamud, Ali Parviz, Michael Craig, Michał Koziarski, Jiarui Lu, Zhaocheng Zhu, Cristian Gabellini, Kerstin Klaser, Josef Dean, Cas Wognum, Maciej Sypetkowski, Guillaume Rabusseau, Reihaneh Rabbany, Jian Tang, Christopher Morris, et al. (10 additional authors not shown)

Abstract: Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point of multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets shows improvement by also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.

Submitted 18 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

arXiv:2110.04126 [pdf, other]

3D Infomax improves GNNs for Molecular Property Prediction

Authors: Hannes Stärk, Dominique Beaini, Gabriele Corso, Prudencio Tossou, Christian Dallago, Stephan Günnemann, Pietro Liò

Abstract: Molecular property prediction is one of the fastest-growing applications of deep learning with critical real-world impacts. Including 3D molecular structure as input to learned models improves their performance for many molecular tasks. However, this information is infeasible to compute at the scale required by several real-world applications. We propose pre-training a model to reason about the geometry of molecules given only their 2D molecular graphs. Using methods from self-supervised learning, we maximize the mutual information between 3D summary vectors and the representations of a Graph Neural Network (GNN) such that they contain latent 3D information. During fine-tuning on molecules with unknown geometry, the GNN still generates implicit 3D information and can use it to improve downstream tasks. We show that 3D pre-training provides significant improvements for a wide range of properties, such as a 22% average MAE reduction on eight quantum mechanical properties. Moreover, the learned representations can be effectively transferred between datasets in different molecular spaces.

Submitted 4 June, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

Comments: 39th International Conference on Machine Learning (ICML 2022). Also accepted at NeurIPS 2021 ML4PH, AI4S, and SSL workshops and as oral at ELLIS ML4Molecules. 24 pages, 7 figures, 18 tables

Journal ref: 39th International Conference on Machine Learning (ICML 2022)

arXiv:2106.03893 [pdf, other]

Rethinking Graph Transformers with Spectral Attention

Authors: Devin Kreuzer, Dominique Beaini, William L. Hamilton, Vincent Létourneau, Prudencio Tossou

Abstract: In recent years, the Transformer architecture has proven to be very successful in sequence processing, but its application to other data structures, such as graphs, has remained limited due to the difficulty of properly defining positions. Here, we present the Spectral Attention Network (SAN), which uses a learned positional encoding (LPE) that can take advantage of the full Laplacian spectrum to learn the position of each node in a given graph. This LPE is then added to the node features of the graph and passed to a fully-connected Transformer. By leveraging the full spectrum of the Laplacian, our model is theoretically powerful in distinguishing graphs, and can better detect similar sub-structures from their resonance. Further, by fully connecting the graph, the Transformer does not suffer from over-squashing, an information bottleneck of most GNNs, and enables better modeling of physical phenomena such as heat transfer and electric interaction. When tested empirically on a set of 4 standard datasets, our model performs on par with or better than state-of-the-art GNNs, and outperforms any attention-based model by a wide margin, becoming the first fully-connected architecture to perform well on graph benchmarks.

Submitted 27 October, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

Comments: Accepted in Proceedings of NeurIPS 2021

arXiv:2005.07852 [pdf, other]

Geodesics in fibered latent spaces: A geometric approach to learning correspondences between conditions

Authors: Tariq Daouda, Reda Chhaibi, Prudencio Tossou, Alexandra-Chloé Villani

Abstract: This work introduces a geometric framework and a novel network architecture for creating correspondences between samples of different conditions. Under this formalism, the latent space is a fiber bundle stratified into a base space encoding conditions, and a fiber space encoding the variations within conditions. Furthermore, this latent space is endowed with a natural pull-back metric. The correspondences between conditions are obtained by minimizing an energy functional, resulting in diffeomorphism flows between fibers. We illustrate this approach using MNIST and Olivetti and benchmark its performance on the task of batch correction, which is the problem of integrating multiple biological datasets together.

Submitted 27 December, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

Comments: 36 pages, many figures. v1: Preliminary version. v2: Minor ref fix. v3: Submitted version with enhanced presentation

arXiv:1905.12131 [pdf, other]

Adaptive Deep Kernel Learning

Authors: Prudencio Tossou, Basile Dura, Francois Laviolette, Mario Marchand, Alexandre Lacoste

Abstract: Deep kernel learning provides an elegant and principled framework for combining the structural properties of deep learning algorithms with the flexibility of kernel methods. By means of a deep neural network, we learn a parametrized kernel operator that can be combined with a differentiable kernel algorithm during inference. While previous work within this framework has focused on learning a single kernel for large datasets, we learn a kernel family for a variety of few-shot regression tasks. Compared to single deep kernel learning, our algorithm enables the identification of the appropriate kernel for each task during inference. As such, it is well adapted for complex task distributions in a few-shot learning setting, which we demonstrate by comparing against existing state-of-the-art algorithms using real-world, few-shot regression tasks related to the field of drug discovery.

Submitted 11 December, 2020; v1 submitted 28 May, 2019; originally announced May 2019.

arXiv:1905.11577 [pdf, other]

Towards Interpretable Sparse Graph Representation Learning with Laplacian Pooling

Authors: Emmanuel Noutahi, Dominique Beaini, Julien Horwood, Sébastien Giguère, Prudencio Tossou

Abstract: Recent work in graph neural networks (GNNs) has led to improvements in molecular activity and property prediction tasks. Unfortunately, GNNs often fail to capture the relative importance of interactions between molecular substructures, in part due to the absence of efficient intermediate pooling steps. To address these issues, we propose LaPool (Laplacian Pooling), a novel, data-driven, and interpretable hierarchical graph pooling method that takes into account both node features and graph structure to improve molecular representation. We benchmark LaPool on molecular graph prediction and understanding tasks and show that it outperforms recent GNNs. Interestingly, LaPool also remains competitive on non-molecular tasks. Both quantitative and qualitative assessments are done to demonstrate LaPool's improved interpretability and highlight its potential benefits in drug design. Finally, we demonstrate LaPool's utility for the generation of valid and novel molecules by incorporating it into an adversarial autoencoder.

Submitted 2 April, 2020; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: 11 pages, with Appendices

Showing 1–8 of 8 results for author: Tossou, P