-
ZSMILES: an approach for efficient SMILES storage for random access in Virtual Screening
Authors:
Gianmarco Accordi,
Davide Gadioli,
Giorgio Seguini,
Andrea R. Beccari,
Gianluca Palermo
Abstract:
Virtual screening is a technique used in drug discovery to select the most promising molecules to test in a lab. To perform virtual screening, we need a large set of molecules as input, and storing these molecules can become an issue. In fact, extreme-scale high-throughput virtual screening applications require a big dataset of input molecules and produce an even bigger dataset as output. These molecular databases occupy tens of TB of storage space, and domain experts frequently sample a small portion of this data. In this context, SMILES is a popular data format for storing large sets of molecules since it requires significantly less space to represent molecules than other formats (e.g., MOL2, SDF). This paper proposes an efficient dictionary-based approach to compress SMILES-based datasets. This approach takes advantage of domain knowledge to provide a readable output with separable SMILES, enabling random access. We examine the benefits of storing these datasets using ZSMILES to reduce the cold storage footprint in HPC systems. The main contributions concern a custom dictionary-based approach and a data pre-processing step. Experimental results show that ZSMILES leverages domain knowledge to compress 1.13x more than the state of the art in similar scenarios, reaching a compression ratio of up to $0.29$. We tested a CUDA version of ZSMILES targeting NVIDIA GPUs, showing a potential speedup of 7x.
Submitted 30 April, 2024;
originally announced April 2024.
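The dictionary-based idea described in the ZSMILES abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the dictionary entries, the replacement code points, and the function names below are hypothetical and chosen only for illustration. Each SMILES line is compressed on its own, so lines stay separable and a single molecule can be decoded without touching its neighbours. The ratio is computed as compressed size over original size, so lower is better.

```python
# Hypothetical toy dictionary: frequent SMILES substrings mapped to single
# code points that never occur in SMILES. The real ZSMILES dictionary is
# not reproduced here.
DICTIONARY = {
    "c1ccccc1": "\u00e0",
    "C(=O)": "\u00e1",
    "CC": "\u00e2",
}
# Try the longest patterns first so greedy matching prefers larger savings.
_PATTERNS = sorted(DICTIONARY, key=len, reverse=True)
_REVERSE = {code: pattern for pattern, code in DICTIONARY.items()}


def compress_line(smiles: str) -> str:
    """Greedily replace dictionary patterns in one SMILES string."""
    out = []
    i = 0
    while i < len(smiles):
        for pattern in _PATTERNS:
            if smiles.startswith(pattern, i):
                out.append(DICTIONARY[pattern])
                i += len(pattern)
                break
        else:
            # No pattern matches here: copy the character unchanged.
            out.append(smiles[i])
            i += 1
    return "".join(out)


def decompress_line(packed: str) -> str:
    """Expand every code point back to its pattern."""
    return "".join(_REVERSE.get(ch, ch) for ch in packed)


def compression_ratio(original: list[str], packed: list[str]) -> float:
    """Compressed size / original size, counted in characters."""
    return sum(map(len, packed)) / sum(map(len, original))


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
packed = compress_line(aspirin)
ratio = compression_ratio([aspirin], [packed])
```

Because no replacement code is a newline, a file of compressed lines can still be split line by line, which is what makes random access to individual molecules possible in this sketch.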
-
Tunable and Portable Extreme-Scale Drug Discovery Platform at Exascale: the LIGATE Approach
Authors:
Gianluca Palermo,
Gianmarco Accordi,
Davide Gadioli,
Emanuele Vitali,
Cristina Silvano,
Bruno Guindani,
Danilo Ardagna,
Andrea R. Beccari,
Domenico Bonanni,
Carmine Talarico,
Filippo Lunghini,
Jan Martinovic,
Paulo Silva,
Ada Bohm,
Jakub Beranek,
Jan Krenek,
Branislav Jansik,
Luigi Crisci,
Biagio Cosenza,
Peter Thoman,
Philip Salzmann,
Thomas Fahringer,
Leila Alexander,
Gerardo Tauriello, et al. (10 additional authors not shown)
Abstract:
Today's digital revolution is having a dramatic impact on the pharmaceutical industry and the entire healthcare system. The implementation of machine learning, extreme-scale computer simulations, and big data analytics in the drug design and development process offers an excellent opportunity to lower the risk of investment and reduce the time to the patient.
Within the LIGATE project, we aim to integrate, extend, and co-design best-in-class European components to design Computer-Aided Drug Design (CADD) solutions exploiting today's high-end supercomputers and tomorrow's Exascale resources, fostering European competitiveness in the field.
The proposed LIGATE solution is a fully integrated workflow that delivers the results of a virtual screening campaign for drug discovery with the highest speed and the highest accuracy. The full automation of the solution and the possibility of running it on multiple supercomputing centers at once make it possible to run an extreme-scale in silico drug discovery campaign in a few days, for example to respond promptly to a worldwide pandemic crisis.
Submitted 19 April, 2023;
originally announced April 2023.
-
GPU-optimized Approaches to Molecular Docking-based Virtual Screening in Drug Discovery: A Comparative Analysis
Authors:
Emanuele Vitali,
Federico Ficarelli,
Mauro Bisson,
Davide Gadioli,
Massimiliano Fatica,
Andrea R. Beccari,
Gianluca Palermo
Abstract:
COVID-19 has shown the importance of having a fast response against pandemics. Finding a novel drug is a very long and complex procedure, and it is possible to accelerate the preliminary phases by using computer simulations. In particular, virtual screening is an in-silico phase that is needed to filter a large set of possible drug candidates to a manageable number. This paper presents and compares two GPU-optimized implementations of a virtual screening algorithm targeting novel GPU architectures. The first adopts a traditional approach that spreads the computation required to evaluate a single molecule across the entire GPU. The second uses a batched approach that exploits the parallel architecture of the GPU to evaluate more molecules in parallel, without considering the latency to process a single molecule. The paper describes the advantages and disadvantages of the proposed solutions, highlighting implementation details that impact the performance. Experimental results highlight the different performance of the two methods on several target molecule databases while running on NVIDIA A100 GPUs. Both implementations depend strongly on the data to be processed. In both cases, performance improves as the size of the target molecules (number of atoms and rotatable bonds) decreases. The two methods behave differently with respect to the size of the molecule database to be screened. While the latency-oriented one reaches its throughput plateau sooner (with fewer molecules), the batched one requires a larger set of molecules. However, the performance of the batched one after the initial transient period is much higher (up to 5x speed-up). Finally, to check the efficiency of both implementations, we analyzed their workload characteristics in depth using the instruction roofline methodology.
Submitted 12 September, 2022;
originally announced September 2022.
-
GENEOnet: A new machine learning paradigm based on Group Equivariant Non-Expansive Operators. An application to protein pocket detection
Authors:
Giovanni Bocchi,
Patrizio Frosini,
Alessandra Micheletti,
Alessandro Pedretti,
Carmen Gratteri,
Filippo Lunghini,
Andrea Rosario Beccari,
Carmine Talarico
Abstract:
Nowadays there is a big spotlight on the development of techniques for explainable machine learning. Here we introduce a new computational paradigm based on Group Equivariant Non-Expansive Operators, which can be regarded as the product of a rising mathematical theory of information-processing observers. This approach, which can be adjusted to different situations, may have many advantages over other common tools, such as Neural Networks: knowledge injection and information engineering, selection of relevant features, a small number of parameters, and higher transparency. We chose to test our method, called GENEOnet, on a key problem in drug design: detecting pockets on the surface of proteins that can host ligands. Experimental results confirmed that our method works well even with a quite small training set, thus providing a great computational advantage, while the final comparison with other state-of-the-art methods shows that GENEOnet provides better or comparable results in terms of accuracy.
Submitted 31 January, 2022;
originally announced February 2022.
-
EXSCALATE: An extreme-scale in-silico virtual screening platform to evaluate 1 trillion compounds in 60 hours on 81 PFLOPS supercomputers
Authors:
Davide Gadioli,
Emanuele Vitali,
Federico Ficarelli,
Chiara Latini,
Candida Manelfi,
Carmine Talarico,
Cristina Silvano,
Carlo Cavazzoni,
Gianluca Palermo,
Andrea Rosario Beccari
Abstract:
The social and economic impact of the COVID-19 pandemic demands the reduction of the time required to find a therapeutic cure. In the context of urgent computing, we re-designed the Exscalate molecular docking platform to benefit from heterogeneous computation nodes and to avoid scaling issues. We deployed the Exscalate platform on two top European supercomputers (CINECA-Marconi100 and ENI-HPC5), with a combined computational power of 81 PFLOPS, to evaluate the interaction between 70 billion small molecules and 15 binding sites of 12 viral proteins of SARS-CoV-2. The experiment lasted 60 hours and performed a trillion evaluations overall.
Submitted 22 October, 2021;
originally announced October 2021.
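The figures quoted in the EXSCALATE abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that every molecule was evaluated against every binding site, which the abstract does not state explicitly but which is consistent with its "trillion evaluations"; the variable names are illustrative only.

```python
# Figures taken from the abstract.
molecules = 70_000_000_000      # 70 billion small molecules
binding_sites = 15              # binding sites across 12 viral proteins
hours = 60                      # duration of the experiment

# Assumption: each molecule is evaluated once against each binding site.
evaluations = molecules * binding_sites

# Average sustained rate over the whole campaign.
seconds = hours * 3600
evaluations_per_second = evaluations / seconds
```

Under that assumption the campaign amounts to about 1.05 trillion evaluations, or an average of several million evaluations per second across both machines.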
-
Tunable Approximations to Control Time-to-Solution in an HPC Molecular Docking Mini-App
Authors:
Davide Gadioli,
Gianluca Palermo,
Stefano Cherubin,
Emanuele Vitali,
Giovanni Agosta,
Candida Manelfi,
Andrea R. Beccari,
Carlo Cavazzoni,
Nico Sanna,
Cristina Silvano
Abstract:
The drug discovery process involves several tasks to be performed in vivo, in vitro and in silico. Molecular docking is a task typically performed in silico. It aims at finding the three-dimensional pose of a given molecule when it interacts with the target protein binding site. This task is often done for the virtual screening of a huge set of molecules to find the most promising ones, which will be forwarded to the later stages of the drug discovery process. Given the huge complexity of the problem, molecular docking cannot be solved by exploring the entire space of the ligand poses. State-of-the-art approaches tackle the problem by sampling the space of the ligand poses to generate results within a reasonable time budget. In this work, we improve the geometric approach to molecular docking by introducing tunable approximations. In particular, we analyzed and enriched the original implementation with tunable software knobs to explore and control the performance-accuracy tradeoffs. We modeled the time-to-solution of the virtual screening task as a function of software knobs, input data features, and available computational resources. As a result, the application can autotune its configuration according to a user-defined time budget. We used a Mini-App derived from LiGenDock (a state-of-the-art molecular docking application) to validate the proposed approach. We ran the enhanced Mini-App on an HPC system using a very large database of pockets and ligands. The proposed approach exposes a time-to-solution interval spanning more than one order of magnitude with accuracy degradation of up to 30% and, more generally, provides different accuracy levels according to the needs of the virtual screening campaign.
Submitted 18 January, 2019;
originally announced January 2019.
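The autotuning idea in the Mini-App abstract (pick a knob configuration that fits a user-defined time budget) can be sketched as follows. Everything here is an assumption made for illustration: the linear time model, the `KnobSetting` type, the function names, and the numbers are invented and are not taken from the paper or from LiGenDock.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class KnobSetting:
    name: str
    seconds_per_ligand: float   # hypothetical per-ligand cost on one node
    relative_accuracy: float    # 1.0 means no approximation


def predicted_time(setting: KnobSetting, n_ligands: int, n_nodes: int) -> float:
    """Toy time-to-solution model: work is split evenly across nodes."""
    return setting.seconds_per_ligand * n_ligands / n_nodes


def pick_setting(settings: Sequence[KnobSetting], n_ligands: int,
                 n_nodes: int, budget_s: float) -> Optional[KnobSetting]:
    """Return the most accurate setting predicted to fit in the budget."""
    feasible = [s for s in settings
                if predicted_time(s, n_ligands, n_nodes) <= budget_s]
    if not feasible:
        return None  # no configuration meets the budget
    return max(feasible, key=lambda s: s.relative_accuracy)


# Made-up settings, shaped only to resemble the trade-off in the abstract.
SETTINGS = [
    KnobSetting("exact", 0.10, 1.00),
    KnobSetting("medium", 0.03, 0.85),
    KnobSetting("fast", 0.008, 0.70),
]

choice = pick_setting(SETTINGS, n_ligands=1_000_000, n_nodes=10, budget_s=3600)
```

With these invented numbers, a one-hour budget rules out the exact setting (predicted 10,000 s) and selects the medium one (predicted about 3,000 s). A real model would also account for input data features, as the abstract notes.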
-
The ANTAREX Domain Specific Language for High Performance Computing
Authors:
Cristina Silvano,
Giovanni Agosta,
Andrea Bartolini,
Andrea R. Beccari,
Luca Benini,
Loïc Besnard,
João Bispo,
Radim Cmar,
João M. P. Cardoso,
Carlo Cavazzoni,
Daniele Cesarini,
Stefano Cherubin,
Federico Ficarelli,
Davide Gadioli,
Martin Golasowski,
Antonio Libri,
Jan Martinovič,
Gianluca Palermo,
Pedro Pinto,
Erven Rohou,
Kateřina Slaninová,
Emanuele Vitali
Abstract:
The ANTAREX project relies on a Domain Specific Language (DSL) based on Aspect Oriented Programming (AOP) concepts to allow applications to enforce extra-functional properties such as energy efficiency and performance and to optimize Quality of Service (QoS) in an adaptive way. The DSL approach allows the definition of energy-efficiency, performance, and adaptivity strategies as well as their enforcement at runtime through application autotuning and resource and power management. In this paper, we present an overview of the key outcome of the project, the ANTAREX DSL, and some of its capabilities through a number of examples, including how the DSL is applied in the context of the project use cases.
Submitted 18 January, 2019;
originally announced January 2019.