-
Improving computation efficiency using input and architecture features for a virtual screening application
Authors:
Gianmarco Accordi,
Emanuele Vitali,
Davide Gadioli,
Luigi Crisci,
Biagio Cosenza,
Mauro Bisson,
Massimiliano Fatica,
Andrea Beccari,
Gianluca Palermo
Abstract:
Virtual screening is an early stage of the drug discovery process that selects the most promising candidates. In the urgent computing scenario it is critical to find a solution in a short time frame. In this paper, we focus on a real-world virtual screening application to evaluate out-of-kernel optimizations, that consider input and architecture features to improve the computation efficiency on GPU. Experiment results on a modern supercomputer node show that we can almost double the performance. Moreover, we implemented the optimization using SYCL and it provides a consistent benefit with the CUDA optimization. A virtual screening campaign can use this gain in performance to increase the number of evaluated candidates, improving the probability of finding a drug.
Submitted 9 March, 2023;
originally announced March 2023.
-
Israel coordinates for all static spherically symmetric spacetimes with vanishing second Ricci invariant
Authors:
Yannick M. Bisson,
Kayll Lake
Abstract:
Static spherically symmetric spacetimes with vanishing second Ricci invariant constitute an important class of solutions to Einstein's equations and more generally as archetypes of regular black holes. When studying completeness one is most often presented with the Kruskal - Szekeres procedure. However, this procedure only works if the spacetime admits a single non-degenerate Killing horizon (a single bifurcation two-sphere). Here we generalize the Israel procedure to examine a constructive approach to completeness based entirely on the static spherically symmetric nature of spacetimes with a vanishing second Ricci invariant. It is shown by "block gluing" that the Israel procedure can cover two bifurcation two-spheres, but can fail with three. No coordinate transformations are used in this work.
Submitted 17 October, 2023; v1 submitted 10 February, 2023;
originally announced February 2023.
-
GPU-optimized Approaches to Molecular Docking-based Virtual Screening in Drug Discovery: A Comparative Analysis
Authors:
Emanuele Vitali,
Federico Ficarelli,
Mauro Bisson,
Davide Gadioli,
Massimiliano Fatica,
Andrea R. Beccari,
Gianluca Palermo
Abstract:
COVID-19 has shown the importance of having a fast response against pandemics. Finding a novel drug is a very long and complex procedure, and it is possible to accelerate the preliminary phases by using computer simulations. In particular, virtual screening is an in-silico phase that is needed to filter a large set of possible drug candidates to a manageable number. This paper presents the implementations and a comparative analysis of two GPU-optimized implementations of a virtual screening algorithm targeting novel GPU architectures. The first adopts a traditional approach that spreads the computation required to evaluate a single molecule across the entire GPU. The second uses a batched approach that exploits the parallel architecture of the GPU to evaluate more molecules in parallel, without considering the latency to process a single molecule. The paper describes the advantages and disadvantages of the proposed solutions, highlighting implementation details that impact the performance. Experimental results highlight the different performance of the two methods on several target molecule databases while running on NVIDIA A100 GPUs. The two implementations have a strong dependency with respect to the data to be processed. For both cases, the performance is improving while reducing the dimension of the target molecules (number of atoms and rotatable bonds). The two methods demonstrated a different behavior with respect to the size of the molecule database to be screened. While the latency one reaches sooner (with fewer molecules) the performance plateau in terms of throughput, the batched one requires a larger set of molecules. However, the performances after the initial transient period are much higher (up to 5x speed-up). Finally, to check the efficiency of both implementations we deeply analyzed their workload characteristics using the instruction roof-line methodology.
Submitted 12 September, 2022;
originally announced September 2022.
-
How we are leading a 3-XORSAT challenge: from the energy landscape to the algorithm and its efficient implementation on GPUs
Authors:
M. Bernaschi,
M. Bisson,
M. Fatica,
E. Marinari,
V. Martin-Mayor,
G. Parisi,
F. Ricci-Tersenghi
Abstract:
A recent 3-XORSAT challenge required minimizing a very complex and rough energy function, typical of glassy models with a random first-order transition and a golf-course-like energy landscape. We present the ideas behind the quasi-greedy algorithm and its very efficient implementation on GPUs that are allowing us to rank first in such a competition. We suggest a better protocol to compare algorithmic performances and we also provide analytical predictions about the exponential growth of the times to find the solution in terms of free-energy barriers.
Submitted 24 February, 2021; v1 submitted 18 February, 2021;
originally announced February 2021.
-
A Performance Study of the 2D Ising Model on GPUs
Authors:
Joshua Romero,
Mauro Bisson,
Massimiliano Fatica,
Massimo Bernaschi
Abstract:
The simulation of the two-dimensional Ising model is used as a benchmark to show the computational capabilities of Graphic Processing Units (GPUs). The rich programming environment now available on GPUs and flexible hardware capabilities allowed us to quickly experiment with several implementation ideas: a simple stencil-based algorithm, recasting the stencil operations into matrix multiplies to take advantage of Tensor Cores available on NVIDIA GPUs, and a highly optimized multi-spin coding approach. Using the managed memory API available in CUDA allows for simple and efficient distribution of these implementations across a multi-GPU NVIDIA DGX-2 server. We show that even a basic GPU implementation can outperform current results published on TPUs and that the optimized multi-GPU implementation can simulate very large lattices faster than custom FPGA solutions.
Submitted 14 June, 2019;
originally announced June 2019.
-
Parallel Distributed Breadth First Search on the Kepler Architecture
Authors:
Mauro Bisson,
Massimo Bernaschi,
Enrico Mastrostefano
Abstract:
We present the results obtained by using an evolution of our CUDA-based solution for the exploration, via a Breadth First Search, of large graphs. This latest version exploits at its best the features of the Kepler architecture and relies on a combination of techniques to reduce both the number of communications among the GPUs and the amount of exchanged data. The final result is a code that can visit more than 800 billion edges in a second by using a cluster equipped with 4096 Tesla K20X GPUs.
Submitted 23 December, 2014; v1 submitted 7 August, 2014;
originally announced August 2014.
-
GPU peer-to-peer techniques applied to a cluster interconnect
Authors:
Roberto Ammendola,
Massimo Bernaschi,
Andrea Biagioni,
Mauro Bisson,
Massimiliano Fatica,
Ottorino Frezza,
Francesca Lo Cicero,
Alessandro Lonardo,
Enrico Mastrostefano,
Pier Stanislao Paolucci,
Davide Rossetti,
Francesco Simula,
Laura Tosoratto,
Piero Vicini
Abstract:
Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, basically by avoiding staging to host memory, they require specific hardware features which are not available on current generation network adapters. In this paper we describe the architectural modifications required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class GPUs on an FPGA-based cluster interconnect. Besides, the current software implementation, which integrates this feature by minimally extending the RDMA programming model, is discussed, as well as some issues raised while employing it in a higher level API like MPI. Finally, the current limits of the technique are studied by analyzing the performance improvements on low-level benchmarks and on two GPU-accelerated applications, showing when and how they seem to benefit from the GPU peer-to-peer method.
Submitted 31 July, 2013;
originally announced July 2013.