Search | arXiv e-print repository

Improving computation efficiency using input and architecture features for a virtual screening application

Authors: Gianmarco Accordi, Emanuele Vitali, Davide Gadioli, Luigi Crisci, Biagio Cosenza, Mauro Bisson, Massimiliano Fatica, Andrea Beccari, Gianluca Palermo

Abstract: Virtual screening is an early stage of the drug discovery process that selects the most promising candidates. In the urgent computing scenario, it is critical to find a solution in a short time frame. In this paper, we focus on a real-world virtual screening application to evaluate out-of-kernel optimizations that consider input and architecture features to improve the computation efficiency on GPU. Experimental results on a modern supercomputer node show that we can almost double the performance. Moreover, we implemented the optimization using SYCL, and it provides a benefit consistent with the CUDA optimization. A virtual screening campaign can use this gain in performance to increase the number of evaluated candidates, improving the probability of finding a drug.

Submitted 9 March, 2023; originally announced March 2023.

arXiv:2209.05069

GPU-optimized Approaches to Molecular Docking-based Virtual Screening in Drug Discovery: A Comparative Analysis

Authors: Emanuele Vitali, Federico Ficarelli, Mauro Bisson, Davide Gadioli, Massimiliano Fatica, Andrea R. Beccari, Gianluca Palermo

Abstract: COVID-19 has shown the importance of having a fast response against pandemics. Finding a novel drug is a very long and complex procedure, and it is possible to accelerate the preliminary phases by using computer simulations. In particular, virtual screening is an in-silico phase that is needed to filter a large set of possible drug candidates to a manageable number. This paper presents and compares two GPU-optimized implementations of a virtual screening algorithm targeting novel GPU architectures. The first adopts a traditional approach that spreads the computation required to evaluate a single molecule across the entire GPU. The second uses a batched approach that exploits the parallel architecture of the GPU to evaluate more molecules in parallel, without considering the latency to process a single molecule. The paper describes the advantages and disadvantages of the proposed solutions, highlighting implementation details that impact the performance. Experimental results highlight the different performance of the two methods on several target molecule databases while running on NVIDIA A100 GPUs. Both implementations depend strongly on the data to be processed. In both cases, performance improves as the size of the target molecules (number of atoms and rotatable bonds) decreases. The two methods behave differently with respect to the size of the molecule database to be screened. While the latency-oriented one reaches the throughput plateau sooner (with fewer molecules), the batched one requires a larger set of molecules. However, its performance after the initial transient period is much higher (up to 5x speed-up). Finally, to check the efficiency of both implementations, we analyzed their workload characteristics in depth using the instruction roof-line methodology.

Submitted 12 September, 2022; originally announced September 2022.

arXiv:2103.15187

doi 10.1016/j.cpc.2021.108248

FSEI-GPU: GPU accelerated simulations of the fluid-structure-electrophysiology interaction in the left heart

Authors: Francesco Viola, Vamsi Spandan, Valentina Meschini, Joshua Romero, Massimiliano Fatica, Marco D. de Tullio, Roberto Verzicco

Abstract: The reliability of cardiovascular computational models depends on the accurate solution of the hemodynamics, the realistic characterization of the hyperelastic and electric properties of the tissues, and the correct description of their interaction. The resulting fluid-structure-electrophysiology interaction (FSEI) thus requires immense computational power, usually available only in large supercomputing centers, and takes a long time to produce results even if multi-CPU processors are used (MPI acceleration). In recent years, graphics processing units (GPUs) have emerged as a convenient platform for high performance computing, as they allow for considerable reductions of the time-to-solution. This approach is particularly appealing if the tool has to support medical decisions that require solutions within reduced times, possibly obtained with local computational resources. Accordingly, our multi-physics solver has been ported to GPU architectures using CUDA Fortran to tackle fast and accurate hemodynamics simulations of the human heart without resorting to large-scale supercomputers. This work describes the use of CUDA to accelerate the FSEI on heterogeneous clusters, where both the CPUs and GPUs are used synergistically with minor modifications of the original source code. The resulting GPU-accelerated code solves a single heartbeat within a few hours (from three to ten, depending on the grid resolution) running on an on-premises computing facility made of a few GPU cards, which can be easily installed in a medical laboratory or in a hospital, thus opening the way to systematic computational fluid dynamics (CFD) aided diagnostics.

Submitted 4 May, 2021; v1 submitted 28 March, 2021; originally announced March 2021.

arXiv:2103.13383

doi 10.1017/jfm.2021.727

One-point statistics for turbulent pipe flow up to $Re_\tau \approx 6000$

Authors: Sergio Pirozzoli, Joshua Romero, Massimiliano Fatica, Roberto Verzicco, Paolo Orlandi

Abstract: We study turbulent flows in a smooth straight pipe of circular cross-section up to $Re_\tau \approx 6000$ using direct numerical simulation (DNS) of the Navier-Stokes equations. The DNS results highlight systematic deviations from the Prandtl friction law, amounting to about $2\%$, which would extrapolate to about $4\%$ at extreme Reynolds numbers. Data fitting of the DNS friction coefficient yields an estimated von Kármán constant $k \approx 0.387$, which nicely fits the mean velocity profile, and which supports universality of canonical wall-bounded flows. The same constant also applies to the pipe centerline velocity, thus providing support for the claim that the asymptotic state of pipe flow at extreme Reynolds numbers should be plug flow. At the Reynolds numbers under scrutiny, no evidence for saturation of the logarithmic growth of the inner peak of the axial velocity variance is found. Although no outer peak of the velocity variance directly emerges in our DNS, we provide strong evidence that it should appear at $Re_\tau \gtrsim 10^4$, as a result of turbulence production exceeding dissipation over a large part of the outer wall layer, thus invalidating the classical equilibrium hypothesis.

Submitted 21 June, 2021; v1 submitted 24 March, 2021; originally announced March 2021.

arXiv:2102.09510

doi 10.1209/0295-5075/133/60005

How we are leading a 3-XORSAT challenge: from the energy landscape to the algorithm and its efficient implementation on GPUs

Authors: M. Bernaschi, M. Bisson, M. Fatica, E. Marinari, V. Martin-Mayor, G. Parisi, F. Ricci-Tersenghi

Abstract: A recent 3-XORSAT challenge required minimizing a very complex and rough energy function, typical of glassy models with a random first-order transition and a golf-course-like energy landscape. We present the ideas behind the quasi-greedy algorithm and its very efficient implementation on GPUs that are allowing us to rank first in this competition. We suggest a better protocol to compare algorithmic performance, and we also provide analytical predictions about the exponential growth of the time to find the solution in terms of free-energy barriers.

Submitted 24 February, 2021; v1 submitted 18 February, 2021; originally announced February 2021.

Comments: 7 pages, 7 figures, EPL format + SM (2 pages)

Journal ref: EPL, 133 (2021) 60005

arXiv:2001.05234

doi 10.1016/j.camwa.2020.01.002

GPU acceleration of CaNS for massively-parallel direct numerical simulations of canonical fluid flows

Authors: Pedro Costa, Everett Phillips, Luca Brandt, Massimiliano Fatica

Abstract: This work presents the GPU acceleration of the open-source code CaNS for very fast massively-parallel simulations of canonical fluid flows. The distinct feature of the many-CPU Navier-Stokes solver in CaNS is its fast direct solver for the second-order finite-difference Poisson equation, based on the method of eigenfunction expansions. The solver implements all the boundary conditions valid for this type of problem in a unified framework. Here, we extend the solver for GPU-accelerated clusters using CUDA Fortran. The porting makes extensive use of CUF kernels and has been greatly simplified by the unified memory feature of CUDA Fortran, which handles the data migration between host (CPU) and device (GPU) without defining new arrays in the source code. The overall implementation has been validated against benchmark data for turbulent channel flow and its performance assessed on an NVIDIA DGX-2 system (16 Tesla V100 32 GB, connected with NVLink via NVSwitch). The wall-clock time per time step of the GPU-accelerated implementation is impressively small when compared to its CPU implementation on state-of-the-art many-CPU clusters, as long as the domain partitioning is sufficiently small that the data resides mostly on the GPUs. The implementation has been made freely available and open-source under the terms of an MIT license.

Submitted 2 October, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

Journal ref: Computers & Mathematics with Applications 81 (2021) 502-511

arXiv:1906.06297

doi 10.1016/j.cpc.2020.107473

A Performance Study of the 2D Ising Model on GPUs

Authors: Joshua Romero, Mauro Bisson, Massimiliano Fatica, Massimo Bernaschi

Abstract: The simulation of the two-dimensional Ising model is used as a benchmark to show the computational capabilities of Graphics Processing Units (GPUs). The rich programming environment now available on GPUs and flexible hardware capabilities allowed us to quickly experiment with several implementation ideas: a simple stencil-based algorithm, recasting the stencil operations into matrix multiplies to take advantage of Tensor Cores available on NVIDIA GPUs, and a highly optimized multi-spin coding approach. Using the managed memory API available in CUDA allows for simple and efficient distribution of these implementations across a multi-GPU NVIDIA DGX-2 server. We show that even a basic GPU implementation can outperform current results published on TPUs and that the optimized multi-GPU implementation can simulate very large lattices faster than custom FPGA solutions.

Submitted 14 June, 2019; originally announced June 2019.

arXiv:1810.01993

Exascale Deep Learning for Climate Analytics

Authors: Thorsten Kurth, Sean Treichler, Joshua Romero, Mayur Mudigonda, Nathan Luehr, Everett Phillips, Ankur Mahesh, Michael Matheson, Jack Deslippe, Massimiliano Fatica, Prabhat, Michael Houston

Abstract: We extract pixel-level masks of extreme weather patterns using variants of Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and the network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained throughput of 21.0 PF/s and parallel efficiency of 79.0%. DeepLabv3+ scales up to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision. By taking advantage of the FP16 Tensor Cores, a half-precision version of the DeepLabv3+ network achieves a peak and sustained throughput of 1.13 EF/s and 999.0 PF/s respectively.

Submitted 3 October, 2018; originally announced October 2018.

Comments: 12 pages, 5 tables, 4 figures, Super Computing Conference November 11-16, 2018, Dallas, TX, USA

arXiv:1705.01423

doi 10.1016/j.cpc.2018.03.026

AFiD-GPU: a versatile Navier-Stokes Solver for Wall-Bounded Turbulent Flows on GPU Clusters

Authors: Xiaojue Zhu, Everett Phillips, Vamsi Spandan, John Donners, Gregory Ruetsch, Josh Romero, Rodolfo Ostilla-Mónico, Yantao Yang, Detlef Lohse, Roberto Verzicco, Massimiliano Fatica, Richard J. A. M. Stevens

Abstract: The AFiD code, an open source solver for the incompressible Navier-Stokes equations (http://www.afid.eu), has been ported to GPU clusters to tackle large-scale wall-bounded turbulent flow simulations. The GPU porting has been carried out in CUDA Fortran with the extensive use of kernel loop directives (CUF kernels) in order to have a source code as close as possible to the original CPU version; just a few routines have been manually rewritten. A new transpose scheme, which is not limited to the GPU version only and can be generally applied to any CFD code that uses pencil distributed parallelization, has been devised to improve the scaling of the Poisson solver, the main bottleneck of incompressible solvers. The GPU version can reduce the wall clock time by an order of magnitude compared to the CPU version for large meshes. Due to the increased performance and efficient use of memory, the GPU version of AFiD can perform simulations in parameter ranges that are unprecedented in thermally-driven wall-bounded turbulence. To verify the accuracy of the code, turbulent Rayleigh-Bénard convection and plane Couette flow are simulated, and the results are in good agreement with the experimental and computational data published in the previous literature.

Submitted 3 May, 2017; originally announced May 2017.

Comments: 33 pages, 8 figures, submitted

arXiv:1307.8276

GPU peer-to-peer techniques applied to a cluster interconnect

Authors: Roberto Ammendola, Massimo Bernaschi, Andrea Biagioni, Mauro Bisson, Massimiliano Fatica, Ottorino Frezza, Francesca Lo Cicero, Alessandro Lonardo, Enrico Mastrostefano, Pier Stanislao Paolucci, Davide Rossetti, Francesco Simula, Laura Tosoratto, Piero Vicini

Abstract: Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, basically by avoiding staging to host memory, they require specific hardware features which are not available on current generation network adapters. In this paper we describe the architectural modifications required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class GPUs on an FPGA-based cluster interconnect. Besides, the current software implementation, which integrates this feature by minimally extending the RDMA programming model, is discussed, as well as some issues raised while employing it in a higher level API like MPI. Finally, the current limits of the technique are studied by analyzing the performance improvements on low-level benchmarks and on two GPU-accelerated applications, showing when and how they seem to benefit from the GPU peer-to-peer method.

Submitted 31 July, 2013; originally announced July 2013.

Comments: paper accepted to CASS 2013

Showing 1–10 of 10 results for author: Fatica, M