-
MRHS multigrid solver for Wilson-clover fermions
Authors:
Daniel Richtmann,
Nils Meyer,
Tilo Wettig
Abstract:
We describe our implementation of a multigrid solver for Wilson-clover fermions, which increases parallelism by solving for multiple right-hand sides (MRHS) simultaneously. The solver is based on Grid and thus runs on all computing architectures supported by the Grid framework. We present detailed benchmarks of the relevant kernels, such as hopping and clover term on the various multigrid levels, intergrid operators, and reductions. The benchmarks were performed on the JUWELS Booster system at Jülich Supercomputing Centre, which is based on Nvidia A100 GPUs. For example, solving a $24^3\times128$ lattice on 16 GPUs, the overall speedup obtained solely from MRHS is about 10x.
Submitted 24 November, 2022;
originally announced November 2022.
-
Multigrid for Wilson Clover Fermions in Grid
Authors:
Daniel Richtmann,
Peter A. Boyle,
Tilo Wettig
Abstract:
With the ever-growing number of computing architectures, performance portability is an important aspect of (Lattice QCD) software. The Grid library provides a good framework for writing such code, as it thoroughly separates hardware-specific code from algorithmic functionality and already supports many modern architectures. We describe the implementation of a multigrid solver for Wilson clover fermions in Grid by the RQCD group. We present the features included in our implementation, discuss initial optimization efforts, and compare the performance with another multigrid implementation.
Submitted 18 April, 2019;
originally announced April 2019.
-
DD-$α$AMG on QPACE 3
Authors:
Peter Georg,
Daniel Richtmann,
Tilo Wettig
Abstract:
We describe our experience porting the Regensburg implementation of the DD-$α$AMG solver from QPACE 2 to QPACE 3. We first review how the code was ported from the first-generation Intel Xeon Phi processor (Knights Corner) to its successor (Knights Landing). We then describe the modifications in the communication library necessitated by the switch from InfiniBand to Omni-Path. Finally, we present the performance of the code on a single processor as well as the scaling on many nodes, where in both cases the speedup factor is close to the theoretical expectations.
Submitted 19 October, 2017;
originally announced October 2017.
-
pMR: A high-performance communication library
Authors:
Peter Georg,
Daniel Richtmann,
Tilo Wettig
Abstract:
On many parallel machines, the time LQCD applications spend in communication is a significant contribution to the total wall-clock time, especially in the strong-scaling limit. We present a novel high-performance communication library that can be used as a de facto drop-in replacement for MPI in existing software. Its lightweight nature, which avoids some of the unnecessary overhead introduced by MPI, allows us to improve the communication performance of applications without any algorithmic or complicated implementation changes. As a first real-world benchmark, we make use of the pMR library in the coarse-grid solve of the Regensburg implementation of the DD-$α$AMG algorithm. On realistic lattices, we see an improvement of a factor of 2 in pure communication time and total execution time savings of up to 20%.
Submitted 30 January, 2017;
originally announced January 2017.
-
Direct determinations of the nucleon and pion $σ$ terms at nearly physical quark masses
Authors:
Gunnar S. Bali,
Sara Collins,
Daniel Richtmann,
Andreas Schäfer,
Wolfgang Söldner,
André Sternbeck
Abstract:
We present a high statistics study of the pion and nucleon light and strange quark sigma terms using $N_f=2$ dynamical non-perturbatively improved clover fermions with a range of pion masses down to $m_π\sim 150$ MeV and several volumes, $Lm_π=3.4$ up to $6.7$, and lattice spacings, $a=0.06-0.08$ fm, enabling a study of finite volume and discretisation effects for $m_π\gtrsim 260$ MeV. Systematics are found to be reasonably under control. For the nucleon we obtain $σ_{πN}=35(6)$ MeV and $σ_s=35(12)$ MeV, or equivalently in terms of the quark fractions, $f_{T_u}=0.021(4)$, $f_{T_d}=0.016(4)$ and $f_{T_s}=0.037(13)$, where the errors include estimates of both the systematic and statistical uncertainties. These values, together with perturbative matching in the heavy quark limit, lead to $f_{T_c}=0.075(4)$, $f_{T_b}=0.072(2)$ and $f_{T_t}=0.070(1)$. In addition, through the use of the (inverse) Feynman-Hellmann theorem our results for $σ_{πN}$ are shown to be consistent with the nucleon masses determined in the analysis. For the pion we implement a method which greatly reduces excited state contamination to the scalar matrix elements from states travelling across the temporal boundary. This enables us to demonstrate the Gell-Mann-Oakes-Renner expectation $σ_π=m_π/2$ over our range of pion masses.
Submitted 6 May, 2016; v1 submitted 2 March, 2016;
originally announced March 2016.
-
Multiple right-hand-side setup for the DD-αAMG
Authors:
Daniel Richtmann,
Simon Heybrock,
Tilo Wettig
Abstract:
The setup cost of a modern solver such as DD-αAMG (Wuppertal Multigrid) is a significant contribution to the total time spent on solving the Dirac equation, and in HMC it can even be dominant. We present an improved implementation of this algorithm with modified computation order in the setup procedure. By processing multiple right-hand sides simultaneously we can alleviate many of the performance issues of the default single right-hand-side setup. The main improvements are as follows: By combining multiple right-hand sides the message size for off-chip communication is larger, which leads to better utilization of the network bandwidth. Many matrix-vector products are replaced by matrix-matrix products, leading to better cache reuse. The synchronization overhead inflicted by on-chip parallelization (threading), which is becoming crucial on many-core architectures such as the Intel Xeon Phi, is effectively reduced. In the parts implemented so far, we observe a speedup of roughly 3x compared to the optimized version of the single right-hand-side setup on realistic lattices.
Submitted 13 January, 2016;
originally announced January 2016.