-
Parallel solution of saddle point systems with nested iterative solvers based on the Golub-Kahan Bidiagonalization
Authors:
Carola Kruse,
Masha Sosonkina,
Mario Arioli,
Nicolas Tardieu,
Ulrich Ruede
Abstract:
We present a scalability study of Golub-Kahan bidiagonalization for the parallel iterative solution of symmetric indefinite linear systems with a 2x2 block structure. The algorithms have been implemented within the parallel numerical library PETSc. Since a nested inner-outer iteration strategy may be necessary, we investigate different choices for the inner solvers, including parallel sparse direc…
▽ More
We present a scalability study of Golub-Kahan bidiagonalization for the parallel iterative solution of symmetric indefinite linear systems with a 2x2 block structure. The algorithms have been implemented within the parallel numerical library PETSc. Since a nested inner-outer iteration strategy may be necessary, we investigate different choices for the inner solvers, including parallel sparse direct and multigrid accelerated iterative methods. We show the strong and weak scalability of the Golub-Kahan bidiagonalization based iterative method when applied to a two-dimensional Poiseuille flow and to two- and three-dimensional Stokes test problems.
△ Less
Submitted 28 January, 2020;
originally announced January 2020.
-
Experimentation Procedure for Offloaded Mini-Apps Executed on Cluster Architectures with Xeon Phi Accelerators
Authors:
Gary Lawson,
Vaibhav Sundriyal,
Masha Sosonkina,
Yuzhong Shen
Abstract:
A heterogeneous cluster architecture is complex. It contains hundreds, or thousands of devices connected by a tiered communication system in order to solve a problem. As a heterogeneous system, these devices will have varying performance capabilities. To better understand the interactions which occur between the various devices during execution, an experimentation procedure has been devised to cap…
▽ More
A heterogeneous cluster architecture is complex. It contains hundreds, or thousands of devices connected by a tiered communication system in order to solve a problem. As a heterogeneous system, these devices will have varying performance capabilities. To better understand the interactions which occur between the various devices during execution, an experimentation procedure has been devised to capture, store, and analyze important and meaningful data. The procedure consists of various tools, techniques, and methods for capturing relevant timing, power, and performance data for a typical execution. This procedure currently applies to architectures with Intel Xeon processors and Intel Xeon Phi accelerators. It has been applied to the Co-Design Molecular Dynamics mini-app, courtesy of the ExMatEx team. This work aims to provide end-users with a strategy for investigating codes executed on heterogeneous cluster architectures with Xeon Phi accelerators.
△ Less
Submitted 14 September, 2015; v1 submitted 7 September, 2015;
originally announced September 2015.
-
Using the VBARMS method in parallel computing
Authors:
Bruno Carpentieri,
Jia Liao,
Masha Sosonkina,
Aldo Bonfiglioli
Abstract:
The paper describes an improved parallel MPI-based implementation of VBARMS, a variable block variant of the pARMS preconditioner proposed by Li,~Saad and Sosonkina [NLAA, 2003] for solving general nonsymmetric linear systems. The parallel VBARMS solver can detect automatically exact or approximate dense structures in the linear system, and exploits this information to achieve improved reliability…
▽ More
The paper describes an improved parallel MPI-based implementation of VBARMS, a variable block variant of the pARMS preconditioner proposed by Li,~Saad and Sosonkina [NLAA, 2003] for solving general nonsymmetric linear systems. The parallel VBARMS solver can detect automatically exact or approximate dense structures in the linear system, and exploits this information to achieve improved reliability and increased throughput during the factorization. A novel graph compression algorithm is discussed that finds these approximate dense blocks structures and requires only one simple to use parameter. A complete study of the numerical and parallel performance of parallel VBARMS is presented for the analysis of large turbulent Navier-Stokes equations on a suite of three-dimensional test cases.
△ Less
Submitted 10 August, 2015;
originally announced August 2015.
-
Towards Modeling Energy Consumption of Xeon Phi
Authors:
Gary Lawson,
Masha Sosonkina,
Yuzhong Shen
Abstract:
In the push for exascale computing, energy efficiency is of utmost concern. System architectures often adopt accelerators to hasten application execution at the cost of power. The Intel Xeon Phi co-processor is unique accelerator that offers application designers high degrees of parallelism, energy-efficient cores, and various execution modes. To explore the vast number of available configurations…
▽ More
In the push for exascale computing, energy efficiency is of utmost concern. System architectures often adopt accelerators to hasten application execution at the cost of power. The Intel Xeon Phi co-processor is unique accelerator that offers application designers high degrees of parallelism, energy-efficient cores, and various execution modes. To explore the vast number of available configurations, a model must be developed to predict execution time, power, and energy for the CPU and Xeon Phi. An experimentation method has been developed which measures power for the CPU and Xeon Phi separately, as well as total system power. Execution time and performance are also captured for two experiments conducted in this work. The experiments, frequency scaling and strong scaling, will help validate the adopted model and assist in the development of a model which defines the host and Xeon Phi. The proxy applications investigated, representative of large-scale real-world applications, are Co-Design Molecular Dynamics (CoMD) and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). The frequency experiment discussed in this work is used to determine the time on-chip and off-chip to measure the compute- or latencyboundedness of the application. Energy savings were not obtained in symmetric mode for either application.
△ Less
Submitted 25 May, 2015;
originally announced May 2015.