-
Resilience-by-Design Concepts for 6G Communication Networks
Authors:
Ladan Khaloopour,
Yanpeng Su,
Florian Raskob,
Tobias Meuser,
Roland Bless,
Leon Würsching,
Kamyar Abedi,
Marko Andjelkovic,
Hekma Chaari,
Pousali Chakraborty,
Michael Kreutzer,
Matthias Hollick,
Thorsten Strufe,
Norman Franchi,
Vahid Jamali
Abstract:
The sixth generation (6G) mobile communication networks are expected to intelligently integrate into various aspects of modern digital society, including smart cities, homes, healthcare, transportation, and factories. While 6G offers a multitude of services, societies are likely to become increasingly reliant on 6G infrastructure. Any disruption to these digital services, whether due to human or technical failures, natural disasters, or terrorism, would significantly impact citizens' daily lives. Hence, 6G networks need not only to provide high-performance services but also to be resilient in maintaining essential services in the face of potentially unknown challenges. This paper introduces a comprehensive concept for designing resilient 6G communication networks, summarizing our initial studies within the German Open6GHub project. Adopting an interdisciplinary approach, we propose to embed physical and cyber resilience across all communication system layers, addressing electronics, physical channel, network components and functions, networks, services, and cross-layer and cross-infrastructure considerations. After reviewing the background on resilience concepts, definitions, and approaches, we introduce the proposed resilience-by-design (RBD) concept for 6G communication networks. We further elaborate on the proposed RBD concept along with selected 6G use-cases and present various open problems for future research on 6G resilience.
Submitted 24 May, 2024;
originally announced May 2024.
-
MLComp: A Methodology for Machine Learning-based Performance Estimation and Adaptive Selection of Pareto-Optimal Compiler Optimization Sequences
Authors:
Alessio Colucci,
Dávid Juhász,
Martin Mosbeck,
Alberto Marchisio,
Semeen Rehman,
Manfred Kreutzer,
Guenther Nadbath,
Axel Jantsch,
Muhammad Shafique
Abstract:
Embedded systems have proliferated in various consumer and industrial applications with the evolution of Cyber-Physical Systems and the Internet of Things. These systems are subjected to stringent constraints so that embedded software must be optimized for multiple objectives simultaneously, namely reduced energy consumption, execution time, and code size. Compilers offer optimization phases to improve these metrics. However, proper selection and ordering of them depends on multiple factors and typically requires expert knowledge. State-of-the-art optimizers facilitate different platforms and applications case by case, and they are limited by optimizing one metric at a time, as well as requiring a time-consuming adaptation for different targets through dynamic profiling.
To address these problems, we propose the novel MLComp methodology, in which optimization phases are sequenced by a Reinforcement Learning-based policy. Training of the policy is supported by Machine Learning-based analytical models for quick performance estimation, thereby drastically reducing the time spent for dynamic profiling. In our framework, different Machine Learning models are automatically tested to choose the best-fitting one. The trained Performance Estimator model is leveraged to efficiently devise Reinforcement Learning-based multi-objective policies for creating quasi-optimal phase sequences.
Compared to state-of-the-art estimation models, our Performance Estimator model achieves lower relative error (<2%) with up to 50x faster training time over multiple platforms and application domains. Our Phase Selection Policy improves execution time and energy consumption of a given code by up to 12% and 6%, respectively. The Performance Estimator and the Phase Selection Policy can be trained efficiently for any target platform and application domain.
Submitted 11 December, 2020; v1 submitted 9 December, 2020;
originally announced December 2020.
-
Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs
Authors:
Moritz Kreutzer,
Georg Hager,
Dominik Ernst,
Holger Fehske,
Alan R. Bishop,
Gerhard Wellein
Abstract:
Chebyshev filter diagonalization is well established in quantum chemistry and quantum physics to compute bulks of eigenvalues of large sparse matrices. Choosing a block vector implementation, we investigate optimization opportunities on the new class of high-performance compute devices featuring both high-bandwidth and low-bandwidth memory. We focus on the transparent access to the full address space supported by both architectures under consideration: Intel Xeon Phi "Knights Landing" and Nvidia "Pascal."
We propose two optimizations: (1) Subspace blocking is applied for improved performance and data access efficiency. We also show that it allows transparently handling problems much larger than the high-bandwidth memory without significant performance penalties. (2) Pipelining of communication and computation phases of successive subspaces is implemented to hide communication costs without extra memory traffic.
As an application scenario we use filter diagonalization studies on topological insulator materials. Performance numbers on up to 512 nodes of the OakForest-PACS and Piz Daint supercomputers are presented, achieving beyond 100 Tflop/s for computing 100 inner eigenvalues of sparse matrices of dimension one billion.
Submitted 6 March, 2018;
originally announced March 2018.
-
CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
Authors:
Faisal Shahzad,
Jonas Thies,
Moritz Kreutzer,
Thomas Zeiser,
Georg Hager,
Gerhard Wellein
Abstract:
In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency, but it takes a lot of implementation effort. This work presents the implementation of our C++-based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be directly used out of the box. The library can be easily extended to add more data types. As a means of overhead reduction, the library offers a built-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node-level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of the failure detection and communication recovery mechanisms. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks.
Submitted 7 August, 2017;
originally announced August 2017.
-
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
Authors:
Moritz Kreutzer,
Jonas Thies,
Melven Röhrig-Zöllner,
Andreas Pieper,
Faisal Shahzad,
Martin Galgon,
Achim Basermann,
Holger Fehske,
Georg Hager,
Gerhard Wellein
Abstract:
While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack.
Submitted 15 February, 2016; v1 submitted 29 July, 2015;
originally announced July 2015.
-
Building a fault tolerant application using the GASPI communication layer
Authors:
Faisal Shahzad,
Moritz Kreutzer,
Thomas Zeiser,
Rui Machado,
Andreas Pieper,
Georg Hager,
Gerhard Wellein
Abstract:
It is commonly agreed that highly parallel software on Exascale computers will suffer from many more runtime failures due to the decreasing trend in the mean time to failure (MTTF). Therefore, it is not surprising that a lot of research is going on in the area of fault tolerance and fault mitigation. Applications should survive a failure and/or be able to recover with minimal cost. MPI is not yet very mature in handling failures; the User-Level Failure Mitigation (ULFM) proposal, currently the most promising approach, is still in its prototype phase. In our work we use GASPI, which is a relatively new communication library based on the PGAS model. It provides the missing features to allow the design of fault-tolerant applications. Instead of introducing algorithm-based fault tolerance in its true sense, we demonstrate how we can build on (existing) clever checkpointing and extend applications to integrate a low-cost fault detection mechanism and, if necessary, recover the application on the fly. The aspects of process management, the restoration of groups, and the recovery mechanism are presented in detail. We use a sparse matrix vector multiplication based application to perform the analysis of the overhead introduced by such modifications. Our fault detection mechanism causes no overhead in failure-free cases, whereas in case of failure(s), the failure detection and recovery cost is of reasonably acceptable order and shows good scalability.
Submitted 18 May, 2015;
originally announced May 2015.
-
Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems
Authors:
Moritz Kreutzer,
Georg Hager,
Gerhard Wellein,
Andreas Pieper,
Andreas Alvermann,
Holger Fehske
Abstract:
The Kernel Polynomial Method (KPM) is a well-established scheme in quantum physics and quantum chemistry to determine the eigenvalue density and spectral properties of large sparse matrices. In this work we demonstrate the high optimization potential and feasibility of peta-scale heterogeneous CPU-GPU implementations of the KPM. At the node level we show that it is possible to decouple the sparse matrix problem posed by KPM from main memory bandwidth both on CPU and GPU. To alleviate the effects of scattered data access we combine loosely coupled outer iterations with tightly coupled block sparse matrix multiple vector operations, which enables pure data streaming. All optimizations are guided by a performance analysis and modelling process that indicates how the computational bottlenecks change with each optimization step. Finally we use the optimized node-level KPM with a hybrid-parallel framework to perform large scale heterogeneous electronic structure calculations for novel topological materials on a petascale-class Cray XC30 system.
Submitted 29 July, 2015; v1 submitted 20 October, 2014;
originally announced October 2014.
-
A unified sparse matrix data format for efficient general sparse matrix-vector multiply on modern processors with wide SIMD units
Authors:
Moritz Kreutzer,
Georg Hager,
Gerhard Wellein,
Holger Fehske,
Alan R. Bishop
Abstract:
Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-C-sigma, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from General Purpose Graphics Processing Units (GPGPUs) and vector computer programming. We discuss the advantages of SELL-C-sigma compared to established formats like Compressed Row Storage (CRS) and ELLPACK and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models we develop deep insight into the data transfer properties of the SELL-C-sigma spMVM kernel. SELL-C-sigma comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent ("catch-all") sparse matrix format, which achieves very high efficiency for all test matrices across all hardware platforms.
Submitted 5 March, 2014; v1 submitted 23 July, 2013;
originally announced July 2013.
-
Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation
Authors:
Moritz Kreutzer,
Georg Hager,
Gerhard Wellein,
Holger Fehske,
Achim Basermann,
Alan R. Bishop
Abstract:
Sparse matrix-vector multiplication (spMVM) is the dominant operation in many sparse solvers. We investigate performance properties of spMVM with matrices of various sparsity patterns on the nVidia "Fermi" class of GPGPUs. A new "padded jagged diagonals storage" (pJDS) format is proposed which may substantially reduce the memory overhead intrinsic to the widespread ELLPACK-R scheme. In our test scenarios the pJDS format cuts the overall spMVM memory footprint on the GPGPU by up to 70%, and achieves 95% to 130% of the ELLPACK-R performance. Using a suitable performance model we identify performance bottlenecks on the node level that invalidate some types of matrix structures for efficient multi-GPGPU parallelization. For appropriate sparsity patterns we extend previous work on distributed-memory parallel spMVM to demonstrate a scalable hybrid MPI-GPGPU code, achieving efficient overlap of communication and computation.
Submitted 29 February, 2012; v1 submitted 23 December, 2011;
originally announced December 2011.