-
The AMD Rome Memory Barrier
Authors:
Phillip Allen Lane,
Jessica Lobrano
Abstract:
With the rapid growth of AMD as a competitor in the CPU industry, it is imperative that high-performance and architectural engineers analyze new AMD CPUs. By understanding new and unfamiliar architectures, engineers are able to adapt their algorithms to fully utilize new hardware. Furthermore, engineers are able to anticipate the limitations of an architecture and determine when an alternate platform is desirable for a particular workload. This paper presents results which show that the AMD "Rome" architecture performance suffers once an application's memory bandwidth exceeds 37.5 GiB/s for integer-heavy applications, or 100 GiB/s for floating-point-heavy workloads. Strong positive correlations between memory bandwidth and CPI are presented, as well as strong positive correlations between increased memory load and time-to-completion of benchmarks from the SPEC CPU2017 benchmark suites.
Submitted 21 November, 2022;
originally announced November 2022.
-
A Genetic Algorithm-based Framework for Learning Statistical Power Manifold
Authors:
Abhishek K. Umrawal,
Sean P. Lane,
Erin P. Hennes
Abstract:
Statistical power is a measure of the replicability of a categorical hypothesis test. Formally, it is the probability of detecting an effect, if there is a true effect present in the population. Hence, optimizing statistical power as a function of some parameters of a hypothesis test is desirable. However, for most hypothesis tests, the explicit functional form of statistical power for individual model parameters is unknown; but calculating power for a given set of values of those parameters is possible using simulated experiments. These simulated experiments are usually computationally expensive. Hence, developing the entire statistical power manifold using simulations can be very time-consuming. We propose a novel genetic algorithm-based framework for learning statistical power manifolds. For a multiple linear regression $F$-test, we show that the proposed algorithm/framework learns the statistical power manifold much faster as compared to a brute-force approach as the number of queries to the power oracle is significantly reduced. We also show that the quality of learning the manifold improves as the number of iterations increases for the genetic algorithm. Such tools are useful for evaluating statistical power trade-offs when researchers have little information regarding a priori best guesses of primary effect sizes of interest or how sampling variability in non-primary effects impacts power for primary ones.
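The abstract notes that power at a given parameter setting can be obtained from simulated experiments, and that each such query is expensive. The following is a minimal sketch of one such query, assuming a two-sided one-sample z-test with known unit variance as a stand-in for the multiple linear regression $F$-test studied in the paper. The function name `estimate_power` and all of its parameters are hypothetical and do not come from the paper.

```python
import random
from statistics import NormalDist


def estimate_power(effect_size, n, alpha=0.05, n_sims=2000, seed=0):
    """Monte Carlo power estimate for a two-sided one-sample z-test.

    Assumption: observations are N(effect_size, 1) and the null
    hypothesis is mean = 0. This is an illustrative stand-in, not
    the regression F-test used in the paper.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(n_sims):
        # One simulated experiment: draw n observations, compute the z statistic.
        sample_mean = sum(rng.gauss(effect_size, 1.0) for _ in range(n)) / n
        z = sample_mean * n ** 0.5
        if abs(z) > z_crit:
            rejections += 1
    # Power is estimated as the fraction of simulated experiments that reject H0.
    return rejections / n_sims


if __name__ == "__main__":
    power = estimate_power(effect_size=0.5, n=30)
    print(f"estimated power: {power:.3f}")
```

Each call runs `n_sims` full experiments, so evaluating a dense grid of (effect size, sample size) pairs multiplies that cost by the grid size. This is the brute-force cost that a search method with fewer oracle queries would try to avoid.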
Submitted 18 February, 2023; v1 submitted 1 September, 2022;
originally announced September 2022.
-
Heterogeneous Sparse Matrix-Vector Multiplication via Compressed Sparse Row Format
Authors:
Phillip Allen Lane,
Joshua Dennis Booth
Abstract:
Sparse matrix-vector multiplication (SpMV) is one of the most important kernels in high-performance computing (HPC), yet SpMV normally suffers from ill performance on many devices. Due to ill performance, SpMV normally requires special care to store and tune for a given device. Moreover, HPC is facing heterogeneous hardware containing multiple different compute units, e.g., many-core CPUs and GPUs. Therefore, an emerging goal has been to produce heterogeneous formats and methods that allow critical kernels, e.g., SpMV, to be executed on different devices with portable performance and minimal changes to format and method. This paper presents a heterogeneous format based on CSR, named CSR-k, that can be tuned quickly and outperforms the average performance of Intel MKL on Intel Xeon Platinum 8380 and AMD Epyc 7742 CPUs while still outperforming NVIDIA's cuSPARSE and Sandia National Laboratories' KokkosKernels on NVIDIA A100 and V100 for regular sparse matrices, i.e., sparse matrices where the number of nonzeros per row has a variance $\leq$ 10, such as those commonly generated from two and three-dimensional finite difference and element problems. In particular, CSR-k achieves this with reordering and by grouping rows into a hierarchical structure of super-rows and super-super-rows that are represented by just a few extra arrays of pointers. Due to its simplicity, a model can be tuned for a device and used to select super-row and super-super-rows sizes in constant time.
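For readers unfamiliar with the underlying format, the sketch below shows a plain CSR matrix-vector product and a variant that walks rows through one extra pointer array of row groups. It is only an illustration of the idea of "a few extra arrays of pointers" mentioned in the abstract. The array name `sr_ptr`, its layout, and both function names are assumptions made here, and the sketch omits the reordering, the second grouping level, and any device tuning described for CSR-k.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def csr_spmv_grouped(sr_ptr, row_ptr, col_idx, values, x):
    """Same product, iterating over groups of consecutive rows.

    Assumption: sr_ptr[s] .. sr_ptr[s + 1] - 1 are the rows of group s
    (a hypothetical "super-row" layout). Each group could be handed to
    one thread or thread block as a unit of work.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for s in range(len(sr_ptr) - 1):
        for i in range(sr_ptr[s], sr_ptr[s + 1]):
            acc = 0.0
            for k in range(row_ptr[i], row_ptr[i + 1]):
                acc += values[k] * x[col_idx[k]]
            y[i] = acc
    return y


# 3x3 example: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
sr_ptr = [0, 2, 3]  # group 0 holds rows 0-1, group 1 holds row 2

if __name__ == "__main__":
    print(csr_spmv(row_ptr, col_idx, values, x))
    print(csr_spmv_grouped(sr_ptr, row_ptr, col_idx, values, x))
```

The grouped variant leaves the CSR arrays untouched and only adds `sr_ptr`, so the same stored matrix can be traversed with different group sizes by swapping in a different pointer array.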
Submitted 6 January, 2023; v1 submitted 9 March, 2022;
originally announced March 2022.
-
An Adaptive Self-Scheduling Loop Scheduler
Authors:
Joshua Dennis Booth,
Phillip Lane
Abstract:
Many shared-memory parallel irregular applications, such as sparse linear algebra and graph algorithms, depend on efficient loop scheduling (LS) in a fork-join manner even though the work per loop iteration can vary greatly depending on the application and the input. Because of its importance, many different methods, e.g., workload-aware self-scheduling, and parameters, e.g., chunk size, have been explored to achieve reasonable performance that requires expert prior knowledge about the application and input. This work proposes a new LS method that requires little to no expert knowledge to achieve speedups close to those of tuned LS methods by self-managing chunk size based on a heuristic of workload variance and using work-stealing. This method, named \ichunk, is implemented into libgomp for testing. It is evaluated against OpenMP's guided, dynamic, and taskloop methods and is evaluated against BinLPT and generic work-stealing on an array of applications that includes: a synthetic benchmark, breadth-first search, K-Means, the molecular dynamics code LavaMD, and sparse matrix-vector multiplication. On a 28-thread Intel system, \ichunk is the only method to always be one of the top three LS methods. On average across all applications, \ichunk is within 5.4% of the best method and is even able to outperform other LS methods for breadth-first search and K-Means.
Submitted 28 October, 2021; v1 submitted 15 July, 2020;
originally announced July 2020.
-
Performance Evaluation of Multiparty Authentication in 5G IIoT Environments
Authors:
Hussain Al-Aqrabi,
Phil Lane,
Richard Hill
Abstract:
With the rapid development of various emerging technologies such as the Industrial Internet of Things (IIoT), there is a need to secure communications between such devices. Communication system delays are one of the factors that adversely affect the performance of an authentication system. 5G networks enable greater data throughput and lower latency, which presents new opportunities for the secure authentication of business transactions between IIoT devices. We evaluate an approach to developing a flexible and secure model for authenticating IIoT components in dynamic 5G environments.
Submitted 16 January, 2020;
originally announced January 2020.
-
Securing Manufacturing Intelligence for the Industrial Internet of Things
Authors:
Hussain Al-Aqrabi,
Richard Hill,
Phil Lane,
Hamza Aagela
Abstract:
Widespread interest in the emerging area of predictive analytics is driving industries such as manufacturing to explore new approaches to the collection and management of data provided from Industrial Internet of Things (IIoT) devices. Often, analytics processing for Business Intelligence (BI) is an intensive task, and it also presents both an opportunity for competitive advantage and a security vulnerability in terms of the potential for losing Intellectual Property (IP). This article explores two approaches to securing BI in the manufacturing domain. Simulation results indicate that a Unified Threat Management (UTM) model is simpler to maintain and has fewer potential vulnerabilities than a distributed security model. Conversely, a distributed model of security out-performs the UTM model and offers more scope for the use of existing hardware resources. In conclusion, a hybrid security model is proposed where security controls are segregated into a multi-cloud architecture.
Submitted 22 January, 2019;
originally announced January 2019.
-
Towards Optimised Data Transport and Analytics for Edge Computing
Authors:
Phil Lane,
Richard Hill
Abstract:
Industrial organisations, particularly Small and Medium-sized Enterprises (SME), face a number of challenges with regard to the adoption of Industrial Internet of Things (IIoT) technologies and methods. The scope of analytics processing that can be performed on data from IIoT-enabled industrial processes is typically limited by the compute and storage resources that are available, and any investment in additional hardware that is sufficiently flexible and scalable is difficult to justify in terms of Return On Investment (ROI). We describe a distributed model of data transport and processing that eases the take-up of IIoT, whilst also enabling a capability to securely deliver more complex analysis and future insight discovery than would be possible with traditional network architectures.
Submitted 10 January, 2019;
originally announced January 2019.