Machine-Learning-Driven Runtime Optimization of BLAS Level 3 on Modern Multi-Core Systems

Yufan Xia The Chinese University of Hong Kong
Hong Kong SAR, China
[email protected]
   Giuseppe Maria Junior Barca
The University of Melbourne
Melbourne, Australia
[email protected]
Abstract

BLAS Level 3 operations are essential for scientific computing, but finding the optimal number of threads for multi-threaded implementations on modern multi-core systems is challenging. We present an extension to the Architecture and Data-Structure Aware Linear Algebra (ADSALA) library that uses machine learning to optimize the runtime of all BLAS Level 3 operations. Our method predicts the best number of threads for each operation based on the matrix dimensions and the system architecture. We test our method on two HPC platforms with Intel and AMD processors, using MKL and BLIS as baseline BLAS implementations. We achieve speedups of 1.5 to 3.0 for all operations, compared to using the maximum number of threads. We also analyze the runtime patterns of different BLAS operations and explain the sources of speedup. Our work shows the effectiveness and generality of the ADSALA approach for optimizing BLAS routines on modern multi-core systems.

I Introduction

The linear algebra subroutines in the Basic Linear Algebra Subprograms (BLAS) [1] form the backbone of scientific computing. Due to the critical role of BLAS, great effort has been devoted to improving the performance of its subroutines. The automatically tuned linear algebra software (ATLAS) and its improvements constituted the first wave of auto-tuning efforts for optimising linear algebra operations; they auto-tune the linear algebra operation codes by searching over parameters such as blocking factor, loop order, and partial storage location on each specific computer architecture [2, 3, 4]. Later in the 2000s, the self-optimised linear algebra routine (SOLAR) attempted analytical modelling of the timing of multi-process linear algebra operations [5, 6, 7, 8, 9, 10].

Within BLAS, Level 3 (L3) operations, which are concerned with matrix-matrix linear algebra, are the most complex and computationally demanding [1]. In addition to the General Matrix Multiplication (GEMM), BLAS L3 includes the Symmetric Matrix-Matrix Multiplication (SYMM), Symmetric Rank-k Update (SYRK), Symmetric Rank-2k Update (SYR2K), Triangular Matrix-Matrix Multiplication (TRMM), and Triangular Solve with Multiple Right-Hand Sides (TRSM) subroutines.

While there has been significant work on optimizing the single-threaded CPU performance of BLAS L3 by fine-tuning the size of matrix blocks for various system architectures, similar efforts to optimize these routines specifically for multi-core CPUs have been less frequent due to the high complexity and large variety of the underlying computer architectures.

Existing work has relied on established optimization techniques such as parameter tuning and blocking [11, 12]. The Parallel Linear Algebra for Scalable Multi-Core Architectures (PLASMA) [13] library uses polynomial regression to optimise the number of threads or block size used based on empirical features, obtaining performance comparable to the Intel® MKL for Cholesky decomposition. Peise et al. adopted polynomial regression to model dense linear algebra run times, and they applied additional techniques to boost the performance of this simple model [14].

However, choosing the number of threads that minimises the execution time of a given BLAS routine remains challenging and largely unsolved due to the underlying diversity and complexity of modern shared-memory computer architectures.

To tackle this challenge, recent research conducted by Xia et al. [15], which was integrated in the open-source Architecture and Data-Structure Aware Linear Algebra library, has employed a systematic machine learning (ML) methodology to significantly reduce the runtime of Single-precision General Matrix Multiplication (SGEMM) operations. This approach leverages an ML model to dynamically select the optimal number of threads for minimizing the execution time of GEMM for a specific input configuration and computer system architecture. The ML model itself undergoes training during the installation phase, and this training process is customized to the particular computer system architecture in use. Subsequently, during runtime, the ML model is employed to make predictions regarding the most efficient number of threads for a given task. An important advantage of this approach is its adoption of existing GEMM implementations, treating them as black boxes. This work has inspired a novel auto-tuning method in Graph Neural Network training [16].

In this study, we have expanded the runtime optimization capabilities of the ADSALA library to encompass all single- and double-precision BLAS L3 operations, and we present the performance results achieved. Furthermore, we have incorporated an automatic model selection feature during the installation process for each distinct BLAS subroutine. This addition aids in identifying and using the most appropriate machine learning algorithm for each subroutine on every machine where the library is installed. We test our method on two HPC platforms with Intel and AMD processors, using multi-threaded MKL and BLIS as baseline BLAS implementations. Our software implementation enables us to speed up all BLAS L3 subroutines, notwithstanding the runtime overhead of ML evaluations.

On the Gadi supercomputer located at the National Computational Infrastructure (NCI) and on the Setonix supercomputer located at the Pawsey Supercomputing Centre (please refer to Section V for details about NCI and Pawsey), we achieve speedups of 1.5× to 3.0× compared to using the maximum number of threads with hyperthreading enabled or disabled. For comparison, the most recent method relevant to our study employs pure polynomial regression to model and determine the number of threads for the PLASMA QR routine [13]. This method reported average speed improvements of 1%, 13%, and 27% across three different platforms. The sizes of the matrices used in these tests ranged from 2000 to 7000, which aligns closely with the range we tested in our study.

We also analyze the runtime patterns of different ADSALA BLAS L3 runs and discuss the sources of speedup. The ADSALA library is provided as an open-source implementation for the community to use and extend.

The remainder of this article is structured as follows. We provide background information concerning BLAS L3 operations and ML in Section II. We review the improved software design in Section III, and discuss the adopted ML methods in Section IV. Section V details the experimentation platform and settings. We present and analyze the performance of the ML models and software speedup in Section VI. Section VII concludes.

TABLE I: Specifications of BLAS level III subroutines.
Subroutine  Dims  Matrix A (shape, type)  Matrix B (shape, type)  Matrix C (shape, type)
GEMM        3     m × k, regular          k × n, regular          m × n, regular
SYMM        2     m × m, symmetric        m × n, regular          m × n, regular
SYRK        2     n × k, regular          n × k, regular          n × n, symmetric
SYR2K       2     n × k, regular          n × k, regular          n × n, symmetric
TRMM        2     m × m, triangular       m × n, regular          -
TRSM        2     m × m, triangular       m × n, regular          -
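TABLE II: Comparisons of ML model characteristics.
Model Category     Model                Parametric  Good with Data Imbalance  Data Size Requirement
Linear Models      Linear Regression    Yes         No                        Medium
Linear Models      ElasticNet           Yes         No                        Medium
Linear Models      Bayesian Regression  Yes         No                        Small
Tree-Based Models  Decision Tree        No          Yes                       Medium
Tree-Based Models  XGBoost              No          Yes                       Medium
Tree-Based Models  AdaBoost             No          Yes                       Medium
Tree-Based Models  Random Forest        No          Yes                       Medium
Tree-Based Models  LightGBM             No          Yes                       Medium
Other Models       SVM Regressor        No          No                        Small
Other Models       KNN Regressor        No          No                        Medium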

II Background

In this Section, we introduce BLAS L3 subroutines, ML algorithms, and related data preprocessing techniques.

II-A BLAS Level III Subroutines

BLAS L3 comprises six matrix-matrix operations: GEMM, SYMM, SYRK, SYR2K, TRMM, and TRSM [1]. These operations differ in the number of matrix operands and in the matrix shapes, and some operands are required to be symmetric or triangular. These requirements are listed in Table I. The subroutines provide a unified interface across different BLAS implementations, which enables users to switch between implementations without changing their code.

The performance of BLAS L3 subroutines is highly dependent on the underlying implementation and the system architecture. For example, the GEMM routine is the most commonly used BLAS Level III subroutine, and it is usually the most optimised one in BLAS implementations. However, as reported in [15], even for GEMM the maximum thread number, which we define as the number of available CPU cores times the hyper-threading level, may not provide the best performance. While this issue has not been reported for other BLAS subroutines, we will show in later sections that the other BLAS L3 subroutines exhibit similar behaviour.

II-B Machine Learning Algorithms

To build a high-performance ML model, we need to sift through suitable potential candidates. We provide a brief overview of the dataset we will use (for more details on data collection, please refer to subsection IV-B) to help justify our selection of candidate models that best fit the data characteristics. The dataset for a given BLAS L3 operation typically comprises approximately $10^{3}$ data points spanning 4-15 dimensions, with most data features exhibiting a skewed distribution. Despite being a relatively small dataset with low-dimensional data, the relationship between the features and the label can be quite complex due to the expected non-linear relation stemming from the polynomial time complexity of most BLAS L3 operations and the intricacies of multi-core computer architectures. Given these attributes, we present several ML algorithms as potential candidates in Table II.

Linear algorithms produce simple but fast ML models [17]. They usually suffer from low predictive accuracy for non-linear mappings. Because evaluation speed is an essential factor in our task, linear ML models may still enable practical BLAS L3 speedups. We include linear regression, ElasticNet, and Bayesian regression as our candidates [18, 19].

We also include both Decision Tree and tree ensemble models as candidates. The Decision Tree algorithm is non-parametric; it uses a set of rules to map each data instance to a discrete class or a continuous value [20]. Random Forest, AdaBoost, XGBoost, and LightGBM are more complex algorithms that are built upon an ensemble of Decision Trees. These tree ensemble algorithms can generally reduce bias and variance compared to a single Decision Tree and thus tend to predict more accurately, at the cost of a higher evaluation time.

Support Vector Machine (SVM) and k-Nearest-Neighbors (kNN) are non-parametric algorithms that are designed to work well for high dimensions and unknown data structures, respectively [21]. Since our data has low dimensionality, they might not demonstrate their advantages when learning from our dataset; kNN is also known to be slow in evaluation [22]. We evaluate their performance for completeness.

(a) The installation workflow of ADSALA GEMM. Upon ADSALA installation, the library performs two sub-parts shown in the diagram. In the end, two files containing the configurations together with the production-ready ML model will be saved for later use at runtime.
(b) The runtime workflow of ADSALA for BLAS level III subroutines. Configuration files and trained ML models outputted during installation (see Fig. 1(a)) are used by this runtime library.
Figure 1: The software design of ADSALA.

II-C Data Preprocessing Techniques

Data outliers can affect the predictive performance of ML models. While statistical methods can effectively remove global outliers, they often fail to detect local outliers. The Local Outlier Factor (LOF), a density-based method, overcomes this limitation by assigning each data point a degree-of-isolation score based on its surrounding points [23, 24]. We use LOF to identify and remove these outliers from our dataset.

The Yeo-Johnson transformation is a data transformation technique that remaps each feature value such that the feature distribution is near-Gaussian, thus improving the predictive performance of ML algorithms that assume normality, such as linear regression, ElasticNet, and Bayesian regression. Unlike the original Box-Cox transformation, which requires all feature values to be positive, Yeo-Johnson accepts non-positive values and provides stable parameter estimation [25, 26]. The Yeo-Johnson transformation uses a parameter $\lambda$ to control the strength of the transformation. We apply Yeo-Johnson with maximum likelihood estimation (MLE) of $\lambda$, thereby automating the ML workflow.
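To make the role of $\lambda$ concrete, the following is a minimal Python sketch of the standard Yeo-Johnson mapping for a single value. The function name `yeo_johnson` is chosen here for illustration and is not part of ADSALA; the MLE step that selects $\lambda$ is deliberately omitted.

```python
import math


def yeo_johnson(y: float, lam: float) -> float:
    """Yeo-Johnson power transform of one value y with parameter lam.

    Illustrative helper only: lam is supplied by the caller here, whereas
    the text estimates it per feature by maximum likelihood.
    """
    if y >= 0:
        if abs(lam) < 1e-12:               # lam == 0: logarithmic branch
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:             # lam == 2: logarithmic branch for y < 0
        return -math.log1p(-y)
    return -(((1.0 - y) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
```

With `lam = 1` the mapping is the identity for both signs of `y`, and values of `lam` below 1 compress large positive values, which is what pulls a right-skewed feature towards a more symmetric distribution.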

III Software Workflow

Our software, upon installation, collects runtime data for each BLAS L3 subroutine over different combinations of thread numbers and input matrix dimensions, depending on the BLAS packages available. This data is used to train an ML model that predicts the optimal thread number for given input dimensions of the subroutine. The final product is an ADSALA library that dynamically selects the optimal thread number at runtime, adapting to different HPC platforms, BLAS packages, and subroutines.

Figures 1(a) and 1(b) illustrate our software design. The software comprises data gathering, model training, and a runtime library. The first two parts are executed at installation time, while the third part is used at runtime when the user program links to our library.

III-A Installation Workflow

During installation, the software gathers training data through experimentation. It quasi-randomly samples (see Section IV-B) from the domain of matrix dimensions of the specific subroutine and passes the samples to a timing program that runs and times the corresponding BLAS operations. The timings are stored as training data for the ML model. The timing data is then preprocessed and used to tune the hyper-parameters of the ML model.

III-B Runtime Workflow

Upon instantiation, the ML model and its configuration are loaded into memory. At runtime, when the user program calls a BLAS function, the ML model predicts the optimal thread number on the fly and runs the implementation with that thread number. The ML model and configuration are wrapped in a C++ class that can be destroyed after the last BLAS call to free memory.

To avoid redundant ML model evaluations for two consecutive BLAS calls with identical input dimensions, our software remembers the input to the last BLAS call and its corresponding ML prediction. If the current BLAS input dimensions are the same as the previous ones, the software simply reads and applies the prediction stored in the corresponding class attributes without re-evaluating the model.
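The last-call caching described above can be sketched as follows. This is an illustration in Python rather than the library's C++ class; the names `ThreadPredictorCache`, `threads_for`, and the `predict` callable are hypothetical stand-ins for the wrapped ML model.

```python
from typing import Callable, Optional, Tuple


class ThreadPredictorCache:
    """Remembers the last input dimensions and the thread count predicted for them."""

    def __init__(self, predict: Callable[[Tuple[int, ...]], int]) -> None:
        self._predict = predict                      # hypothetical wrapper around the ML model
        self._last_dims: Optional[Tuple[int, ...]] = None
        self._last_threads: Optional[int] = None
        self.evaluations = 0                         # number of real model evaluations so far

    def threads_for(self, dims: Tuple[int, ...]) -> int:
        # Re-evaluate the model only when the dimensions differ from the previous call.
        if dims != self._last_dims:
            self._last_threads = self._predict(dims)
            self._last_dims = dims
            self.evaluations += 1
        return self._last_threads


# Toy predictor (an assumption for the demo): use the smallest dimension, capped at 96 threads.
cache = ThreadPredictorCache(lambda dims: min(min(dims), 96))
cache.threads_for((64, 2048, 64))
cache.threads_for((64, 2048, 64))   # served from the cache, no second evaluation
```

Only a single entry is kept, so the cache pays off for loops that repeat the same call shape, which is the case the text targets.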

IV Machine Learning Methods

In this section, we build upon the original ADSALA proof-of-concept library [15] and extend it to support all BLAS L3 subroutines. We first introduce the mechanism of ML predictions, then we detail the methods for data collection, data preprocessing, model selection, and model training process in our software. Since our goal is a library that works on different HPC architectures and BLAS packages, the following methodologies can be applied to any HPC architectures and BLAS packages.

IV-A Mechanism for Predictions

For a given BLAS subroutine and a given combination of input dimensions, the software selects the optimal number of threads by first predicting the runtime of the BLAS subroutine for each possible number of threads. It then selects the thread number with the shortest predicted runtime for the ensuing BLAS subroutine execution.
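The selection step amounts to an argmin over the candidate thread counts. The sketch below illustrates it in Python with a made-up runtime model; `select_threads` and `toy_runtime` are hypothetical names, and the cost constants are arbitrary rather than measured.

```python
from typing import Callable, Iterable, Sequence


def select_threads(
    predict_runtime: Callable[[Sequence[int], int], float],
    dims: Sequence[int],
    candidate_threads: Iterable[int],
) -> int:
    """Return the candidate thread count with the shortest predicted runtime."""
    return min(candidate_threads, key=lambda nt: predict_runtime(dims, nt))


def toy_runtime(dims: Sequence[int], nt: int) -> float:
    """Stand-in for the trained model: parallel work plus a per-thread overhead."""
    m, k, n = dims
    work = m * k * n / nt          # FLOP-like term, shrinks with more threads
    overhead = 5000.0 * nt         # synchronisation-like term, grows with more threads
    return work + overhead


best = select_threads(toy_runtime, (64, 2048, 64), range(1, 97))
```

In this toy model the minimum lies well below the maximum of 96 threads, mirroring the observation that the maximum thread count is not always the fastest choice.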

IV-B Data Gathering


In order to ensure that our method is effective for BLAS L3 subroutines across a variety of matrix shapes and sizes within memory constraints, including slim/square and big/small matrices, we need sample points that are evenly distributed across the sampling domain. To achieve this, we employ a scrambled Halton sequence to produce a quasi-random sequence with low discrepancy for data sampling [27]. We set the upper bound on the total size of the matrices to 500 MB. (Footnote 1: Some BLAS subroutines have their output matrix overwrite the input matrix (TRMM and TRSM). We calculate the total memory size as the sum of input/output matrices, ignoring the overwritten matrices.) Given that the samples can have two or three dimensions, we opt for the scrambled Halton sequence over the regular Halton sequence to reduce the correlation between dimensions [27]. We utilize bases 2, 3, and 4 for generating the sequence for dimensions $m$, $k$, and $n$, and bases 2 and 3 for dimensions $m$, $n$ (or $n$, $k$). These samples are then input into the timing program to gather their runtime data.
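As an illustration of this sampling scheme, the following Python sketch draws GEMM dimensions from a digit-permuted Halton sequence and rejects points above the memory limit. Several choices are assumptions made for the example and are not taken from the source: the simple per-base digit permutation used as the scramble, the prime bases 2, 3, and 5 (the conventional Halton choice, which differs from the bases quoted above), the linear mapping to dimensions, and the hypothetical `max_dim` bound.

```python
import random
from typing import List, Sequence, Tuple


def radical_inverse(index: int, base: int, perm: Sequence[int]) -> float:
    """Radical inverse of `index` in `base`, with each digit remapped through `perm`."""
    result, scale = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += perm[digit] * scale
        scale /= base
    return result


def sample_gemm_dims(
    count: int,
    max_dim: int = 20000,              # hypothetical upper bound per dimension
    bytes_per_word: int = 8,           # double precision
    limit_bytes: int = 500 * 10**6,    # 500 MB cap on A, B, and C together
    seed: int = 0,
) -> List[Tuple[int, int, int]]:
    rng = random.Random(seed)
    bases = (2, 3, 5)
    perms = []
    for base in bases:
        tail = list(range(1, base))
        rng.shuffle(tail)              # digit 0 stays fixed so the value remains in [0, 1)
        perms.append([0] + tail)

    samples: List[Tuple[int, int, int]] = []
    index = 1
    while len(samples) < count:
        m, k, n = (
            1 + int(radical_inverse(index, base, perm) * max_dim)
            for base, perm in zip(bases, perms)
        )
        index += 1
        # Keep only points whose total matrix footprint fits the memory limit.
        if (m * k + k * n + m * n) * bytes_per_word <= limit_bytes:
            samples.append((m, k, n))
    return samples
```

Rejection keeps the accepted points low-discrepancy within the feasible region, at the cost of discarding part of the sequence.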

IV-C Feature Engineering and Data Preprocessing

Table III shows the features used for the ML model. We created two sets of features, one for subroutines with three matrix size parameters and one for subroutines with two matrix size parameters, reflecting the computational complexity, the memory footprint, and the multi-thread speedup.

The execution time for BLAS subroutines is a function of the matrix dimensions and the number of threads ($nt$), and it varies depending on the architecture. For three-dimension subroutines, the terms $m*k$, $k*n$, $m*n$, and $m*k + k*n + m*n$ correspond to the sizes of matrices $A$, $B$, $C$, and the total memory size in single-/double-precision words, respectively. Similar terms are constructed for two-dimension subroutines as shown in the second group in Table III. These memory terms are directly related to the number of memory operations and hence influence the runtime of various BLAS subroutines. Specifically, these terms are dominant in the serial runtime for smaller matrix dimensions. In three-dimension subroutines, the cubic term $m*k*n$ is proportional to the number of floating-point operations performed and tends to dominate the runtime in serial execution for larger matrix dimensions. In parallel execution, the FLOP workloads are distributed across threads, resulting in terms like $m*n*k/nt$.

For the feature transformation, outlier removal, feature selection, and hyper-parameter tuning stages, we use the same methodologies as in the original ADSALA library, which help improve both the speed and the predictive performance of the model [15]. We employ the Yeo-Johnson transformation to bring the distribution of each feature close to a Gaussian distribution, which is beneficial for model learning. (Footnote 2: In our tests, we observed an enhancement in the performance of the linear models when the Yeo-Johnson transformation was applied, resulting in a 10-20% decrease in the Root Mean Square Error (RMSE) for Linear Regression. Applying it does not affect the performance of the other candidate models much.) This transformation identifies the most suitable value of the parameter $\lambda$ for each feature using the MLE method [26]. Following this transformation, we standardise the features to ensure they all operate on a similar scale [28]. Subsequently, we eliminate features whose correlation coefficients with other features exceed a threshold of 80%, in order to remove potentially redundant features among the candidates. For each correlated feature pair, we remove the feature with the larger total correlation with the other features. Following this, hyper-parameter tuning is performed for all models to compare model performance.
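The correlation-based elimination step can be sketched in plain Python as follows. The function names are hypothetical, and the order in which correlated pairs are processed (most strongly correlated pair first) is an assumption, since the text does not specify it.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def drop_correlated(features: Dict[str, List[float]], threshold: float = 0.8) -> List[str]:
    """Return the feature names kept after removing highly correlated ones."""
    names = list(features)
    corr = {
        a: {b: abs(pearson(features[a], features[b])) for b in names if b != a}
        for a in names
    }
    kept = list(names)

    def total(name: str) -> float:
        # Total absolute correlation of `name` with the other remaining features.
        return sum(corr[name][other] for other in kept if other != name)

    while True:
        pairs = [
            (a, b)
            for i, a in enumerate(kept)
            for b in kept[i + 1:]
            if corr[a][b] > threshold
        ]
        if not pairs:
            return kept
        a, b = max(pairs, key=lambda p: corr[p[0]][p[1]])
        # From the offending pair, drop the feature that is more correlated overall.
        kept.remove(a if total(a) >= total(b) else b)


toy = {"m": [1, 2, 3, 4, 5], "m*n": [1, 4, 9, 16, 25], "nt": [3, 1, 4, 1, 5]}
kept = drop_correlated(toy)
```

In the toy data, `m` and `m*n` are correlated above the threshold, and `m*n` is the one removed because its total correlation with the remaining features is larger.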

TABLE III: List of available features for BLAS subroutines with two or three matrix dimensions. nt stands for number of threads.
     Three Dimensions (m, k, n)                    Two Dimensions (m, n)
1    m                   m/nt                      m
2    k                   k/nt                      n
3    n                   n/nt                      nt
4    nt                  m*k/nt                    m*n
5    m*k                 m*n/nt                    memory_footprint
6    m*n                 k*n/nt                    m/nt
7    k*n                 m*k*n/nt                  n/nt
8    m*k*n               memory_footprint/nt       m*n/nt
9    memory_footprint    -                         memory_footprint/nt
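The three-dimension feature set in Table III can be assembled as in the following Python sketch. The function name `gemm_features` is hypothetical, and `memory_footprint` is computed in words as $m*k + k*n + m*n$, following the description in the text.

```python
from typing import Dict


def gemm_features(m: int, k: int, n: int, nt: int) -> Dict[str, float]:
    """Candidate features for a three-dimension subroutine (Table III, left group)."""
    size_terms = {
        "m": m,
        "k": k,
        "n": n,
        "m*k": m * k,                                # size of A in words
        "m*n": m * n,                                # size of C in words
        "k*n": k * n,                                # size of B in words
        "m*k*n": m * k * n,                          # proportional to the FLOP count
        "memory_footprint": m * k + k * n + m * n,   # total size in words
    }
    features: Dict[str, float] = {"nt": nt, **size_terms}
    # Per-thread variants capture how the work is divided in parallel execution.
    features.update({f"{name}/nt": value / nt for name, value in size_terms.items()})
    return features


example = gemm_features(64, 2048, 64, 16)
```

The result has 17 entries, the 9 base terms plus 8 per-thread terms, before any correlation-based elimination is applied.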

IV-D Model Selection

Given that our method’s ultimate goal is to achieve optimal runtime speedup, the model selection process incorporates both the predictive performance of the model and the speed of model evaluation, with a potential trade-off between these two. This approach is akin to the proof-of-concept work conducted with ADSALA [15], where the speedup is estimated using the formula:

$$s = \frac{t_{\text{original}}}{t_{\text{ADSALA}} + t_{\text{eval}}}$$

In this context, $t_{\text{ADSALA}}$ is the runtime of the BLAS subroutine with predicted threads, $t_{\text{eval}}$ is the ML model evaluation time, and $t_{\text{original}}$ is the runtime with maximum threads. The evaluation time $t_{\text{eval}}$ is measured by averaging multiple runs. The ML model with the highest average estimated speedup $s$ across all BLAS subroutines is selected.
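The trade-off between prediction quality and evaluation cost can be illustrated with a short Python sketch of this selection rule. The model names and all timings below are invented for the example, and `select_model` is a hypothetical helper rather than the library's interface.

```python
from typing import Dict, List, Tuple

# For each candidate model: (list of (t_original, t_adsala) pairs in seconds, t_eval in seconds).
Candidate = Tuple[List[Tuple[float, float]], float]


def estimated_speedup(t_original: float, t_adsala: float, t_eval: float) -> float:
    """s = t_original / (t_ADSALA + t_eval)."""
    return t_original / (t_adsala + t_eval)


def mean_speedup(candidate: Candidate) -> float:
    timings, t_eval = candidate
    return sum(estimated_speedup(t0, t1, t_eval) for t0, t1 in timings) / len(timings)


def select_model(candidates: Dict[str, Candidate]) -> str:
    """Pick the model with the highest mean estimated speedup."""
    return max(candidates, key=lambda name: mean_speedup(candidates[name]))


# Short calls: the slow-to-evaluate model loses despite choosing better thread counts.
small_calls = {
    "linear": ([(1e-3, 8e-4)], 8e-6),
    "forest": ([(1e-3, 6e-4)], 1e-3),
}
# Long calls: evaluation time is negligible, so the more accurate model wins.
large_calls = {
    "linear": ([(1.0, 0.8)], 8e-6),
    "forest": ([(1.0, 0.6)], 1e-3),
}
choice_small = select_model(small_calls)
choice_large = select_model(large_calls)
```

The two toy workloads select different models, which shows why the evaluation time has to enter the selection criterion alongside accuracy.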

V Experimentation Information

This Section presents the supercomputing platforms used for experimentation and the setup of the experiments on those platforms.

V-A Experimentation Platforms

We conducted our experiments on two supercomputing platforms.

V-A1 Setonix

Setonix is a supercomputer housed at the Pawsey Supercomputing Research Centre in Australia. Each of its compute nodes has specific features, as depicted in Fig. 2.

  • Each compute node is equipped with two CPU sockets, each housing an AMD® EPYC 64-Core Milan CPU (2.55 GHz). The CPUs support hyper-threading, enabling up to 256 threads to run concurrently per compute node. Each Milan CPU comprises eight modules, with each module containing eight Zen 3 cores and a dedicated 32 MB level three cache that is shared among these cores.

  • The system has a memory capacity of 256GB, organized into eight NUMA domains, with four NUMA domains per socket. Each socket supports eight memory channels.

Figure 2: A schematic diagram for the 2-socket EPYC CPU configuration on Setonix.

V-A2 Gadi

Gadi is a supercomputer located at the Australian National Computational Infrastructure. Figure 3 illustrates the features of each of Gadi’s compute nodes:

  • Each compute node has two CPU sockets, each containing an Intel® Xeon 24-Core Cascade Lake CPU (Platinum 8274, 3.2 GHz). The CPUs support hyper-threading, enabling up to 96 threads to run concurrently per compute node.

  • The system has a memory capacity of 192GB, organized into four NUMA domains, with two NUMA domains per socket. Each socket supports six memory channels.

Figure 3: A schematic diagram of the 2-socket Cascade Lake CPU configuration; sockets are connected using Intel® UPI (Ultra Path Interconnect).

V-B Experimentation Setup

For the Setonix supercomputer, which uses AMD® processors, we use the performance of BLIS (https://developer.amd.com/amd-aocl/blas-library/) as the baseline for measuring performance improvements. For the Gadi supercomputer, which uses Intel® processors, we use the performance of MKL (https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-mkl-for-dpcpp/top.html) as the baseline. To simplify the comparison with the proof-of-concept ADSALA work, we use the same experimental settings, which can be found in [15].

TABLE IV: Model performance and estimated speedups for ML models on Setonix.
subroutine best_model
dgemm LinearRegression
dsymm XGBRegressor
dsyr2k XGBRegressor
dsyrk XGBRegressor
dtrmm XGBRegressor
dtrsm XGBRegressor
sgemm XGBRegressor
ssymm XGBRegressor
ssyr2k XGBRegressor
ssyrk XGBRegressor
strmm XGBRegressor
strsm XGBRegressor
TABLE V: Model performance and estimated speedups for ML models on Gadi.
subroutine best_model
dgemm BayesianRidge
dsymm RandomForestRegressor
dsyr2k XGBRegressor
dsyrk LinearRegression
dtrmm XGBRegressor
dtrsm LinearRegression
sgemm XGBRegressor
ssymm RandomForestRegressor
ssyr2k XGBRegressor
ssyrk XGBRegressor
strmm XGBRegressor
strsm LinearRegression
TABLE VI: Detailed statistics for model performance and estimated speedups for ML models on Gadi. The bold values in each column represent the best entry in that column.
dgemm Gadi
Model | Normalised Test RMSE | Ideal mean speedup | Ideal aggregate speedup | Model evaluation time (μs) | Estimated mean speedup | Estimated aggregate speedup
Linear Regression 0.93 1.13 1.01 15.3 1.12 1.01
ElasticNet 1.00 1.07 0.94 10.63 1.07 0.94
Bayes Regression 0.93 1.13 1.01 8.11 1.13 1.01
Decision Tree 0.32 0.82 0.36 8.02 0.82 0.36
Random Forest 0.24 1.08 0.97 983.09 1.01 0.94
AdaBoost 0.52 0.84 0.46 89.77 0.84 0.46
KNN 0.28 1.07 0.95 6449.32 0.78 0.78
XGBoost 0.23 1.1 0.98 290.76 1.08 0.97
dsymm Gadi
Model | Normalised Test RMSE | Ideal mean speedup | Ideal aggregate speedup | Model evaluation time (μs) | Estimated mean speedup | Estimated aggregate speedup
Linear Regression 1.00 1.49 1.86 6.49 1.49 1.86
ElasticNet 1.00 1.45 1.8 4.78 1.45 1.8
Bayes Regression 1.00 1.49 1.86 4.65 1.49 1.86
Decision Tree 0.46 1.07 0.54 4.56 1.07 0.54
Random Forest 0.15 1.72 1.89 563.59 1.62 1.86
AdaBoost 0.61 0.76 0.36 69.88 0.75 0.36
KNN 0.18 1.76 1.95 2775.13 1.42 1.79
XGBoost 0.18 1.65 1.82 1046.46 1.5 1.77
ssyrk Gadi
Model | Normalised Test RMSE | Ideal mean speedup | Ideal aggregate speedup | Model evaluation time (μs) | Estimated mean speedup | Estimated aggregate speedup
Linear Regression 1.00 0.95 0.89 6.85 0.95 0.89
ElasticNet 1.00 0.99 0.96 5.07 0.99 0.96
Bayes Regression 1.00 0.95 0.89 4.97 0.95 0.89
Decision Tree 0.33 0.56 0.28 4.89 0.56 0.28
Random Forest 0.09 1.2 0.81 2324.21 0.96 0.77
AdaBoost 0.48 0.53 0.26 62.94 0.52 0.26
KNN 0.11 1.2 0.85 1760.56 1 0.81
XGBoost 0.09 1.15 0.82 446.43 1.08 0.81
strsm Gadi
Model | Normalised Test RMSE | Ideal mean speedup | Ideal aggregate speedup | Model evaluation time (μs) | Estimated mean speedup | Estimated aggregate speedup
Linear Regression 1.00 1.04 0.96 8.84 1.04 0.96
ElasticNet 1.00 1.02 0.95 6.39 1.02 0.95
Bayes Regression 1.00 1.04 0.96 5.72 1.04 0.96
Decision Tree 0.31 0.64 0.23 4.94 0.64 0.23
Random Forest 0.06 1.2 1.1 2191.65 0.96 1.01
AdaBoost 0.46 0.58 0.27 119.84 0.57 0.27
KNN 0.08 1.2 1.09 1688.15 1 1.02
XGBoost 0.07 1.21 1.05 1354.14 1.04 0.99

Figure 4: Heatmap of the optimal number of threads on (a) Setonix and (b) Gadi, concerning all BLAS level III subroutines except GEMM. The horizontal and vertical axes use a square root scale.

Figure 5: Heatmap of the optimal number of threads on (a) Setonix and (b) Gadi. The horizontal and vertical axes use a square root scale. The dashed lines on each sub-graph are contour lines of the sampling domain with each label showing the value of the third dimension.

Figure 6: Heatmap of the testing speedup with respect to matrix sizes on (a) Setonix and (b) Gadi, concerning all BLAS level III subroutines except GEMM. The horizontal and vertical axes use a square root scale.
TABLE VII: Speedup statistics on Setonix and Gadi with hyper-threading
(a) Setonix
Mean Std Min 25% 50% 75% Max
dgemm 1.54 0.66 0.87 1.16 1.29 1.74 4.79
sgemm 1.32 0.41 0.76 1.05 1.18 1.37 9.05
dsymm 2.89 1.80 0.82 1.38 2.36 4.09 8.46
ssymm 2.22 1.69 0.34 1.04 1.60 2.88 7.42
dsyr2k 1.48 0.46 0.94 1.17 1.36 1.56 3.59
ssyr2k 1.53 0.49 0.75 1.20 1.43 1.72 3.79
dsyrk 1.46 0.46 0.92 1.15 1.28 1.69 3.54
ssyrk 1.55 0.42 0.98 1.22 1.46 1.82 2.80
dtrmm 1.61 0.82 0.87 1.13 1.27 1.78 6.51
strmm 1.67 0.92 0.94 1.10 1.34 1.81 7.13
dtrsm 1.71 1.07 0.99 1.13 1.27 1.77 7.43
strsm 1.68 1.21 0.98 1.12 1.34 1.85 12.38
(b) Gadi
Mean Std Min 25% 50% 75% Max
dgemm 1.27 0.55 0.41 1.01 1.13 1.29 4.25
sgemm 1.07 0.70 0.88 1.00 1.00 1.02 3.01
dsymm 2.28 1.89 0.88 1.18 1.43 2.50 12.08
ssymm 2.16 1.98 0.99 1.18 1.30 2.22 11.05
dsyr2k 1.28 0.38 0.90 1.08 1.20 1.37 4.27
ssyr2k 1.47 0.60 0.41 1.14 1.26 1.61 4.52
dsyrk 1.40 0.36 0.86 1.20 1.30 1.47 3.98
ssyrk 1.65 0.47 1.11 1.29 1.45 1.96 3.03
dtrmm 1.30 0.24 1.03 1.16 1.25 1.34 2.75
strmm 1.35 0.26 0.98 1.17 1.30 1.47 2.78
dtrsm 1.31 0.25 0.75 1.17 1.29 1.41 2.86
strsm 1.40 0.40 0.74 1.19 1.33 1.47 4.40

Figure 7: Heatmap for speedups with respect to matrix dimensions on (a) Setonix and (b) Gadi. The horizontal and vertical axes use a square root scale. The dashed lines on each sub-graph are contour lines of the sampling domain with each label showing the value of the third dimension.
TABLE VIII: Profiling results on Gadi with GEMM, SYMM, and SYRK.
m, k, n Thread Count Total Time (s) Thread Sync (s) Kernel Call (s) Data Copy (s)
dgemm 64,2048,64 no ML 96 11.001 4.806 0.010 2.060
dgemm 64,2048,64 with ML 16 4.430 1.453 0.001 0.887
sgemm 64,2048,64 no ML 96 7.698 2.716 0.005 1.133
sgemm 64,2048,64 with ML 5 2.701 0.155 0.001 0.301
dsymm 248,39944 no ML 96 29.063 11.627 0.496 9.783
dsymm 248,39944 with ML 25 22.067 4.962 0.441 9.698
ssymm 2759,41681 no ML 96 27.299 8.986 1.208 10.605
ssymm 2759,41681 with ML 12 15.597 2.085 0.241 8.254
dsyrk 124,160163 no ML 96 36.060 8.545 0.005 13.672
dsyrk 124,160163 with ML 43 34.371 7.543 0.001 12.529
ssyrk 175,15095 no ML 96 65.844 62.666 1.639 1.103
ssyrk 175,15095 with ML 46 4.581 1.323 1.744 1.263

VI Performance Analysis

In this section, we first provide a visual representation and succinct explanation of our datasets. We then present the results of our machine learning model selection process. Finally, we evaluate the performance of our method and its software implementation using our test datasets. All analyses are performed on both the Setonix and Gadi supercomputers and for all BLAS L3 subroutines in double and single precision.

VI-A Datasets, Training Time, and Data Visualization

The training dataset contains 1000-1200 data points for each of the BLAS L3 subroutines on each HPC platform. We consider this size sufficient, since we generally observe that the validation performance does not improve significantly with more training data.

The data gathering on Setonix and Gadi required approximately 100 node hours for each subroutine. It was observed that, given the same upper limit of matrix size (specifically 500 MB), the double precision subroutines took more time to collect data compared to their single precision counterparts.

Figure 5 contains two subfigures that depict the distribution of the optimal number of threads for GEMM on Setonix and Gadi, respectively. The patterns for sGEMM and dGEMM appear to be quite similar on Setonix, but show more variation on Gadi. A common trend is that irregular GEMM calls, where at least one of the dimensions $m$, $k$, or $n$ is small, often lead to suboptimal performance. We can also observe patches of abnormal areas where the optimal number of threads is drastically different from that of the surrounding area.

The optimal number of threads for SYMM, SYRK, SYR2K, TRMM, and TRSM on Setonix and Gadi is displayed in Fig. 4. The patterns for double precision and single precision subroutines are mostly similar, but there are some exceptions. Also, the distribution of the optimal number of threads is more uneven than for GEMM. For SYRK, TRMM, and TRSM on Setonix, many calls have an optimal number of threads higher than the number of physical cores (that is, they use hyperthreads). In contrast, for SYRK, SYR2K, and TRMM on Gadi, almost all calls have an optimal number of threads lower than the number of physical cores. There are also abnormal areas where the optimal number of threads is drastically different from that of the surrounding area, similar to GEMM.

These results show the complex performance patterns of BLAS L3 subroutines; our ML models aim to learn these patterns and select the best number of threads.
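One way to picture how a learned model can turn such patterns into a thread choice is sketched below. The sketch assumes, which this section does not state, that the model maps the matrix dimensions plus a candidate thread count to a predicted runtime and that the candidate with the smallest prediction is selected. The function `toy_runtime_model` and its constants are hypothetical stand-ins for a trained regressor, not the ADSALA model.

```python
from typing import Callable, Iterable, Tuple

Dims = Tuple[int, int, int]


def select_threads(
    dims: Dims,
    candidates: Iterable[int],
    predict_runtime: Callable[[Dims, int], float],
) -> int:
    """Return the candidate thread count with the smallest predicted runtime."""
    return min(candidates, key=lambda p: predict_runtime(dims, p))


def toy_runtime_model(dims: Dims, threads: int) -> float:
    """Hypothetical stand-in for a trained regressor (arbitrary time units).

    The work is divided among the threads, while an overhead term grows
    with the thread count, so small problems favour fewer threads.
    """
    m, k, n = dims
    work = 2.0 * m * k * n           # FLOP count of a GEMM
    overhead_per_thread = 5.0e4      # assumed, purely illustrative constant
    return work / threads + overhead_per_thread * threads


if __name__ == "__main__":
    candidates = range(1, 49)        # assumed candidate thread counts
    print(select_threads((64, 64, 64), candidates, toy_runtime_model))
    print(select_threads((4096, 4096, 4096), candidates, toy_runtime_model))
```

With this toy model the small problem is assigned only a few threads, while the large one is assigned the maximum, which mirrors the qualitative trend described above.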

As mentioned in Section IV-C, we use stratified sampling to split the dataset for model training and testing, with 15% of the dataset used as the test set.
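A stratified 85/15 split can be sketched with the standard library alone, as below. The choice of stratum is an assumption made for illustration (here, a label such as the optimal thread count of each sample); Section IV-C remains the authoritative description of the strata actually used.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(
    strata: Sequence[Hashable], test_fraction: float = 0.15, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split sample indices so that each stratum contributes roughly
    `test_fraction` of its members to the test set."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for index, label in enumerate(strata):
        by_stratum[label].append(index)

    train, test = [], []
    for label in sorted(by_stratum, key=repr):   # deterministic stratum order
        indices = by_stratum[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


if __name__ == "__main__":
    # Hypothetical strata: the optimal thread count of each of 1000 samples.
    labels = [8] * 400 + [16] * 400 + [32] * 200
    train_idx, test_idx = stratified_split(labels)
    print(len(train_idx), len(test_idx))
```

Splitting per stratum keeps rare labels represented in both subsets, which a plain random split does not guarantee.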

VI-B Model Performance and Selection

The best algorithms for all BLAS subroutines, chosen by our procedure of selecting the ML model with the highest average estimated speedup $s$, are shown in Table V and V. Only four algorithms are chosen at least once, with XGBoost still being the most common choice. The results also indicate that the best algorithms for Setonix and Gadi sometimes differ, which is expected since the two platforms have different architectures. Table VI additionally provides detailed statistics supporting the selection of the most suitable models for four subroutines on Gadi.
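The selection rule can be illustrated with a short sketch. It assumes, for illustration only, that the estimated speedup of a sample is the measured runtime at the maximum thread count divided by the measured runtime at the thread count the model chooses; the paper's own definition of $s$ takes precedence. The model names and timings are hypothetical placeholders, not results.

```python
from statistics import mean
from typing import Dict, List, Mapping, Sequence, Tuple

Timings = Mapping[int, float]   # thread count -> measured runtime


def mean_estimated_speedup(
    chosen_threads: Sequence[int], timings: Sequence[Timings], max_threads: int
) -> float:
    """Average, over samples, of runtime(max threads) / runtime(chosen threads)."""
    return mean(t[max_threads] / t[p] for p, t in zip(chosen_threads, timings))


def select_model(
    choices_by_model: Mapping[str, Sequence[int]],
    timings: Sequence[Timings],
    max_threads: int,
) -> Tuple[str, Dict[str, float]]:
    """Return the model with the highest mean estimated speedup, plus all scores."""
    scores = {
        name: mean_estimated_speedup(choices, timings, max_threads)
        for name, choices in choices_by_model.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical validation timings for two samples and three thread counts.
    timings: List[Timings] = [
        {1: 4.0, 2: 2.0, 4: 4.0},
        {1: 8.0, 2: 4.0, 4: 2.0},
    ]
    choices = {"model_a": [2, 4], "model_b": [1, 4]}   # hypothetical models
    print(select_model(choices, timings, max_threads=4))
```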

We have not yet finished integrating all four models into our code base. For now, we use only XGBoost as the ML model for both Setonix and Gadi. Since the estimated speedups differ by less than 10%, we expect the difference in achieved speedup to be insignificant. We will integrate the other three models in the future.

To test ADSALA on the BLAS L3 subroutines, we used, for each subroutine, a dataset separate from the training and test datasets. Each of these datasets consists of 100-120 data points sampled with a scrambled Halton sequence within the same domain from which the corresponding training dataset was sampled. This is expected to ensure a more uniform, low-discrepancy sampling of the data used for the performance analysis of our software. We compared the subroutine runtime obtained with our ADSALA software to the runtime of the original BLAS library using the maximum number of threads. Note that the speedup results include the model evaluation time incurred at runtime.
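For readers unfamiliar with low-discrepancy sampling, the sketch below generates a three-dimensional Halton sequence with a simple random digit-permutation scramble and maps it to matrix dimensions. This is a generic illustration: the specific scrambling, bases, and dimension range used for the paper's datasets are not given in this section, so `bases`, `seed`, and `max_dim` are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def scrambled_halton(
    n_points: int, bases: Sequence[int] = (2, 3, 5), seed: int = 0
) -> List[Tuple[float, ...]]:
    """Halton points in the open unit cube, with one random digit
    permutation per base. Digit 0 is kept fixed so that the implicit
    trailing zero digits of each index stay zero."""
    rng = random.Random(seed)
    perms = []
    for base in bases:
        nonzero = list(range(1, base))
        rng.shuffle(nonzero)
        perms.append([0] + nonzero)

    points = []
    for index in range(1, n_points + 1):
        point = []
        for base, perm in zip(bases, perms):
            value, scale, rest = 0.0, 1.0 / base, index
            while rest > 0:
                rest, digit = divmod(rest, base)   # peel off base-b digits
                value += perm[digit] * scale       # mirror them after the radix point
                scale /= base
            point.append(value)
        points.append(tuple(point))
    return points


def to_dimensions(point: Sequence[float], max_dim: int) -> Tuple[int, ...]:
    """Map a point of the unit cube to integer sizes in [1, max_dim]."""
    return tuple(min(max_dim, 1 + int(u * max_dim)) for u in point)


if __name__ == "__main__":
    for p in scrambled_halton(5):
        print(to_dimensions(p, max_dim=4096))   # max_dim is an assumed bound
```

Because consecutive points fill the cube evenly, even around a hundred samples cover the (m, k, n) domain without the clustering that independent random draws can produce.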

The testing shows that our ADSALA software can reliably improve the performance of BLAS L3 subroutines on both Setonix and Gadi. Table VII shows the statistics of the speedup results. The mean speedup is generally higher on Setonix than on Gadi, as is its standard deviation. Precision-wise, double precision subroutines generally have higher speedups than their single precision counterparts. Subroutine-wise, SYMM has the highest mean speedup on both platforms, while GEMM and SYR2K have the lowest mean speedup on Setonix and Gadi, respectively.
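The bookkeeping behind such speedup statistics can be sketched as follows. The sketch charges the model evaluation time to the tuned run, in line with the statement that the reported speedups include it; the use of the sample standard deviation and the example numbers are assumptions for illustration and are not measured values.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def speedup(baseline_time: float, tuned_time: float, model_eval_time: float) -> float:
    """Speedup of a tuned call over the max-thread baseline, with the
    model evaluation overhead added to the tuned run."""
    return baseline_time / (tuned_time + model_eval_time)


def summarize(speedups: Sequence[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, minimum and maximum of the speedups."""
    return {
        "mean": mean(speedups),
        "std": stdev(speedups),
        "min": min(speedups),
        "max": max(speedups),
    }


if __name__ == "__main__":
    # Hypothetical (baseline, tuned, model evaluation) times in seconds.
    runs = [(1.0, 0.75, 0.25), (2.0, 0.75, 0.25), (3.0, 0.75, 0.25)]
    print(summarize([speedup(*run) for run in runs]))
```

A value below 1 indicates a deceleration, which can occur when the prediction overhead outweighs the gain from the selected thread count.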

Figure 7 visualises the GEMM speedup distribution on Setonix and Gadi as a 3D heatmap from three angles, where red indicates acceleration and blue indicates deceleration. The speedup distribution resembles the pattern of the room for improvement, and the speedup generally decreases as the three dimensions get larger.

Figure 6 shows the SYMM, SYRK, SYR2K, TRMM, and TRSM speedup distribution with respect to matrix dimensions on Setonix and Gadi. The speedup patterns again resemble those in Fig. 4. The double/single precision subroutine pairs generally show similar speedup patterns. However, the patterns differ between subroutines, and the speedup does not always decrease as the matrix dimensions get larger.

VI-C Performance Assessment using Software Implementation

We use Table VIII to explain the large speedups of ADSALA over the original BLAS package in some cases. These cases have considerable speedup and are selected from the test set for software testing.

We use Intel® Advisor and Intel® VTune to profile these two GEMMs on Gadi, repeating each matrix multiplication 100 times with different input values. Table VIII shows the time breakdown of GEMM calls. The SGEMM wall-time consists of three main components:

  1. Data copies. The BLAS runtime uses a buffer for each thread as a workspace and copies blocks of matrix operands into it before computation. The copy time depends on the matrix sizes, memory placement, and thread number.

  2. Thread synchronization.

  3. BLAS L3 kernel calls, where the FLOPs are performed. The kernels depend on the thread number and the matrix block sizes. They can be compute-bound for large blocks.

Generally speaking, the speedup of ADSALA comes from a reduction in all three components. Because the problem sizes are relatively small, thread synchronization accounts for the largest share of the time in most cases. The thread synchronization time improves by between around 30% and more than a factor of 50. Data copies are the second-largest contributor, and their speedup also reduces the total runtime. Although the kernel calls consume minimal time, the number of threads selected by the ML model also reduces the time spent in them.
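The relation between the three components and the overall speedup can be made explicit with a simple decomposition. The notation below is introduced here for illustration and is not taken from the paper: $p_{\max}$ is the maximum thread count, $p^{*}$ the thread count selected by the ML model, and the model evaluation time is neglected.

```latex
% Assumed notation: c ranges over the three profiled components.
\begin{align}
T(p) &= T_{\mathrm{copy}}(p) + T_{\mathrm{sync}}(p) + T_{\mathrm{kernel}}(p), \\
S &= \frac{T(p_{\max})}{T(p^{*})}, \qquad
S_c = \frac{T_c(p_{\max})}{T_c(p^{*})}, \qquad
w_c = \frac{T_c(p_{\max})}{T(p_{\max})}, \\
\frac{1}{S} &= \sum_{c \in \{\mathrm{copy},\,\mathrm{sync},\,\mathrm{kernel}\}} \frac{w_c}{S_c}.
\end{align}
```

The last line follows from $T_c(p^{*}) = T_c(p_{\max})/S_c$. It shows that the overall speedup is a weighted harmonic mean of the per-component speedups, so the component with the largest share $w_c$ of the baseline time, which is thread synchronization for the small problems discussed here, dominates the result.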

VII Conclusions

We presented an extension to the ADSALA library, which previously optimized only GEMM. We extended its approach to the other BLAS L3 routines, which take different input matrices. While keeping the same method for selecting the ML algorithm, we found that the optimal algorithm depends on both the architecture and the subroutine.

We improved the performance of all six BLAS L3 subroutines, with speedups ranging from 1.1 to 3.0. We also visualised and analysed the speedup patterns of different BLAS subroutines and HPC platforms. By profiling, we explained the large speedups in some cases and showed that the source of the speedup is the reduction of all three main components of the multi-threaded subroutine runtime.

Our method is applicable to runtime parameter predictions, including multi-thread BLAS I, II, and LAPACK operations, which are sensitive to prediction duration. Future work will investigate automatic processor selection, such as the feasibility of running BLAS/LAPACK operations on either a GPU or CPU. At present, our approach is applicable to problems that satisfy certain conditions: a well-defined objective function, a finite and discrete search space, a set of relevant features, and sufficient data for training and validating ML models.

Our work shows the effectiveness and generality of the ADSALA approach for optimizing BLAS routines on modern multi-core systems. In future work, we plan to extend our ML-driven runtime thread selection approach to other BLAS operations and to a more diverse set of computing hardware, including GPU accelerators and heterogeneous systems.

References

  • [1] L. S. Blackford, A. Petitet, R. Pozo, K. Remington, R. C. Whaley, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry et al., “An updated set of basic linear algebra subprograms (blas),” ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 135–151, 2002.
  • [2] R. C. Whaley and J. J. Dongarra, “Automatically tuned linear algebra software,” in SC’98: Proceedings of the 1998 ACM/IEEE conference on Supercomputing.   IEEE, 1998, pp. 38–38.
  • [3] J. A. Gunnels, G. M. Henry, and R. A. Geijn, “A family of high-performance matrix multiplication algorithms,” in International Conference on Computational Science.   Springer, 2001, pp. 51–60.
  • [4] R. Vuduc, J. W. Demmel, and J. Bilmes, “Statistical models for automatic performance tuning,” in International Conference on Computational Science.   Springer, 2001, pp. 117–126.
  • [5] J. A. Gunnels, A systematic approach to the design and analysis of linear algebra algorithms.   The University of Texas at Austin, 2001.
  • [6] V. Valsalam and A. Skjellum, “A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels,” Concurrency and Computation: Practice and Experience, vol. 14, no. 10, pp. 805–839, 2002.
  • [7] J. Demmel, J. Dongarra, V. Eijkhout, E. Fuentes, A. Petitet, R. Vuduc, R. C. Whaley, and K. Yelick, “Self-adapting linear algebra algorithms and software,” Proceedings of the IEEE, vol. 93, no. 2, pp. 293–312, 2005.
  • [8] J. R. Herrero Zaragoza, A framework for efficient execution of matrix computations.   Universitat Politècnica de Catalunya, 2006.
  • [9] J. Cuenca, D. Giménez, and J. González, “Towards the design of an automatically tuned linear algebra library,” in 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing (EUROMICRO-PDP 2002).   IEEE Computer Society, 2002, pp. 0201–0201.
  • [10] J. Cuenca, D. Giménez, and J. González, “Architecture of an automatically tuned linear algebra library,” Parallel Computing, vol. 30, no. 2, pp. 187–210, feb 2004.
  • [11] T. Katagiri, S. Ohshima, and M. Matsumoto, “Auto-tuning on numa and many-core environments with an fdm code,” in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017, pp. 1399–1407.
  • [12] S. Catalán, J. R. Herrero, E. S. Quintana-Ortí, R. Rodríguez-Sánchez, and R. Van De Geijn, “A case for malleable thread-level linear algebra libraries: The lu factorization with partial pivoting,” IEEE access, vol. 7, pp. 17 617–17 633, 2019.
  • [13] J. Cámara, J. Cuenca, L.-P. García, and D. Giménez, “Empirical modelling of linear algebra shared-memory routines,” Procedia Computer Science, vol. 18, pp. 110–119, 2013.
  • [14] E. Peise and P. Bientinesi, “Performance modeling for dense linear algebra,” in 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.   IEEE, 2012, pp. 406–416.
  • [15] Y. Xia, M. De La Pierre, A. S. Barnard, and G. M. J. Barca, “A machine learning approach towards runtime optimisation of matrix multiplication,” in 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2023, pp. 524–534.
  • [16] Y.-C. Lin, Y. Chen, S. Gobriel, N. Jain, G. K. Jha, and V. Prasanna, “Argo: An auto-tuning runtime system for scalable gnn training on multi-core processor,” arXiv preprint arXiv:2402.03671, 2024.
  • [17] M. P. Deisenroth, A. A. Faisal, and C. S. Ong, Mathematics for machine learning.   Cambridge University Press, 2020.
  • [18] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
  • [19] C. M. Bishop and M. E. Tip**, “Bayesian regression and classification,” Nato Science Series sub Series III Computer And Systems Sciences, vol. 190, pp. 267–288, 2003.
  • [20] C. M. Bishop and N. M. Nasrabadi, Pattern recognition and machine learning.   Springer, 2006, vol. 4, no. 4.
  • [21] M. Awad and R. Khanna, “Support vector regression,” in Efficient learning machines.   Springer, 2015, pp. 67–80.
  • [22] N. Bhatia et al., “Survey of nearest neighbor techniques,” arXiv preprint arXiv:1007.0085, 2010.
  • [23] J. Han, J. Pei, and H. Tong, Data mining: concepts and techniques.   Morgan kaufmann, 2022.
  • [24] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “Lof: Identifying density-based local outliers,” SIGMOD Rec., vol. 29, no. 2, p. 93–104, may 2000.
  • [25] R. M. Sakia, “The box-cox transformation technique: a review,” Journal of the Royal Statistical Society: Series D (The Statistician), vol. 41, no. 2, pp. 169–178, 1992.
  • [26] S. Weisberg, “Yeo-johnson power transformations,” Department of Applied Statistics, University of Minnesota. Retrieved June, vol. 1, p. 2003, 2001.
  • [27] M. Mascagni and H. Chi, “On the scrambled halton sequence,” vol. 10, no. 3-4, pp. 435–442, 2004.
  • [28] A. Géron, Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems.   ” O’Reilly Media, Inc.”, 2019.