Machine-Learning-Driven Runtime Optimization of BLAS Level 3 on Modern Multi-Core Systems

Yufan Xia
The Chinese University of Hong Kong
Hong Kong SAR, China
[email protected]

Giuseppe Maria Junior Barca
The University of Melbourne
Melbourne, Australia
[email protected]

Abstract

BLAS Level 3 operations are essential for scientific computing, but finding the optimal number of threads for multi-threaded implementations on modern multi-core systems is challenging. We present an extension to the Architecture and Data-Structure Aware Linear Algebra (ADSALA) library that uses machine learning to optimize the runtime of all BLAS Level 3 operations. Our method predicts the best number of threads for each operation based on the matrix dimensions and the system architecture. We test our method on two HPC platforms with Intel and AMD processors, using MKL and BLIS as baseline BLAS implementations. We achieve speedups of 1.5 to 3.0 for all operations, compared to using the maximum number of threads. We also analyze the runtime patterns of different BLAS operations and explain the sources of speedup. Our work shows the effectiveness and generality of the ADSALA approach for optimizing BLAS routines on modern multi-core systems.

I Introduction

The linear algebra subroutines in the Basic Linear Algebra Subprograms (BLAS) [1] form the backbone of scientific computing. Because of this critical role, great effort has been devoted to improving the performance of its subroutines. The Automatically Tuned Linear Algebra Software (ATLAS) and its improvements represent the first batch of auto-tuning efforts for optimising linear algebra operations; they auto-tune the linear algebra operation codes by searching over parameters such as blocking factor, loop order, and partial storage location on each specific computer architecture [2, 3, 4]. Later in the 2000s, the self-optimised linear algebra routine (SOLAR) attempted to model analytically the timing of multi-process linear algebra operations [5, 6, 7, 8, 9, 10].

Within BLAS, Level 3 (L3) operations, which are concerned with matrix-matrix linear algebra, are the most complex and computationally demanding [1]. In addition to the General Matrix Multiplication (GEMM), BLAS L3 includes the Symmetric Matrix-Matrix Multiplication (SYMM), Symmetric Rank-k Update (SYRK), Symmetric Rank-2k Update (SYR2K), Triangular Matrix-Matrix Multiplication (TRMM), and Triangular Solve with Multiple Right-Hand Sides (TRSM) subroutines.
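For orientation, the operations computed by these subroutines can be summarised as below. This is a sketch based on the standard reference BLAS definitions, not on text from this article; it shows only the left-sided, non-transposed variants, with $\alpha$ and $\beta$ denoting scalars.

```latex
\begin{align*}
\text{GEMM:}  \quad & C \leftarrow \alpha A B + \beta C \\
\text{SYMM:}  \quad & C \leftarrow \alpha A B + \beta C, \quad A = A^{T} \\
\text{SYRK:}  \quad & C \leftarrow \alpha A A^{T} + \beta C \\
\text{SYR2K:} \quad & C \leftarrow \alpha A B^{T} + \alpha B A^{T} + \beta C \\
\text{TRMM:}  \quad & B \leftarrow \alpha A B, \quad A \text{ triangular} \\
\text{TRSM:}  \quad & B \leftarrow \alpha A^{-1} B, \quad A \text{ triangular}
\end{align*}
```

TRMM and TRSM overwrite $B$ in place, so they have no separate output matrix $C$.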

While there has been significant work on optimizing the single-threaded CPU performance of BLAS L3 by fine-tuning the size of matrix blocks for various system architectures, similar efforts targeting multi-core CPUs have been less frequent because of the high complexity and large variety of the underlying computer architectures.

Some work has applied established optimization techniques such as parameter tuning and blocking [11, 12]. The Parallel Linear Algebra for Scalable Multi-Core Architectures (PLASMA) [13] library uses polynomial regression to optimise the number of threads or the block size based on empirical features, obtaining performance comparable to the Intel® MKL for Cholesky decomposition. Peise et al. adopted polynomial regression to model dense linear algebra run times, and applied additional techniques to boost the performance of this simple model [14].

However, choosing the number of threads that minimises the execution time of a given BLAS routine remains challenging and largely unsolved because of the underlying diversity and complexity of modern shared-memory computer architectures.

To tackle this challenge, recent research by Xia et al. [15], integrated in the open-source Architecture and Data-Structure Aware Linear Algebra (ADSALA) library, employed a systematic machine learning (ML) methodology to significantly reduce the runtime of Single-precision General Matrix Multiplication (SGEMM) operations. This approach uses an ML model to dynamically select the number of threads that minimizes the execution time of GEMM for a specific input configuration and computer system architecture. The ML model is trained during the installation phase, and this training is customized to the particular computer system architecture in use. At runtime, the model then predicts the most efficient number of threads for a given task. An important advantage of this approach is that it adopts existing GEMM implementations and treats them as black boxes. This work has inspired a novel auto-tuning method in Graph Neural Network training [16].
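The runtime selection step described above can be illustrated with a minimal Python sketch. It assumes that the trained model predicts a runtime for a given problem size and thread count, and that the thread count with the smallest prediction is chosen; the source does not spell out this mechanism at this point. The names `select_thread_count` and `toy_predictor`, and the cost constants inside the toy predictor, are hypothetical and stand in for a model trained at installation time.

```python
from typing import Callable, Iterable, Tuple

Dims = Tuple[int, int, int]


def select_thread_count(
    predict_runtime: Callable[[Dims, int], float],
    dims: Dims,
    candidates: Iterable[int],
) -> int:
    """Return the candidate thread count with the smallest predicted runtime."""
    best_nt, best_time = None, float("inf")
    for nt in candidates:
        t = predict_runtime(dims, nt)
        if t < best_time:
            best_nt, best_time = nt, t
    if best_nt is None:
        raise ValueError("no candidate thread counts were given")
    return best_nt


def toy_predictor(dims: Dims, nt: int) -> float:
    """Hypothetical stand-in for a trained model (not the ADSALA model).

    Assumes perfectly divisible work plus a per-thread synchronisation
    cost; both constants are made up for illustration.
    """
    m, k, n = dims
    work = m * k * n
    return work / nt * 1e-9 + nt * 5e-6


if __name__ == "__main__":
    max_threads = 96
    small = select_thread_count(toy_predictor, (64, 2048, 64), range(1, max_threads + 1))
    large = select_thread_count(toy_predictor, (4096, 4096, 4096), range(1, max_threads + 1))
    print(small, large)
```

Under this toy cost model, a small problem is assigned fewer threads than the maximum, while a large problem still uses all of them. The BLAS call itself stays a black box; only the thread count passed to it changes.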

In this study, we have expanded the runtime optimization capabilities of the ADSALA library to encompass all single- and double-precision BLAS L3 operations, and we present the performance results achieved. Furthermore, we have incorporated an automatic model selection feature during the installation process for each distinct BLAS subroutine. This addition aids in identifying and using the most appropriate machine learning algorithm for each subroutine on every machine where the library is installed. We test our method on two HPC platforms with Intel and AMD processors, using multi-threaded MKL and BLIS as baseline BLAS implementations. Our software implementation enables us to speed up all BLAS L3 subroutines, notwithstanding the runtime overhead of ML evaluations.

On the Gadi supercomputer located at the National Computational Infrastructure (NCI) and on the Setonix supercomputer located at the Pawsey Supercomputing Centre (see Section V for details about NCI and Pawsey), we achieve speedups of $1.5\times$ to $3.0\times$ compared to using the maximum number of threads, with hyperthreading either enabled or disabled. For comparison, the most recent method relevant to our study employs pure polynomial regression to model and determine the number of threads for the PLASMA QR routine [13]. This method reported average speed improvements of 1%, 13%, and 27% across three different platforms. The sizes of the matrices used in these tests ranged from 2000 to 7000, which aligns closely with the range we tested in our study.

We also analyze the runtime patterns of different ADSALA BLAS L3 runs and discuss the sources of speedup. The ADSALA library is provided as an open-source implementation for the community to use and extend.

The remainder of this article is structured as follows. We provide background information concerning BLAS L3 operations and ML in Section II. We review the improved software design in Section III, and discuss the adopted ML methods in Section IV. Section V details the experimentation platform and settings. We present and analyze the performance of the ML models and software speedup in Section VI. Section VII concludes.

TABLE I: Specifications of BLAS level III subroutines.


Subroutine	dims	A shape	A type	B shape	B type	C shape	C type
GEMM	3	$m \times k$	regular	$k \times n$	regular	$m \times n$	regular
SYMM	2	$m \times m$	symmetric	$m \times n$	regular	$m \times n$	regular
SYRK	2	$n \times k$	regular	$n \times k$	regular	$n \times n$	symmetric
SYR2K	2	$n \times k$	regular	$n \times k$	regular	$n \times n$	symmetric
TRMM	2	$m \times m$	triangular	$m \times n$	regular	-	-
TRSM	2	$m \times m$	triangular	$m \times n$	regular	-	-

TABLE II: Comparisons of ML model characteristics.


Model Categories	Models	Parametric	Good with Data Imbalance	Data Size Requirement
Linear Models	Linear Regression	Yes	No	Medium
Linear Models	ElasticNet	Yes	No	Medium
Linear Models	Bayesian Regression	Yes	No	Small
Tree Based Models	Decision Tree	No	Yes	Medium
Tree Based Models	XGBoost	No	Yes	Medium
Tree Based Models	AdaBoost	No	Yes	Medium
Tree Based Models	Random Forest	No	Yes	Medium
Tree Based Models	LightGBM	No	Yes	Medium
Other Models	SVM Regressor	No	No	Small
Other Models	KNN Regressor	No	No	Medium

	Three dimensions ($m$, $k$, $n$)		Two dimensions ($m$, $n$)
1	m	m/nt	m
2	k	k/nt	n
3	n	n/nt	nt
4	nt	m*k/nt	m*n
5	m*k	m*n/nt	memory_footprint
6	m*n	k*n/nt	m/nt
7	k*n	mkn/nt	n/nt
8	mkn	memory_footprint/nt	m*n/nt
9	memory_footprint		memory_footprint/nt
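The feature lists above can be generated mechanically from the matrix dimensions and the thread count `nt`. The following Python sketch shows one way to do so. The function names are hypothetical, and the definition of `memory_footprint` for the three-dimensional case (total bytes of the three GEMM operands) is an assumption, since the table does not define it; for the two-dimensional case the footprint is passed in by the caller because it depends on the subroutine.

```python
def features_3d(m, k, n, nt, bytes_per_element=8):
    """Build the 17 features listed for three-dimensional subroutines."""
    # Assumption: footprint = bytes occupied by A (m x k), B (k x n), C (m x n).
    footprint = (m * k + k * n + m * n) * bytes_per_element
    base = {
        "m": m,
        "k": k,
        "n": n,
        "m*k": m * k,
        "m*n": m * n,
        "k*n": k * n,
        "mkn": m * k * n,
        "memory_footprint": footprint,
    }
    feats = {"nt": nt}
    feats.update(base)
    # Every size-related feature also appears divided by the thread count.
    feats.update({f"{name}/nt": value / nt for name, value in base.items()})
    return feats


def features_2d(m, n, nt, memory_footprint):
    """Build the 9 features listed for two-dimensional subroutines."""
    base = {
        "m": m,
        "n": n,
        "m*n": m * n,
        "memory_footprint": memory_footprint,
    }
    feats = {"nt": nt}
    feats.update(base)
    feats.update({f"{name}/nt": value / nt for name, value in base.items()})
    return feats


if __name__ == "__main__":
    for name, value in features_3d(64, 2048, 64, nt=16).items():
        print(f"{name:>20} = {value}")
```

Including the per-thread ratios as explicit features spares the model from having to learn the division by `nt` itself, which matters most for the linear models.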

subroutine	best_model
dgemm	LinearRegression
dsymm	XGBRegressor
dsyr2k	XGBRegressor
dsyrk	XGBRegressor
dtrmm	XGBRegressor
dtrsm	XGBRegressor
sgemm	XGBRegressor
ssymm	XGBRegressor
ssyr2k	XGBRegressor
ssyrk	XGBRegressor
strmm	XGBRegressor
strsm	XGBRegressor

subroutine	best_model
dgemm	BayesianRidge
dsymm	RandomForestRegressor
dsyr2k	XGBRegressor
dsyrk	LinearRegression
dtrmm	XGBRegressor
dtrsm	LinearRegression
sgemm	XGBRegressor
ssymm	RandomForestRegressor
ssyr2k	XGBRegressor
ssyrk	XGBRegressor
strmm	XGBRegressor
strsm	LinearRegression


Model	Normalised Test RMSE	Ideal mean speedup	Ideal aggregate speedup	Model evaluation time ($\mu$s)	Estimated mean speedup	Estimated aggregate speedup
Linear Regression	0.93	1.13	1.01	15.3	1.12	1.01
ElasticNet	1.00	1.07	0.94	10.63	1.07	0.94
Bayes Regression	0.93	1.13	1.01	8.11	1.13	1.01
Decision Tree	0.32	0.82	0.36	8.02	0.82	0.36
Random Forest	0.24	1.08	0.97	983.09	1.01	0.94
AdaBoost	0.52	0.84	0.46	89.77	0.84	0.46
KNN	0.28	1.07	0.95	6449.32	0.78	0.78
XGBoost	0.23	1.1	0.98	290.76	1.08	0.97
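The gap between the ideal and estimated columns above can be illustrated with a short Python sketch. It rests on assumptions that this excerpt does not confirm: that the mean speedup is the average of per-sample runtime ratios, that the aggregate speedup is the ratio of summed runtimes, and that the estimated variants add the model evaluation time to every ML-guided run. The function names and the selection rule (pick the model with the highest estimated mean speedup) are likewise hypothetical; the numbers in `ESTIMATED_MEAN_SPEEDUP` are copied from the table.

```python
def speedups(t_max, t_ml, eval_time=0.0):
    """Return (mean, aggregate) speedup of ML-guided runs over max-thread runs.

    t_max: runtimes with the maximum number of threads
    t_ml: runtimes with the ML-selected number of threads
    eval_time: per-call model evaluation overhead, in the same time unit
    """
    if len(t_max) != len(t_ml) or not t_max:
        raise ValueError("t_max and t_ml must be non-empty and equally long")
    ratios = [a / (b + eval_time) for a, b in zip(t_max, t_ml)]
    mean = sum(ratios) / len(ratios)
    aggregate = sum(t_max) / (sum(t_ml) + eval_time * len(t_ml))
    return mean, aggregate


# Estimated mean speedups as reported in the table above.
ESTIMATED_MEAN_SPEEDUP = {
    "Linear Regression": 1.12,
    "ElasticNet": 1.07,
    "Bayes Regression": 1.13,
    "Decision Tree": 0.82,
    "Random Forest": 1.01,
    "AdaBoost": 0.84,
    "KNN": 0.78,
    "XGBoost": 1.08,
}


def pick_model(scores):
    """Assumed selection rule: highest estimated mean speedup wins."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(speedups([2.0, 4.0], [1.0, 4.0]))
    print(speedups([2.0, 4.0], [1.0, 4.0], eval_time=1.0))
    print(pick_model(ESTIMATED_MEAN_SPEEDUP))
```

Under these assumptions, a model with a low test RMSE can still lose out when its evaluation time is large relative to the BLAS call, which is consistent with the KNN row: its ideal mean speedup of 1.07 drops to an estimated 0.78.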

	Mean	Std	Min	25%	50%	75%	Max
dgemm	1.54	0.66	0.87	1.16	1.29	1.74	4.79
sgemm	1.32	0.41	0.76	1.05	1.18	1.37	9.05
dsymm	2.89	1.80	0.82	1.38	2.36	4.09	8.46
ssymm	2.22	1.69	0.34	1.04	1.60	2.88	7.42
dsyr2k	1.48	0.46	0.94	1.17	1.36	1.56	3.59
ssyr2k	1.53	0.49	0.75	1.20	1.43	1.72	3.79
dsyrk	1.46	0.46	0.92	1.15	1.28	1.69	3.54
ssyrk	1.55	0.42	0.98	1.22	1.46	1.82	2.80
dtrmm	1.61	0.82	0.87	1.13	1.27	1.78	6.51
strmm	1.67	0.92	0.94	1.10	1.34	1.81	7.13
dtrsm	1.71	1.07	0.99	1.13	1.27	1.77	7.43
strsm	1.68	1.21	0.98	1.12	1.34	1.85	12.38

	Mean	Std	Min	25%	50%	75%	Max
dgemm	1.27	0.55	0.41	1.01	1.13	1.29	4.25
sgemm	1.07	0.70	0.88	1.00	1.00	1.02	3.01
dsymm	2.28	1.89	0.88	1.18	1.43	2.50	12.08
ssymm	2.16	1.98	0.99	1.18	1.30	2.22	11.05
dsyr2k	1.28	0.38	0.90	1.08	1.20	1.37	4.27
ssyr2k	1.47	0.60	0.41	1.14	1.26	1.61	4.52
dsyrk	1.40	0.36	0.86	1.20	1.30	1.47	3.98
ssyrk	1.65	0.47	1.11	1.29	1.45	1.96	3.03
dtrmm	1.30	0.24	1.03	1.16	1.25	1.34	2.75
strmm	1.35	0.26	0.98	1.17	1.30	1.47	2.78
dtrsm	1.31	0.25	0.75	1.17	1.29	1.41	2.86
strsm	1.40	0.40	0.74	1.19	1.33	1.47	4.40

Subroutine	Dimensions (m, k, n where three are given)	ML	Thread Count	Total Time (s)	Thread Sync (s)	Kernel Call (s)	Data Copy (s)
dgemm	64, 2048, 64	no	96	11.001	4.806	0.010	2.060
dgemm	64, 2048, 64	yes	16	4.430	1.453	0.001	0.887
sgemm	64, 2048, 64	no	96	7.698	2.716	0.005	1.133
sgemm	64, 2048, 64	yes	5	2.701	0.155	0.001	0.301
dsymm	248, 39944	no	96	29.063	11.627	0.496	9.783
dsymm	248, 39944	yes	25	22.067	4.962	0.441	9.698
ssymm	2759, 41681	no	96	27.299	8.986	1.208	10.605
ssymm	2759, 41681	yes	12	15.597	2.085	0.241	8.254
dsyrk	124, 160163	no	96	36.060	8.545	0.005	13.672
dsyrk	124, 160163	yes	43	34.371	7.543	0.001	12.529
ssyrk	175, 15095	no	96	65.844	62.666	1.639	1.103
ssyrk	175, 15095	yes	46	4.581	1.323	1.744	1.263
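The profiling rows above can be condensed into per-case speedups and the share of total time spent in thread synchronisation. The Python sketch below does this for two of the cases. The values are copied from the table; the data layout and the function name `summarize` are illustrative choices only.

```python
# (thread count, total time in s, thread sync time in s), copied from the table.
PROFILE = {
    "dgemm 64,2048,64": {"no_ml": (96, 11.001, 4.806), "ml": (16, 4.430, 1.453)},
    "ssyrk 175,15095": {"no_ml": (96, 65.844, 62.666), "ml": (46, 4.581, 1.323)},
}


def summarize(entry):
    """Compute the speedup and the synchronisation share with and without ML."""
    _, total_no_ml, sync_no_ml = entry["no_ml"]
    threads_ml, total_ml, sync_ml = entry["ml"]
    return {
        "threads_ml": threads_ml,
        "speedup": total_no_ml / total_ml,
        "sync_share_no_ml": sync_no_ml / total_no_ml,
        "sync_share_ml": sync_ml / total_ml,
    }


if __name__ == "__main__":
    for case, entry in PROFILE.items():
        s = summarize(entry)
        print(
            f"{case}: {s['speedup']:.2f}x with {s['threads_ml']} threads, "
            f"sync share {s['sync_share_no_ml']:.0%} -> {s['sync_share_ml']:.0%}"
        )
```

For the ssyrk case, thread synchronisation accounts for most of the total time at 96 threads, and most of the reduction in total time with the ML-selected thread count comes from that component.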