Search | arXiv e-print repository

Co-Design of the Dense Linear AlgebravSoftware Stack for Multicore Processors

Authors: Héctor Martínez, Sandra Catalán, Francisco D. Igual, José R. Herrero, Rafael Rodríguez-Sánchez, Enrique S. Quintana-Ortí

Abstract: This paper advocates for an intertwined design of the dense linear algebra software stack that breaks down the strict barriers between the high-level, blocked algorithms in LAPACK (Linear Algebra PACKage) and the low-level, architecture-dependent kernels in BLAS (Basic Linear Algebra Subprograms). Specifically, we propose customizing the GEMM (general matrix multiplication) kernel, which is invoke… ▽ More This paper advocates for an intertwined design of the dense linear algebra software stack that breaks down the strict barriers between the high-level, blocked algorithms in LAPACK (Linear Algebra PACKage) and the low-level, architecture-dependent kernels in BLAS (Basic Linear Algebra Subprograms). Specifically, we propose customizing the GEMM (general matrix multiplication) kernel, which is invoked from the blocked algorithms for relevant matrix factorizations in LAPACK, to improve performance on modern multicore processors with hierarchical cache memories. To achieve this, we leverage an analytical model to dynamically adapt the cache configuration parameters of the GEMM to the shape of the matrix operands. Additionally, we accommodate a flexible development of architecture-specific micro-kernels that allow us to further improve the utilization of the cache hierarchy. Our experiments on two platforms, equipped with ARM (NVIDIA Carmel, Neon) and x86 (AMD EPYC, AVX2) multi-core processors, demonstrate the benefits of this approach in terms of better cache utilization and, in general, higher performance. However, they also reveal the delicate balance between optimizing for multi-threaded parallelism versus cache usage. △ Less

Submitted 27 April, 2023; originally announced April 2023.

arXiv:1905.05881 [pdf, other]

Resource-aware Elastic Swap Random Forest for Evolving Data Streams

Authors: Diego Marrón, Eduard Ayguadé, José Ramon Herrero, Albert Bifet

Abstract: Continual learning based on data stream mining deals with ubiquitous sources of Big Data arriving at high-velocity and in real-time. Adaptive Random Forest ({\em ARF}) is a popular ensemble method used for continual learning due to its simplicity in combining adaptive leveraging bagging with fast random Hoeffding trees. While the default ARF size provides competitive accuracy, it is usually over-p… ▽ More Continual learning based on data stream mining deals with ubiquitous sources of Big Data arriving at high-velocity and in real-time. Adaptive Random Forest ({\em ARF}) is a popular ensemble method used for continual learning due to its simplicity in combining adaptive leveraging bagging with fast random Hoeffding trees. While the default ARF size provides competitive accuracy, it is usually over-provisioned resulting in the use of additional classifiers that only contribute to increasing CPU and memory consumption with marginal impact in the overall accuracy. This paper presents Elastic Swap Random Forest ({\em ESRF}), a method for reducing the number of trees in the ARF ensemble while providing similar accuracy. {\em ESRF} extends {\em ARF} with two orthogonal components: 1) a swap component that splits learners into two sets based on their accuracy (only classifiers with the highest accuracy are used to make predictions); and 2) an elastic component for dynamically increasing or decreasing the number of classifiers in the ensemble. The experimental evaluation of {\em ESRF} and comparison with the original {\em ARF} shows how the two new components contribute to reducing the number of classifiers up to one third while providing almost the same accuracy, resulting in speed-ups in terms of per-sample execution time close to 3x. △ Less

Submitted 14 May, 2019; originally announced May 2019.

arXiv:1904.12754 [pdf, ps, other]

A highly parallel algorithm for computing the action of a matrix exponential on a vector based on a multilevel Monte Carlo method

Authors: Juan A. Acebron, Jose R. Herrero, Jose Monteiro

Abstract: A novel algorithm for computing the action of a matrix exponential over a vector is proposed. The algorithm is based on a multilevel Monte Carlo method, and the vector solution is computed probabilistically generating suitable random paths which evolve through the indices of the matrix according to a suitable probability law. The computational complexity is proved in this paper to be significantly… ▽ More A novel algorithm for computing the action of a matrix exponential over a vector is proposed. The algorithm is based on a multilevel Monte Carlo method, and the vector solution is computed probabilistically generating suitable random paths which evolve through the indices of the matrix according to a suitable probability law. The computational complexity is proved in this paper to be significantly better than the classical Monte Carlo method, which allows the computation of much more accurate solutions. Furthermore, the positive features of the algorithm in terms of parallelism were exploited in practice to develop a highly scalable implementation capable of solving some test problems very efficiently using high performance supercomputers equipped with a large number of cores. For the specific case of shared memory architectures the performance of the algorithm was compared with the results obtained using an available Krylov-based algorithm, outperforming the latter in all benchmarks analyzed so far. △ Less

Submitted 4 July, 2019; v1 submitted 29 April, 2019; originally announced April 2019.

arXiv:1709.00302 [pdf, other]

Look-Ahead in the Two-Sided Reduction to Compact Band Forms for Symmetric Eigenvalue Problems and the SVD

Authors: Rafael Rodríguez-Sánchez, Sandra Catalán, José R. Herrero, Enrique S. Quintana-Ortí, Andrés E. Tomás

Abstract: We address the reduction to compact band forms, via unitary similarity transformations, for the solution of symmetric eigenvalue problems and the computation of the singular value decomposition (SVD). Concretely, in the first case we revisit the reduction to symmetric band form while, for the second case, we propose a similar alternative, which transforms the original matrix to (unsymmetric) band… ▽ More We address the reduction to compact band forms, via unitary similarity transformations, for the solution of symmetric eigenvalue problems and the computation of the singular value decomposition (SVD). Concretely, in the first case we revisit the reduction to symmetric band form while, for the second case, we propose a similar alternative, which transforms the original matrix to (unsymmetric) band form, replacing the conventional reduction method that produces a triangular--band output. In both cases, we describe algorithmic variants of the standard Level-3 BLAS-based procedures, enhanced with look-ahead, to overcome the performance bottleneck imposed by the panel factorization. Furthermore, our solutions employ an algorithmic block size that differs from the target bandwidth, illustrating the important performance benefits of this decision. Finally, we show that our alternative compact band form for the SVD is key to introduce an effective look-ahead strategy into the corresponding reduction procedure. △ Less

Submitted 6 November, 2017; v1 submitted 1 September, 2017; originally announced September 2017.

arXiv:1611.06365 [pdf, other]

A Case for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization with Partial Pivoting

Authors: Sandra Catalán, José R. Herrero, Enrique S. Quintana-Ortí, Rafael Rodríguez-Sánchez, Robert van de Geijn

Abstract: We propose two novel techniques for overcoming load-imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first tec… ▽ More We propose two novel techniques for overcoming load-imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task. The second technique allows a fast task to alert the slower task of completion, enforcing the early termination (ET) of the second task, and a smooth transition of the factorization procedure into the next iteration. The two mechanisms are instantiated via a new malleable thread-level implementation of the Basic Linear Algebra Subprograms (BLAS), and their benefits are illustrated via an implementation of the LU factorization with partial pivoting enhanced with look-ahead. Concretely, our experimental results on a six core Intel-Xeon processor show the benefits of combining WS+ET, reporting competitive performance in comparison with a task-parallel runtime-based solution. △ Less

Submitted 19 November, 2016; originally announced November 2016.

arXiv:1511.02171 [pdf, other]

Multi-Threaded Dense Linear Algebra Libraries for Low-Power Asymmetric Multicore Processors

Authors: Sandra Catalán, José R. Herrero, Francisco D. Igual, Rafael Rodríguez-Sánchez, Enrique S. Quintana-Ortí

Abstract: Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high performance implementations of the BLAS (and LAPACK) functionality for many current multi-threaded architectures,the adaption of these libraries for asymmetric multicore processors (AMPs)is still pending. In this paper we… ▽ More Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high performance implementations of the BLAS (and LAPACK) functionality for many current multi-threaded architectures,the adaption of these libraries for asymmetric multicore processors (AMPs)is still pending. In this paper we address this challenge by develo** an asymmetry-aware implementation of the BLAS, based on the BLIS framework, and tailored for AMPs equipped with two types of cores: fast/power hungry versus slow/energy efficient. For this purpose, we integrate coarse-grain and fine-grain parallelization strategies into the library routines which, respectively, dynamically distribute the workload between the two core types and statically repartition this work among the cores of the same type. Our results on an ARM big.LITTLE processor embedded in the Exynos 5422 SoC, using the asymmetry-aware version of the BLAS and a plain migration of the legacy version of LAPACK, experimentally assess the benefits, limitations, and potential of this approach. △ Less

Submitted 6 November, 2015; originally announced November 2015.

arXiv:1202.4236 [pdf, ps, other]

A study on new computational local orders of convergence

Authors: Miquel Grau-Sánchez, Miquel Noguera, Àngela Grau, José R. Herrero

Abstract: Four new variants of the Computational Order of Convergence (COC) of a one-point iterative method with memory for solving nonlinear equations are presented. Furthermore, the way to approximate the new variants to the local order of convergence is analyzed. Three of the new definitions given here do not involve the unknown root. Numerical experiments using adaptive arithmetic with multiple precisio… ▽ More Four new variants of the Computational Order of Convergence (COC) of a one-point iterative method with memory for solving nonlinear equations are presented. Furthermore, the way to approximate the new variants to the local order of convergence is analyzed. Three of the new definitions given here do not involve the unknown root. Numerical experiments using adaptive arithmetic with multiple precision and a stop** criteria are implemented without using any known root. △ Less

Submitted 20 February, 2012; originally announced February 2012.

MSC Class: 65H05; 41A25

Showing 1–7 of 7 results for author: Herrero, J R