Search | arXiv e-print repository

An Auto-tuning Method for Run-time Data Transformation for Sparse Matrix-Vector Multiplication

Authors: Takahiro Katagiri, Masahiko Sato

Abstract: In this paper, we research the run-time sparse matrix data transformation from Compressed Row Storage (CRS) to Coordinate (COO) storage and an ELL (ELLPACK/ITPACK) format with OpenMP parallelization for sparse matrix-vector multiplication (SpMV). We propose an auto-tuning (AT) method by using the $D_{mat}^i$ - $R_{ell}^i$ graph, which plots the derivation/average for the number of non-zero element… ▽ More In this paper, we research the run-time sparse matrix data transformation from Compressed Row Storage (CRS) to Coordinate (COO) storage and an ELL (ELLPACK/ITPACK) format with OpenMP parallelization for sparse matrix-vector multiplication (SpMV). We propose an auto-tuning (AT) method by using the $D_{mat}^i$ - $R_{ell}^i$ graph, which plots the derivation/average for the number of non-zero elements per row ($D_{mat}^i$) and the ratio, SpMV speedups/transformation time from the CRS to ELL ($R_{ell}^i$ ). The experimental results show the ELL format is very effective in the Earth Simulator 2. The speedup factor of 151 with the ELL-Row inner-parallelized format is obtained. The transformation overhead is also very small, such as 0.01 to 1.0 SpMV time with the CRS format. In addition, the $D_{mat}^i$ - $R_{ell}^i$ graph can be modeled for the effectiveness of transformation according to the $D_{mat}^i$ value. △ Less

Submitted 6 May, 2024; originally announced July 2024.

arXiv:2405.10973 [pdf]

Adaptation of XAI to Auto-tuning for Numerical Libraries

Authors: Shota Aoki, Takahiro Katagiri, Satoshi Ohshima, Masatoshi Kawai, Toru Nagai, Tetsuya Hoshino

Abstract: Concerns have arisen regarding the unregulated utilization of artificial intelligence (AI) outputs, potentially leading to various societal issues. While humans routinely validate information, manually inspecting the vast volumes of AI-generated results is impractical. Therefore, automation and visualization are imperative. In this context, Explainable AI (XAI) technology is gaining prominence, ai… ▽ More Concerns have arisen regarding the unregulated utilization of artificial intelligence (AI) outputs, potentially leading to various societal issues. While humans routinely validate information, manually inspecting the vast volumes of AI-generated results is impractical. Therefore, automation and visualization are imperative. In this context, Explainable AI (XAI) technology is gaining prominence, aiming to streamline AI model development and alleviate the burden of explaining AI outputs to users. Simultaneously, software auto-tuning (AT) technology has emerged, aiming to reduce the man-hours required for performance tuning in numerical calculations. AT is a potent tool for cost reduction during parameter optimization and high-performance programming for numerical computing. The synergy between AT mechanisms and AI technology is noteworthy, with AI finding extensive applications in AT. However, applying AI to AT mechanisms introduces challenges in AI model explainability. This research focuses on XAI for AI models when integrated into two different processes for practical numerical computations: performance parameter tuning of accuracy-guaranteed numerical calculations and sparse iterative algorithm. △ Less

Submitted 12 May, 2024; originally announced May 2024.

Comments: This article has been submitted to Special Session: Performance Optimization and Auto-Tuning of Software on Multicore/Manycore Systems (POAT), In conjunction with IEEE MCSoC-2024 (Dec 16-19, 2024, Days Hotel & Suites by Wyndham Fraser Business Park, Kuala Lumpur)

arXiv:2405.01599 [pdf]

Xabclib:A Fully Auto-tuned Sparse Iterative Solver

Authors: Takahiro Katagiri, Takao Sakurai, Mitsuyoshi Igai, Shoji Itoh, Satoshi Ohshima, Hisayasu Kuroda, Ken Naono, Kengo Nakajima

Abstract: In this paper, we propose a general application programming interface named OpenATLib for auto-tuning (AT). OpenATLib is designed to establish the reusability of AT functions. By using OpenATLib, we develop a fully auto-tuned sparse iterative solver named Xabclib. Xabclib has several novel run-time AT functions. First, the following new implementations of sparse matrix-vector multiplication (SpMV)… ▽ More In this paper, we propose a general application programming interface named OpenATLib for auto-tuning (AT). OpenATLib is designed to establish the reusability of AT functions. By using OpenATLib, we develop a fully auto-tuned sparse iterative solver named Xabclib. Xabclib has several novel run-time AT functions. First, the following new implementations of sparse matrix-vector multiplication (SpMV) for thread processing are implemented:(1) non-zero elements; (2) omission of zero-elements computation for vector reduction; (3) branchless segmented scan (BSS). According to the performance evaluation and the comparison with conventional implementations, the following results are obtained: (1) 14x speedup for non-zero elements and zero-elements computation omission for symmetric SpMV; (2) 4.62x speedup by using BSS. We also develop a "numerical computation policy" that can optimize memory space and computational accuracy. Using the policy, we obtain the following: (1) an averaged 1/45 memory space reduction; (2) avoidance of the "fault convergence" situation, which is a problem of conventional solvers. △ Less

Submitted 30 April, 2024; originally announced May 2024.

Comments: This article was submitted to SC11, and also was published as a preprint for Research Gate in April 2011. Please refer to: https://www.researchgate.net/publication/258223774_Xabclib_A_Fully_Auto-tuned_Sparse_Iterative_Solver

arXiv:2405.00326 [pdf]

A Communication Avoiding and Reducing Algorithm for Symmetric Eigenproblem for Very Small Matrices

Authors: Takahiro Katagiri, Jun'ichi Iwata, Kazuyuki Uchida

Abstract: In this paper, a parallel symmetric eigensolver with very small matrices in massively parallel processing is considered. We define very small matrices that fit the sizes of caches per node in a supercomputer. We assume that the sizes also fit the exa-scale computing requirements of current production runs of an application. To minimize communication time, we added several communication avoiding an… ▽ More In this paper, a parallel symmetric eigensolver with very small matrices in massively parallel processing is considered. We define very small matrices that fit the sizes of caches per node in a supercomputer. We assume that the sizes also fit the exa-scale computing requirements of current production runs of an application. To minimize communication time, we added several communication avoiding and communication reducing algorithms based on Message Passing Interface (MPI) non-blocking implementations. A performance evaluation with up to full nodes of the FX10 system indicates that (1) the MPI non-blocking implementation is 3x as efficient as the baseline implementation, (2) the hybrid MPI execution is 1.9x faster than the pure MPI execution, (3) our proposed solver is 2.3x and 22x faster than a ScaLAPACK routine with optimized blocking size and cyclic-cyclic distribution, respectively. △ Less

Submitted 1 May, 2024; originally announced May 2024.

Comments: This article was submitted to Parallel Computing in December 9, 2013.This article was also published in IPSJ SIG Notes, Vol. 2015-HPC-148, Vol.2, pp.1-17 (February 23, 2015). (a non-reviewed technical report)

arXiv:2404.15752 [pdf]

Performance Evaluation of CMOS Annealing with Support Vector Machine

Authors: Ryoga Fukuhara, Makoto Morishita, Takahiro Katagiri, Masatoshi Kawai, Toru Nagai, Tetsuya Hoshino

Abstract: In this paper, support vector machine (SVM) performance was assessed utilizing a quantum-inspired complementary metal-oxide semiconductor (CMOS) annealer. The primary focus during performance evaluation was the accuracy rate in binary classification problems. A comparative analysis was conducted between SVM running on a CPU (classical computation) and executed on a quantum-inspired annealer. The p… ▽ More In this paper, support vector machine (SVM) performance was assessed utilizing a quantum-inspired complementary metal-oxide semiconductor (CMOS) annealer. The primary focus during performance evaluation was the accuracy rate in binary classification problems. A comparative analysis was conducted between SVM running on a CPU (classical computation) and executed on a quantum-inspired annealer. The performance outcome was evaluated using a CMOS annealing machine, thereby obtaining an accuracy rate of 93.7% for linearly separable problems, 92.7% for non-linearly separable problem 1, and 97.6% for non-linearly separable problem 2. These results reveal that a CMOS annealing machine can achieve an accuracy rate that closely rivals that of classical computation. △ Less

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2312.05779 [pdf]

Autotuning by Changing Directives and Number of Threads in OpenMP using ppOpen-AT

Authors: Toma Sakurai, Satoshi Ohshima, Takahiro Katagiri, Toru Nagai

Abstract: Recently, computers have diversified architectures. To achieve high numerical calculation software performance, it is necessary to tune the software according to the target computer architecture. However, code optimization for each environment is difficult unless it is performed by a specialist who knows computer architectures well. By applying autotuning (AT), the tuning effort can be reduced. Op… ▽ More Recently, computers have diversified architectures. To achieve high numerical calculation software performance, it is necessary to tune the software according to the target computer architecture. However, code optimization for each environment is difficult unless it is performed by a specialist who knows computer architectures well. By applying autotuning (AT), the tuning effort can be reduced. Optimized implementation by AT that enhances computer performance can be used even by non-experts. In this research, we propose a technique for AT for programs using open multi-processing (OpenMP). We propose an AT method using an AT language that changes the OpenMP optimized loop and dynamically changes the number of threads in OpenMP according to computational kernels. Performance evaluation was performed using the Fujitsu PRIMEHPC FX100, which is a K-computer type supercomputer installed at the Information Technology Center, Nagoya University. As a result, we found there was a performance increase of 1.801 times that of the original code in a plasma turbulence analysis. △ Less

Submitted 10 December, 2023; originally announced December 2023.

Showing 1–6 of 6 results for author: Katagiri, T