-
Design, Implementation and Evaluation of the SVNAPOT Extension on a RISC-V Processor
Authors:
Nikolaos-Charalampos Papadopoulos,
Stratos Psomadakis,
Vasileios Karakostas,
Nectarios Koziris,
Dionisios N. Pnevmatikatos
Abstract:
The RISC-V SVNAPOT Extension aims to remedy the performance overhead of the Memory Management Unit (MMU) under heavy memory loads. The Privileged Specification defines additional Natural-Power-of-Two (NAPOT) multiples of the 4KB base page size, with 64KB as the default candidate. In this paper we extend the MMU of the Rocket Chip Generator in order to manage the collocation of 64KB pages along with 4KB pages in the L2 TLB. We present the design challenges we had to overcome and the trade-offs of our design choices. We conduct a preliminary sensitivity analysis of the L2 TLB with different configurations/page sizes. Finally, we summarize techniques that could further improve memory management performance on RISC-V systems.
Submitted 22 June, 2024;
originally announced June 2024.
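As a rough illustration of the 64KB NAPOT pages mentioned in the SVNAPOT abstract above, the sketch below shows how a 64KB region can be encoded in a leaf page-table entry and how an address inside it is translated. It follows our reading of the Svnapot encoding in the Privileged Specification (the low four bits of the PPN field hold the pattern `1000`, and translation substitutes the low four bits of the virtual page number); it is not the paper's hardware design. The function names are hypothetical, and the separate N bit of the PTE is not modeled.

```python
PAGE_SHIFT = 12          # 4KB base page
NAPOT_64K_BITS = 4       # 64KB = 2**4 base pages
NAPOT_64K_MASK = (1 << NAPOT_64K_BITS) - 1
NAPOT_64K_PATTERN = 0b1000  # assumed Svnapot marker for a 64KB region


def napot64k_ppn_field(base_ppn):
    """Hypothetical helper: build the PPN field of a 64KB NAPOT leaf PTE."""
    if base_ppn & NAPOT_64K_MASK:
        raise ValueError("64KB NAPOT region must be 64KB-aligned")
    return base_ppn | NAPOT_64K_PATTERN


def napot64k_translate(ppn_field, vaddr):
    """Hypothetical helper: translate vaddr through a 64KB NAPOT PTE."""
    vpn = vaddr >> PAGE_SHIFT
    # Drop the marker bits and substitute the low VPN bits in their place.
    ppn = (ppn_field & ~NAPOT_64K_MASK) | (vpn & NAPOT_64K_MASK)
    return (ppn << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
```

Because all sixteen 4KB pages of the region share one entry, a TLB that understands this encoding can cover 64KB with a single slot, which is the effect the abstract targets.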
-
SmartPQ: An Adaptive Concurrent Priority Queue for NUMA Architectures
Authors:
Christina Giannoula,
Foteini Strati,
Dimitrios Siakavaras,
Georgios Goumas,
Nectarios Koziris
Abstract:
Concurrent priority queues are widely used in important workloads, such as graph applications and discrete event simulations. However, designing scalable concurrent priority queues for NUMA architectures is challenging. Even though several NUMA-oblivious implementations can scale up to a high number of threads, exploiting the potential parallelism of insert operation, NUMA-oblivious implementations scale poorly in deleteMin-dominated workloads. This is because all threads compete for accessing the same memory locations, i.e., the highest-priority element of the queue, thus incurring excessive cache coherence traffic and non-uniform memory accesses between nodes of a NUMA system. In such scenarios, NUMA-aware implementations are typically used to improve system performance on a NUMA system.
In this work, we propose an adaptive priority queue, called SmartPQ. SmartPQ tunes itself by switching between a NUMA-oblivious and a NUMA-aware algorithmic mode to achieve high performance under all contention scenarios. SmartPQ has two key components. First, it is built on top of NUMA Node Delegation (Nuddle), a generic low-overhead technique to construct efficient NUMA-aware data structures using any arbitrary concurrent NUMA-oblivious implementation as its backbone. Second, SmartPQ integrates a lightweight decision-making mechanism to decide when to switch between NUMA-oblivious and NUMA-aware algorithmic modes. Our evaluation shows that, in NUMA systems, SmartPQ performs best in all contention scenarios with an 87.9% success rate, and dynamically adapts between NUMA-aware and NUMA-oblivious algorithmic modes with negligible performance overheads. SmartPQ improves performance by 1.87x on average over SprayList, the state-of-the-art NUMA-oblivious priority queue.
Submitted 10 June, 2024;
originally announced June 2024.
-
Feature-based SpMV Performance Analysis on Contemporary Devices
Authors:
Panagiotis Mpakos,
Dimitrios Galanopoulos,
Petros Anastasiadis,
Nikela Papadopoulou,
Nectarios Koziris,
Georgios Goumas
Abstract:
The SpMV kernel is characterized by high performance variation per input matrix and computing platform. While GPUs were considered state-of-the-art for SpMV, with the emergence of advanced multicore CPUs and low-power FPGA accelerators, we need to revisit its performance and energy efficiency. This paper provides a high-level SpMV performance analysis based on structural features of matrices related to common bottlenecks of memory-bandwidth intensity, low ILP, load imbalance and memory latency overheads. Towards this, we create a wide artificial matrix dataset that spans these features and study the performance of different storage formats on nine modern HPC platforms: five CPUs, three GPUs and an FPGA. After validating our proposed methodology using real-world matrices, we analyze our extensive experimental results and draw key insights on the competitiveness of different target architectures for SpMV and the impact of each feature/bottleneck on its performance.
Submitted 8 February, 2023;
originally announced February 2023.
-
Architectural Support for Efficient Data Movement in Disaggregated Systems
Authors:
Christina Giannoula,
Kailong Huang,
Jonathan Tang,
Nectarios Koziris,
Georgios Goumas,
Zeshan Chishti,
Nandita Vijaykumar
Abstract:
Resource disaggregation offers a cost effective solution to resource scaling, utilization, and failure-handling in data centers by physically separating hardware devices in a server. Servers are architected as pools of processor, memory, and storage devices, organized as independent failure-isolated components interconnected by a high-bandwidth network. A critical challenge, however, is the high performance penalty of accessing data from a remote memory module over the network. Addressing this challenge is difficult as disaggregated systems have high runtime variability in network latencies/bandwidth, and page migration can significantly delay critical path cache line accesses in other pages. This paper introduces DaeMon, the first software-transparent and robust mechanism to significantly alleviate data movement overheads in fully disaggregated systems. First, to enable scalability to multiple hardware components in the system, we enhance each compute and memory unit with specialized engines that transparently handle data migrations. Second, to achieve high performance and provide robustness across various network, architecture and application characteristics, we implement a synergistic approach of bandwidth partitioning, link compression, decoupled data movement of multiple granularities, and adaptive granularity selection in data movements. We evaluate DaeMon in a wide variety of workloads at different network and architecture configurations using a state-of-the-art accurate simulator and demonstrate that DaeMon significantly improves system performance and data access costs over the widely-adopted approach of moving data at page granularity.
Submitted 23 January, 2023;
originally announced January 2023.
-
DaeMon: Architectural Support for Efficient Data Movement in Disaggregated Systems
Authors:
Christina Giannoula,
Kailong Huang,
Jonathan Tang,
Nectarios Koziris,
Georgios Goumas,
Zeshan Chishti,
Nandita Vijaykumar
Abstract:
Resource disaggregation offers a cost effective solution to resource scaling, utilization, and failure-handling in data centers by physically separating hardware devices in a server. Servers are architected as pools of processor, memory, and storage devices, organized as independent failure-isolated components interconnected by a high-bandwidth network. A critical challenge, however, is the high performance penalty of accessing data from a remote memory module over the network. Addressing this challenge is difficult as disaggregated systems have high runtime variability in network latencies/bandwidth, and page migration can significantly delay critical path cache line accesses in other pages.
This paper conducts a characterization analysis on different data movement strategies in fully disaggregated systems, evaluates their performance overheads in a variety of workloads, and introduces DaeMon, the first software-transparent mechanism to significantly alleviate data movement overheads in fully disaggregated systems. First, to enable scalability to multiple hardware components in the system, we enhance each compute and memory unit with specialized engines that transparently handle data migrations. Second, to achieve high performance and provide robustness across various network, architecture and application characteristics, we implement a synergistic approach of bandwidth partitioning, link compression, decoupled data movement of multiple granularities, and adaptive granularity selection in data movements. We evaluate DaeMon in a wide variety of workloads at different network and architecture configurations using a state-of-the-art accurate simulator. DaeMon improves system performance and data access costs by 2.39$\times$ and 3.06$\times$, respectively, over the widely-adopted approach of moving data at page granularity.
Submitted 18 January, 2023; v1 submitted 1 January, 2023;
originally announced January 2023.
-
Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems
Authors:
Christina Giannoula,
Ivan Fernandez,
Juan Gómez-Luna,
Nectarios Koziris,
Georgios Goumas,
Onur Mutlu
Abstract:
Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures. Near-bank PIM architectures place simple cores close to DRAM banks and can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the widely-used, memory-bound Sparse Matrix Vector Multiplication (SpMV) kernel.
This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make two key contributions. First, we design efficient SpMV algorithms to accelerate the SpMV kernel in current and future PIM systems, while covering a wide variety of sparse matrices with diverse sparsity patterns. Second, we provide the first comprehensive analysis of SpMV on a real PIM architecture. Specifically, we conduct a rigorous experimental analysis of SpMV kernels on the UPMEM PIM system, the first publicly-available real-world PIM architecture. Our extensive evaluation provides new insights and recommendations for software designers and hardware architects to efficiently accelerate the SpMV kernel on real PIM systems. For more information about our thorough characterization of SpMV PIM execution, results, insights and the open-source SparseP software package [26], we refer the reader to the full version of the paper [3, 4]. The SparseP software package is publicly and freely available at https://github.com/CMU-SAFARI/SparseP.
Submitted 2 April, 2022;
originally announced April 2022.
-
QueryER: A Framework for Fast Analysis-Aware Deduplication over Dirty Data
Authors:
Giorgos Alexiou,
George Papastefanatos,
Vassilis Stamatopoulos,
Georgia Koutrika,
Nectarios Koziris
Abstract:
In this work, we explore the problem of correctly and efficiently answering complex SPJ queries issued directly on top of dirty data. We introduce QueryER, a framework that seamlessly integrates Entity Resolution into Query Processing. QueryER executes analysis-aware deduplication by weaving ER operators into the query plan. The experimental evaluation of our approach exhibits that it adapts to the workload and scales on both real and synthetic datasets.
Submitted 3 February, 2022;
originally announced February 2022.
-
SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems
Authors:
Christina Giannoula,
Ivan Fernandez,
Juan Gómez-Luna,
Nectarios Koziris,
Georgios Goumas,
Onur Mutlu
Abstract:
Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures. Near-bank PIM architectures place simple cores close to DRAM banks and can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the widely-used, memory-bound Sparse Matrix Vector Multiplication (SpMV) kernel.
This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies on SpMV for a multithreaded PIM core and characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores, and two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels to perform the complete SpMV computation only using PIM cores, and (2) 2D-partitioned kernels to strike a balance between computation and data transfer costs to PIM-enabled memory. Third, we compare SpMV execution on a real-world PIM system with 2528 PIM cores to state-of-the-art CPU and GPU systems to study the performance and energy efficiency of various devices. The SparseP software package provides 25 SpMV kernels for real PIM systems supporting the four most widely used compressed matrix formats, and a wide range of data types. Our extensive evaluation provides new insights and recommendations for software designers and hardware architects to efficiently accelerate SpMV on real PIM systems.
Submitted 23 May, 2022; v1 submitted 13 January, 2022;
originally announced January 2022.
-
SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures
Authors:
Christina Giannoula,
Nandita Vijaykumar,
Nikela Papadopoulou,
Vasileios Karakostas,
Ivan Fernandez,
Juan Gómez-Luna,
Lois Orosa,
Nectarios Koziris,
Georgios Goumas,
Onur Mutlu
Abstract:
Near-Data-Processing (NDP) architectures present a promising way to alleviate data movement costs and can provide significant performance and energy benefits to parallel applications. Typically, NDP architectures support several NDP units, each including multiple simple cores placed close to memory. To fully leverage the benefits of NDP and achieve high performance for parallel workloads, efficient synchronization among the NDP cores of a system is necessary. However, supporting synchronization in many NDP systems is challenging because they lack shared caches and hardware cache coherence support, which are commonly used for synchronization in multicore systems, and communication across different NDP units can be expensive.
This paper comprehensively examines the synchronization problem in NDP systems, and proposes SynCron, an end-to-end synchronization solution for NDP systems. SynCron adds low-cost hardware support near memory for synchronization acceleration, and avoids the need for hardware cache coherence support. SynCron has three components: 1) a specialized cache memory structure to avoid memory accesses for synchronization and minimize latency overheads, 2) a hierarchical message-passing communication protocol to minimize expensive communication across NDP units of the system, and 3) a hardware-only overflow management scheme to avoid performance degradation when hardware resources for synchronization tracking are exceeded.
We evaluate SynCron using a variety of parallel workloads, covering various contention scenarios. SynCron improves performance by 1.27$\times$ on average (up to 1.78$\times$) under high-contention scenarios, and by 1.35$\times$ on average (up to 2.29$\times$) under low-contention real applications, compared to state-of-the-art approaches. SynCron reduces system energy consumption by 2.08$\times$ on average (up to 4.25$\times$).
Submitted 13 February, 2021; v1 submitted 19 January, 2021;
originally announced January 2021.
-
Enabling Virtual Memory Research on RISC-V with a Configurable TLB Hierarchy for the Rocket Chip Generator
Authors:
Nikolaos Charalampos Papadopoulos,
Vasileios Karakostas,
Konstantinos Nikas,
Nectarios Koziris,
Dionisios N. Pnevmatikatos
Abstract:
The Rocket Chip Generator uses a collection of parameterized processor components to produce RISC-V-based SoCs. It is a powerful tool that can produce a wide variety of processor designs ranging from tiny embedded processors to complex multi-core systems. In this paper we extend the features of the Memory Management Unit of the Rocket Chip Generator and specifically the TLB hierarchy. TLBs are essential in terms of performance because they mitigate the overhead of frequent Page Table Walks, but may harm the critical path of the processor due to their size and/or associativity. In the original Rocket Chip implementation the L1 Instruction/Data TLB is fully-associative and the shared L2 TLB is direct-mapped. We lift these restrictions and design and implement configurable, set-associative L1 and L2 TLB templates that can create any organization from direct-mapped to fully-associative to achieve the desired ratio of performance and resource utilization, especially for larger TLBs. We evaluate different TLB configurations and present performance, area, and frequency results of our design using benchmarks from the SPEC2006 suite on the Xilinx ZCU102 FPGA.
Submitted 16 September, 2020;
originally announced September 2020.
-
Graph Operator Modeling over Large Graph Datasets
Authors:
Tasos Bakogiannis,
Ioannis Giannakopoulos,
Dimitrios Tsoumakos,
Nectarios Koziris
Abstract:
As graph representations of data emerge in multiple domains, data analysts need to be able to intelligently select among a multitude of different data graphs based on the effects different graph operators have on them. Exhaustive execution of an operator over the bulk of available data sources is impractical due to the massive resources it requires. Additionally, the same process would have to be re-implemented whenever a different operator is considered. To address this challenge, this work proposes an efficient graph operator modeling methodology. Our novel approach focuses on the inputs themselves, utilizing graph similarity to infer knowledge about input graphs. The modeled operator is only executed for a small subset of the available graphs and its behavior is approximated for the rest of the graphs using machine learning techniques. Our method is operator-agnostic, as the same similarity information can be reused for modeling multiple graph operators. We also propose a family of similarity measures based on the degree distribution that prove capable of producing high quality estimations, comparable to or even surpassing other much more costly, state-of-the-art similarity measures. Our evaluation over both real-world and synthetic graphs indicates that our method achieves extremely accurate modeling of many commonly encountered operators, managing massive speedups over a brute-force alternative.
Submitted 20 August, 2018; v1 submitted 15 February, 2018;
originally announced February 2018.
-
Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors
Authors:
Athena Elafrou,
Georgios Goumas,
Nektarios Koziris
Abstract:
This paper presents a low-overhead optimizer for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. Architectural diversity among different processors together with structural diversity among different sparse matrices lead to bottleneck diversity. This justifies an SpMV optimizer that is both matrix- and architecture-adaptive through runtime specialization. To this end, we present an approach that first identifies the performance bottlenecks of SpMV for a given sparse matrix on the target platform either through profiling or by matrix property inspection, and then selects suitable optimizations to tackle those bottlenecks. Our optimization pool is based on the widely used Compressed Sparse Row (CSR) sparse matrix storage format and has low preprocessing overheads, making our overall approach practical even in cases where fast decision making and optimization setup are required. We evaluate our optimizer on three x86-based computing platforms and demonstrate that it is able to distinguish and appropriately optimize SpMV for the majority of matrices in a representative test suite, leading to significant speedups over the CSR and Inspector-Executor CSR SpMV kernels available in the latest release of the Intel MKL library.
Submitted 15 November, 2017;
originally announced November 2017.
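For readers unfamiliar with the Compressed Sparse Row (CSR) format named in the abstract above, the following is a minimal sketch of the baseline SpMV kernel over CSR. It is a plain reference loop for illustration only, not the optimizer described in the paper; the function and parameter names (`spmv_csr`, `row_ptr`, `col_idx`, `values`) are our own.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i + 1] delimits the nonzeros of row i;
    col_idx[k] and values[k] give the column and value of nonzero k.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access to x through col_idx is the irregular,
            # memory-bound part of the kernel.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

For the 3x3 matrix with rows (1, 0, 2), (0, 3, 0) and (4, 0, 5), the CSR arrays are `row_ptr = [0, 2, 3, 5]`, `col_idx = [0, 2, 1, 0, 2]` and `values = [1, 2, 3, 4, 5]`.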
-
A Decision Tree Based Approach Towards Adaptive Profiling of Distributed Applications
Authors:
Ioannis Giannakopoulos,
Dimitrios Tsoumakos,
Nectarios Koziris
Abstract:
The adoption of the distributed paradigm has allowed applications to increase their scalability, robustness and fault tolerance, but it has also complicated their structure, leading to an exponential growth of the applications' configuration space and increased difficulty in predicting their performance. In this work, we describe a novel, automated profiling methodology that makes no assumptions on application structure. Our approach utilizes oblique Decision Trees in order to recursively partition an application's configuration space in disjoint regions, choose a set of representative samples from each subregion according to a defined policy and return a model for the entire space as a composition of linear models over each subregion. An extensive evaluation over real-life applications and synthetic performance functions showcases that our scheme outperforms other state-of-the-art profiling methodologies. It particularly excels at reflecting abnormalities and discontinuities of the performance function, allowing the user to influence the sampling policy based on the modeling accuracy and the space coverage.
Submitted 21 May, 2017; v1 submitted 10 April, 2017;
originally announced April 2017.
-
Elastic Resource Management with Adaptive State Space Partitioning of Markov Decision Processes
Authors:
Konstantinos Lolos,
Ioannis Konstantinou,
Verena Kantere,
Nectarios Koziris
Abstract:
Modern large-scale computing deployments consist of complex applications running over machine clusters. An important issue in these is the offering of elasticity, i.e., the dynamic allocation of resources to applications to meet fluctuating workload demands. Threshold-based approaches are typically employed, yet they are difficult to configure and optimize. Approaches based on reinforcement learning have been proposed, but they require a large number of states in order to model complex application behavior. Methods that adaptively partition the state space have been proposed, but their partitioning criteria and strategies are sub-optimal. In this work we present MDP_DT, a novel full-model based reinforcement learning algorithm for elastic resource management that employs adaptive state space partitioning. We propose two novel statistical criteria and three strategies and we experimentally show that they correctly decide both where and when to partition, outperforming existing approaches. We experimentally evaluate MDP_DT in a real large-scale cluster over variable, previously unencountered workloads and we show that it makes more informed decisions compared to static and model-free approaches, while requiring a minimal amount of training data.
Submitted 9 February, 2017;
originally announced February 2017.
-
Improving virtual host efficiency through resource and interference aware scheduling
Authors:
Evangelos Angelou,
Konstantinos Kaffes,
Athanasia Asiki,
Georgios Goumas,
Nectarios Koziris
Abstract:
Modern Infrastructure-as-a-Service Clouds operate in a competitive environment that caters to any user's requirements for computing resources. The sharing of the various types of resources by diverse applications poses a series of challenges in order to optimize resource utilization while avoiding performance degradation caused by application interference. In this paper, we present two scheduling methodologies enforcing consolidation techniques on multicore physical machines. Our resource-aware and interference-aware scheduling schemes aim at improving physical host efficiency while preserving the application performance by taking into account host oversubscription and the resulting workload interference. We validate our fully operational framework through a set of real-life workloads representing a wide class of modern cloud applications. The experimental results prove the efficiency of our system in optimizing resource utilization and thus energy consumption even in the presence of oversubscription. Both methodologies achieve significant reductions of the CPU time consumed, reaching up to 50%, while at the same time maintaining workload performance compared to widely used scheduling schemes under a variety of representative cloud platform scenarios.
Submitted 27 January, 2016;
originally announced January 2016.
-
A lightweight optimization selection method for Sparse Matrix-Vector Multiplication
Authors:
Athena Elafrou,
Georgios Goumas,
Nectarios Koziris
Abstract:
In this paper, we propose an optimization selection methodology for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. We propose two models that attempt to identify the major performance bottleneck of the kernel for every instance of the problem and then select an appropriate optimization to tackle it. Our first model requires online profiling of the input matrix in order to detect its most prevalent performance issue, while our second model only uses comprehensive structural features of the sparse matrix. Our method delivers high performance stability for SpMV across different platforms and sparse matrices, due to its application and architecture awareness. Our experimental results demonstrate that our approach a) is able to distinguish and appropriately optimize special matrices in multicore platforms that fall out of the standard class of memory bandwidth bound matrices, and b) leads to a significant performance gain of 29% in a manycore platform compared to an architecture-centric optimization, as a result of the successful selection of the appropriate optimization for the great majority of the matrices. With a runtime overhead equivalent to a couple dozen SpMV operations, our approach is practical for use in iterative numerical solvers of real-life applications.
Submitted 10 January, 2016; v1 submitted 8 November, 2015;
originally announced November 2015.