Search | arXiv e-print repository

Towards Hardware Support for FPGA Resource Elasticity

Authors: Ahsan Javed Awan, Fidan Aliyeva

Abstract: FPGAs are increasingly being deployed in the cloud to accelerate diverse applications. They are to be shared among multiple tenants to improve the total cost of ownership. Partial reconfiguration technology enables multi-tenancy on FPGA by partitioning it into regions, each hosting a specific application's accelerator. However, the region's size can not be changed once they are defined, resulting… ▽ More FPGAs are increasingly being deployed in the cloud to accelerate diverse applications. They are to be shared among multiple tenants to improve the total cost of ownership. Partial reconfiguration technology enables multi-tenancy on FPGA by partitioning it into regions, each hosting a specific application's accelerator. However, the region's size can not be changed once they are defined, resulting in the underutilization of FPGA resources. This paper argues to divide the acceleration requirements of an application into multiple small computation modules. The devised FPGA shell can reconfigure the available PR regions with those modules and enable them to communicate with each other over Crossbar interconnect with the Wishbone bus interface. For each PR region being reconfigured, it updates the register file with the valid destination addresses and the bandwidth allocation of the interconnect. Any invalid communication request originating from the Wishbone master interface is masked in the corresponding master port of the crossbar. The allocated bandwidth for the PR region is ensured by the weighted round-robin arbiter in the slave port of the crossbar. Finally, the envisioned resource manager can increase or decrease the number of PR regions allocated to an application based on its acceleration requirements and PR regions' availability. △ Less

Submitted 4 July, 2022; v1 submitted 23 December, 2021; originally announced December 2021.

Comments: Preprint of paper presented at Euromicro Conference on Digital System Design (DSD'22)

arXiv:2106.15284 [pdf, other]

NMPO: Near-Memory Computing Profiling and Offloading

Authors: Stefano Corda, Madhurya Kumaraswamy, Ahsan Javed Awan, Roel Jordans, Akash Kumar, Henk Corporaal

Abstract: Real-world applications are now processing big-data sets, often bottlenecked by the data movement between the compute units and the main memory. Near-memory computing (NMC), a modern data-centric computational paradigm, can alleviate these bottlenecks, thereby improving the performance of applications. The lack of NMC system availability makes simulators the primary evaluation tool for performance… ▽ More Real-world applications are now processing big-data sets, often bottlenecked by the data movement between the compute units and the main memory. Near-memory computing (NMC), a modern data-centric computational paradigm, can alleviate these bottlenecks, thereby improving the performance of applications. The lack of NMC system availability makes simulators the primary evaluation tool for performance estimation. However, simulators are usually time-consuming, and methods that can reduce this overhead would accelerate the early-stage design process of NMC systems. This work proposes Near-Memory computing Profiling and Offloading (NMPO), a high-level framework capable of predicting NMC offloading suitability employing an ensemble machine learning model. NMPO predicts NMC suitability with an accuracy of 85.6% and, compared to prior works, can reduce the prediction time by using hardware-dependent applications features by up to 3 order of magnitude. △ Less

Submitted 29 June, 2021; originally announced June 2021.

Comments: Euromicro Conference on Digital System Design 2021

arXiv:2005.04098 [pdf, other]

Near Memory Acceleration on High Resolution Radio Astronomy Imaging

Authors: Stefano Corda, Bram Veenboer, Ahsan Javed Awan, Akash Kumar, Roel Jordans, Henk Corporaal

Abstract: Modern radio telescopes like the Square Kilometer Array (SKA) will need to process in real-time exabytes of radio-astronomical signals to construct a high-resolution map of the sky. Near-Memory Computing (NMC) could alleviate the performance bottlenecks due to frequent memory accesses in a state-of-the-art radio-astronomy imaging algorithm. In this paper, we show that a sub-module performing a two… ▽ More Modern radio telescopes like the Square Kilometer Array (SKA) will need to process in real-time exabytes of radio-astronomical signals to construct a high-resolution map of the sky. Near-Memory Computing (NMC) could alleviate the performance bottlenecks due to frequent memory accesses in a state-of-the-art radio-astronomy imaging algorithm. In this paper, we show that a sub-module performing a two-dimensional fast Fourier transform (2D FFT) is memory bound using CPI breakdown analysis on IBM Power9. Then, we present an NMC approach on FPGA for 2D FFT that outperforms a CPU by up to a factor of 120x and performs comparably to a high-end GPU, while using less bandwidth and memory. △ Less

Submitted 4 May, 2020; originally announced May 2020.

arXiv:1909.12183 [pdf, other]

K-Means Clustering on Noisy Intermediate Scale Quantum Computers

Authors: Sumsam Ullah Khan, Ahsan Javed Awan, Gemma Vall-Llosera

Abstract: Real-time clustering of big performance data generated by the telecommunication networks requires domain-specific high performance compute infrastructure to detect anomalies. In this paper, we evaluate noisy intermediate-scale quantum (NISQ) computers characterized by low decoherence times, for K-means clustering and propose three strategies to generate shorter-depth quantum circuits needed to ove… ▽ More Real-time clustering of big performance data generated by the telecommunication networks requires domain-specific high performance compute infrastructure to detect anomalies. In this paper, we evaluate noisy intermediate-scale quantum (NISQ) computers characterized by low decoherence times, for K-means clustering and propose three strategies to generate shorter-depth quantum circuits needed to overcome the limitation of NISQ computers. The strategies are based on exploiting; i) quantum interference, ii) negative rotations and iii) destructive interference. By comparing our implementations on IBMQX2 machine for representative data sets, we show that NISQ computers can solve the K-means clustering problem with the same level of accuracy as that of classical computers. △ Less

Submitted 26 September, 2019; originally announced September 2019.

arXiv:1909.11988 [pdf, other]

Support Vector Machines on Noisy Intermediate Scale Quantum Computers

Authors: Jiaying Yang, Ahsan Javed Awan, Gemma Vall-Llosera

Abstract: Support vector machine algorithms are considered essential for the implementation of automation in a radio access network. Specifically, they are critical in the prediction of the quality of user experience for video streaming based on device and network-level metrics. Quantum SVM is the quantum analogue of the classical SVM algorithm, which utilizes the properties of quantum computers to speed up… ▽ More Support vector machine algorithms are considered essential for the implementation of automation in a radio access network. Specifically, they are critical in the prediction of the quality of user experience for video streaming based on device and network-level metrics. Quantum SVM is the quantum analogue of the classical SVM algorithm, which utilizes the properties of quantum computers to speed up the algorithm exponentially. In this work, we derive an optimized preprocessing unit for a quantum SVM that allows classifying any two-dimensional datasets that are linearly separable. We further provide a result readout method of the kernel matrix generation circuit to avoid quantum tomography that, in turn, reduces the quantum circuit depth. We also derive a quantum SVM system based on an optimized HHL quantum circuit with reduced circuit depth. △ Less

Submitted 26 September, 2019; originally announced September 2019.

arXiv:1908.02640 [pdf, other]

Near-Memory Computing: Past, Present, and Future

Authors: Gagandeep Singh, Lorenzo Chelini, Stefano Corda, Ahsan Javed Awan, Sander Stuijk, Roel Jordans, Henk Corporaal, Albert-Jan Boonstra

Abstract: The conventional approach of moving data to the CPU for computation has become a significant performance bottleneck for emerging scale-out data-intensive applications due to their limited data reuse. At the same time, the advancement in 3D integration technologies has made the decade-old concept of coupling compute units close to the memory --- called near-memory computing (NMC) --- more viable. P… ▽ More The conventional approach of moving data to the CPU for computation has become a significant performance bottleneck for emerging scale-out data-intensive applications due to their limited data reuse. At the same time, the advancement in 3D integration technologies has made the decade-old concept of coupling compute units close to the memory --- called near-memory computing (NMC) --- more viable. Processing right at the "home" of data can significantly diminish the data movement problem of data-intensive applications. In this paper, we survey the prior art on NMC across various dimensions (architecture, applications, tools, etc.) and identify the key challenges and open issues with future research directions. We also provide a glimpse of our approach to near-memory computing that includes i) NMC specific microarchitecture independent application characterization ii) a compiler framework to offload the NMC kernels on our target NMC platform and iii) an analytical model to evaluate the potential of NMC. △ Less

Submitted 7 August, 2019; originally announced August 2019.

Comments: Preprint

arXiv:1906.10037 [pdf, ps, other]

Platform Independent Software Analysis for Near Memory Computing

Authors: Stefano Corda, Gagandeep Singh, Ahsan Javed Awan, Roel Jordans, Henk Corporaal

Abstract: Near-memory Computing (NMC) promises improved performance for the applications that can exploit the features of emerging memory technologies such as 3D-stacked memory. However, it is not trivial to find such applications and specialized tools are needed to identify them. In this paper, we present PISA-NMC, which extends a state-of-the-art hardware agnostic profiling tool with metrics concerning me… ▽ More Near-memory Computing (NMC) promises improved performance for the applications that can exploit the features of emerging memory technologies such as 3D-stacked memory. However, it is not trivial to find such applications and specialized tools are needed to identify them. In this paper, we present PISA-NMC, which extends a state-of-the-art hardware agnostic profiling tool with metrics concerning memory and parallelism, which are relevant for NMC. The metrics include memory entropy, spatial locality, data-level, and basic-block-level parallelism. By profiling a set of representative applications and correlating the metrics with the application's performance on a simulated NMC system, we verify the importance of those metrics. Finally, we demonstrate which metrics are useful in identifying applications suitable for NMC architectures. △ Less

Submitted 24 June, 2019; originally announced June 2019.

Journal ref: Euromicro Conference on Digital System Design (DSD) 2019

arXiv:1904.08762 [pdf, other]

doi 10.1145/3323439.3323988

Memory and Parallelism Analysis Using a Platform-Independent Approach

Authors: Stefano Corda, Gagandeep Singh, Ahsan Javed Awan, Roel Jordans, Henk Corporaal

Abstract: Emerging computing architectures such as near-memory computing (NMC) promise improved performance for applications by reducing the data movement between CPU and memory. However, detecting such applications is not a trivial task. In this ongoing work, we extend the state-of-the-art platform-independent software analysis tool with NMC related metrics such as memory entropy, spatial locality, data-le… ▽ More Emerging computing architectures such as near-memory computing (NMC) promise improved performance for applications by reducing the data movement between CPU and memory. However, detecting such applications is not a trivial task. In this ongoing work, we extend the state-of-the-art platform-independent software analysis tool with NMC related metrics such as memory entropy, spatial locality, data-level, and basic-block-level parallelism. These metrics help to identify the applications more suitable for NMC architectures. △ Less

Submitted 18 April, 2019; originally announced April 2019.

Comments: 22nd ACM International Workshop on Software and Compilers for Embedded Systems (SCOPES '19), May 2019

arXiv:1707.09323 [pdf, other]

Identifying the potential of Near Data Computing for Apache Spark

Authors: Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, Eduard Ayguade

Abstract: While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both, batch and stream data processing. There is also a renewed interest is Near Data Computing (NDC) due to technological advancement in the last decade. However, it is not known if NDC… ▽ More While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both, batch and stream data processing. There is also a renewed interest is Near Data Computing (NDC) due to technological advancement in the last decade. However, it is not known if NDC architectures can improve the performance of big data processing frameworks such as Apache Spark. In this position paper, we hypothesize in favour of NDC architecture comprising programmable logic based hybrid 2D integrated processing-in-memory and in-storage processing for Apache Spark, by extensive profiling of Apache Spark based workloads on Ivy Bridge Server. △ Less

Submitted 8 May, 2017; originally announced July 2017.

Comments: position paper

arXiv:1604.08484 [pdf, other]

Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

Authors: Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, Eduard Ayguade

Abstract: While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both, batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to only batch processing workloads. We com… ▽ More While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both, batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to only batch processing workloads. We compare micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual socket server. In our evaluation experiments, we have found that batch processing are stream processing workloads have similar micro-architectural characteristics and are bounded by the latency of frequent data access to DRAM. For data accesses we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve the performance by 10% on average and(ii) disabling next-line L1-D prefetchers can reduce the execution time by up-to 14\% and (iii) multiple small executors can provide up-to 36\% speedup over single large executor. △ Less

Submitted 28 April, 2016; originally announced April 2016.

arXiv:1507.08340 [pdf, other]

How Data Volume Affects Spark Based Data Analytics on a Scale-up Server

Authors: Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, Eduard Ayguade

Abstract: Sheer increase in volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on the commodity machines, the impact of data volume on the performance of Spark based data analytics in scale-up configuration is not we… ▽ More Sheer increase in volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on the commodity machines, the impact of data volume on the performance of Spark based data analytics in scale-up configuration is not well understood. We present a deep-dive analysis of Spark based applications on a large scale-up server machine. Our analysis reveals that Spark based data analytics are DRAM bound and do not benefit by using more than 12 cores for an executor. By enlarging input data size, application performance degrades significantly due to substantial increase in wait time during I/O operations and garbage collection, despite 10\% better instruction retirement rate (due to lower L1 cache misses and higher core utilization). We match memory behaviour with the garbage collector to improve performance of applications between 1.6x to 3x. △ Less

Submitted 29 July, 2015; originally announced July 2015.

Comments: accepted to 6th International Workshop on Big Data Benchmarks, Performance Optimization and Emerging Hardware (BpoE-6) held in conjunction with VLDB 2015. arXiv admin note: text overlap with arXiv:1506.07742

arXiv:1506.07742 [pdf, other]

doi 10.1109/BDCloud.2015.37

Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server

Authors: Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, Eduard Ayguade

Abstract: In last decade, data analytics have rapidly progressed from traditional disk-based processing to modern in-memory processing. However, little effort has been devoted at enhancing performance at micro-architecture level. This paper characterizes the performance of in-memory data analytics using Apache Spark framework. We use a single node NUMA machine and identify the bottlenecks hampering the scal… ▽ More In last decade, data analytics have rapidly progressed from traditional disk-based processing to modern in-memory processing. However, little effort has been devoted at enhancing performance at micro-architecture level. This paper characterizes the performance of in-memory data analytics using Apache Spark framework. We use a single node NUMA machine and identify the bottlenecks hampering the scalability of workloads. We also quantify the inefficiencies at micro-architecture level for various data analysis workloads. Through empirical evaluation, we show that spark workloads do not scale linearly beyond twelve threads, due to work time inflation and thread level load imbalance. Further, at the micro-architecture level, we observe memory bound latency to be the major cause of work time inflation. △ Less

Submitted 25 June, 2015; originally announced June 2015.

Comments: Accepted to The 5th IEEE International Conference on Big Data and Cloud Computing (BDCloud 2015)

Showing 1–12 of 12 results for author: Awan, A J