-
Using Artificial Intelligence and IoT for Constructing a Smart Trash Bin
Authors:
Khang Nhut Lam,
Nguyen Hoang Huynh,
Nguyen Bao Ngoc,
To Thi Huynh Nhu,
Nguyen Thanh Thao,
Pham Hoang Hao,
Vo Van Kiet,
Bui Xuan Huynh,
Jugal Kalita
Abstract:
The research reported in this paper transforms a normal trash bin into a smarter one by applying computer vision technology. With the support of sensors and actuator devices, the trash bin can automatically classify garbage. In particular, a camera on the trash bin takes pictures of the trash, and the central processing unit then analyzes them and decides which bin to drop the trash into. Our trash bin system achieves an accuracy of 90%. In addition, our model is connected to the Internet to update the bin status for further management. A mobile application is developed for managing the bin.
Submitted 12 August, 2022;
originally announced August 2022.
-
High-performance computing for super-resolution microscopy on a cluster of computers
Authors:
Quan Do,
Jon Ivar Kristiansen,
Krishna Agarwal,
Phuong Hoai Ha
Abstract:
The multiple signal classification algorithm (MUSICAL) provides a super-resolution microscopy method. In previous research, MUSICAL exploited data parallelism well on a desktop computer or a Linux-based server. However, the running time needs to be shorter. This paper develops a new parallel MUSICAL with high efficiency and scalability on a cluster of computers. We achieve this by using the optimal speed of the cluster cores, the latest parallel programming techniques, and high-performance computing libraries such as the Intel Threading Building Blocks (TBB), the Intel Math Kernel Library (MKL), and unified parallel C++ (UPC++) for the cluster of computers. Our experimental results show that the new parallel MUSICAL achieves a speed-up of 240.29x within 10 seconds on the 256-core cluster with an efficiency of 93.86%. Our MUSICAL offers real-life applications a realistic prospect of performing super-resolution microscopy within seconds.
Submitted 13 June, 2022; v1 submitted 7 June, 2022;
originally announced June 2022.
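The speed-up and efficiency figures quoted in the abstract above are consistent with the usual definition of parallel efficiency, namely speed-up divided by the number of cores. The abstract does not state the formula, so the following is a minimal check under that assumption; `parallel_efficiency` is an illustrative helper name, not something from the paper.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Return parallel efficiency as a fraction: speed-up divided by core count.

    Assumption: the standard textbook definition, not a formula quoted
    from the paper.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


# Figures taken from the abstract: 240.29x speed-up on a 256-core cluster.
efficiency = parallel_efficiency(240.29, 256)
print(f"{efficiency:.2%}")  # prints 93.86%
```

Under this definition, an efficiency close to 100% means that the speed-up grows almost linearly with the number of cores.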
-
CDN-MEDAL: Two-stage Density and Difference Approximation Framework for Motion Analysis
Authors:
Synh Viet-Uyen Ha,
Cuong Tien Nguyen,
Hung Ngoc Phan,
Nhat Minh Chung,
Phuong Hoai Ha
Abstract:
Background modeling and subtraction is a promising research area with a variety of applications for video surveillance. Recent years have witnessed a proliferation of effective learning-based deep neural networks in this area. However, the techniques have only provided limited descriptions of scenes' properties while requiring heavy computations, as their single-valued mapping functions are learned to approximate the temporal conditional averages of observed target backgrounds and foregrounds. On the other hand, statistical learning in imagery domains has been a prevalent approach with high adaptation to dynamic context transformation, notably using Gaussian Mixture Models (GMM) with their generalization capabilities. By leveraging both, we propose a novel method called CDN-MEDAL-net for background modeling and subtraction with two convolutional neural networks. The first architecture, CDN-GM, is grounded on an unsupervised GMM statistical learning strategy to describe observed scenes' salient features. The second one, MEDAL-net, implements a lightweight pipeline of online video background subtraction. Our two-stage architecture is small, but it is very effective with rapid convergence to representations of intricate motion patterns. Our experiments show that the proposed approach is not only capable of effectively extracting regions of moving objects in unseen cases, but it is also very efficient.
Submitted 21 September, 2021; v1 submitted 7 June, 2021;
originally announced June 2021.
-
A Newcomer In The PGAS World -- UPC++ vs UPC: A Comparative Study
Authors:
Jérémie Lagravière,
Johannes Langguth,
Martina Prugger,
Phuong H. Ha,
Xing Cai
Abstract:
A newcomer in the Partitioned Global Address Space (PGAS) 'world' has arrived in its version 1.0: Unified Parallel C++ (UPC++). UPC++ targets distributed data structures where communication is irregular or fine-grained. The key abstractions are global pointers, asynchronous programming via RPC, futures and promises. The UPC++ APIs for moving non-contiguous data and handling memories with different optimal access methods resemble those used in modern C++. In this study we provide two kernels implemented in UPC++: a sparse-matrix vector multiplication (SpMV) as part of a Partial-Differential Equation solver, and an implementation of the Heat Equation on a 2D-domain. Code listings of these two kernels are available in the article in order to show the differences in programming style between UPC and UPC++. We provide a performance comparison between UPC and UPC++ on single-node, multi-node and many-core hardware (Intel Xeon Phi Knights Landing).
Submitted 6 February, 2021;
originally announced February 2021.
-
Performance optimization and modeling of fine-grained irregular communication in UPC
Authors:
Jérémie Lagravière,
Johannes Langguth,
Martina Prugger,
Lukas Einkemmer,
Phuong H. Ha,
Xing Cai
Abstract:
The UPC programming language offers parallelism via logically partitioned shared memory, which typically spans physically disjoint memory sub-systems. One convenient feature of UPC is its ability to automatically execute between-thread data movement, such that the entire content of a shared data array appears to be freely accessible by all the threads. This programmer friendliness, however, can come at the cost of substantial performance penalties. This is especially true when indirectly indexing the elements of a shared array, for which the induced between-thread data communication can be irregular and have a fine-grained pattern. In this paper we study performance enhancement strategies specifically targeting such fine-grained irregular communication in UPC. Starting from explicit thread privatization, continuing with block-wise communication, and arriving at message condensing and consolidation, we obtain considerable performance improvements for UPC programs that originally require fine-grained irregular communication. Besides the performance enhancement strategies, the main contribution of the present paper is to propose performance models for the different scenarios, in the form of quantifiable formulas that hinge on the actual volumes of various data movements plus a small number of easily obtainable hardware characteristic parameters. These performance models help to verify the enhancements obtained, while also providing insightful predictions of similar parallel implementations, not limited to UPC, that also involve between-thread or between-process irregular communication. As a further validation, we also apply our performance modeling methodology and hardware characteristic parameters to an existing UPC code for solving a 2D heat equation on a uniform mesh.
Submitted 29 December, 2019;
originally announced December 2019.
-
On the Performance and Energy Efficiency of the PGAS Programming Model on Multicore Architectures
Authors:
Jérémie Lagravière,
Johannes Langguth,
Mohammed Sourouri,
Phuong H. Ha,
Xing Cai
Abstract:
Using large-scale multicore systems to get the maximum performance and energy efficiency with manageable programmability is a major challenge. The partitioned global address space (PGAS) programming model enhances programmability by providing a global address space over large-scale computing systems. However, so far the performance and energy efficiency of the PGAS model on multicore-based parallel architectures have not been investigated thoroughly. In this paper we use a set of selected kernels from the well-known NAS Parallel Benchmarks to evaluate the performance and energy efficiency of the UPC programming language, which is a widely used implementation of the PGAS model. In addition, the MPI and OpenMP versions of the same parallel kernels are used for comparison with their UPC counterparts. The investigated hardware platforms are based on multicore CPUs, both within a single 16-core node and across multiple nodes involving up to 1024 physical cores. On the multi-node platform we used the hardware measurement solution called High definition Energy Efficiency Monitoring tool in order to measure energy. On the single-node system we used the hybrid measurement solution. To help understand the observed performance differences, we use the Intel Performance Counter Monitor to quantify in detail the communication time, cache hit/miss ratio and memory usage. Our experiments show that UPC is competitive with OpenMP and MPI on single and multiple nodes, with respect to both performance and energy efficiency.
Submitted 29 December, 2019;
originally announced December 2019.
-
HyperProv: Decentralized Resilient Data Provenance at the Edge with Blockchains
Authors:
Petter Tunstad,
Amin M. Khan,
Phuong Hoai Ha
Abstract:
Data provenance and lineage are critical for ensuring integrity and reproducibility of information in research and application. This is particularly challenging for distributed scenarios, where data may originate from decentralized sources without any central control by a single trusted entity. We present HyperProv, a general framework for data provenance based on the permissioned blockchain Hyperledger Fabric (HLF), and to the best of our knowledge, the first system that is ported to ARM-based devices such as Raspberry Pi (RPi). HyperProv tracks the metadata, operation history and data lineage through a set of built-in queries using smart contracts, enabling lightweight retrieval of provenance data. HyperProv provides convenient integration through a NodeJS client library, and also includes off-chain storage through the SSH file system. We evaluate HyperProv's performance, throughput, resource consumption, and energy efficiency on x86-64 machines, as well as on RPi devices for IoT use cases at the edge.
Submitted 13 October, 2019;
originally announced October 2019.
-
D2.4 Report on the final prototype of programming abstractions for energy-efficient inter-process communication
Authors:
Phuong Hoai Ha,
Vi Ngoc-Nha Tran,
Ibrahim Umar,
Aras Atalar,
Anders Gidenstam,
Paul Renaud-Goud,
Philippas Tsigas,
Ivan Walulya
Abstract:
Work package 2 (WP2) aims to develop libraries for energy-efficient inter-process communication and data sharing on the EXCESS platforms. Deliverable D2.4 reports on the final prototype of programming abstractions for energy-efficient inter-process communication. Section 1 is the updated overview of the prototype of programming abstractions and the devised power/energy models. Sections 2-6 contain the latest results of the four studies: i) GreenBST, an energy-efficient and concurrent search tree (cf. Section 2); ii) Customization methodology for implementation of streaming aggregation in embedded systems (cf. Section 3); iii) Energy Model on CPU for Lock-free Data-structures in Dynamic Environments (cf. Section 4.10); iv) A General and Validated Energy Complexity Model for Multithreaded Algorithms (cf. Section 5).
Submitted 8 February, 2018;
originally announced February 2018.
-
D2.3 Power models, energy models and libraries for energy-efficient concurrent data structures and algorithms
Authors:
Phuong Hoai Ha,
Vi Ngoc-Nha Tran,
Ibrahim Umar,
Aras Atalar,
Anders Gidenstam,
Paul Renaud-Goud,
Philippas Tsigas,
Ivan Walulya
Abstract:
This deliverable reports the results of the power models, energy models and libraries for energy-efficient concurrent data structures and algorithms as available by project month 30 of Work Package 2 (WP2). It reports i) the latest results of Task 2.2-2.4 on providing programming abstractions and libraries for developing energy-efficient data structures and algorithms and ii) the improved results of Task 2.1 on investigating and modeling the trade-off between energy and performance of concurrent data structures and algorithms. The work has been conducted on two main EXCESS platforms: Intel platforms with recent Intel multicore CPUs and Movidius Myriad platforms.
Submitted 8 February, 2018; v1 submitted 31 January, 2018;
originally announced January 2018.
-
REOH: Using Probabilistic Network for Runtime Energy Optimization of Heterogeneous Systems
Authors:
Vi Ngoc-Nha Tran,
Tommy Oines,
Alexander Horsch,
Phuong Hoai Ha
Abstract:
Significant efforts have been devoted to choosing the best configuration of a computing system to run an application energy-efficiently. However, available tuning approaches mainly focus on homogeneous systems and cannot be extended to heterogeneous systems, which include several components (e.g., CPUs, GPUs) with different architectures. This study proposes a holistic tuning approach called REOH that uses a probabilistic network to predict the most energy-efficient configuration (i.e., which platform and its setting) of a heterogeneous system for running a given application. Based on the computation and communication patterns from Berkeley dwarfs, we conduct experiments to devise a training set of 7074 data samples covering varying application patterns and characteristics. Validating the REOH approach on heterogeneous systems including CPUs and GPUs shows that the energy consumption by the REOH approach is close to the optimal energy consumption by the Brute Force approach while saving 17% of sampling runs compared to the previous (homogeneous) approach using a probabilistic network. Based on the REOH approach, we develop an open-source energy-optimizing runtime framework for selecting an energy-efficient configuration of a heterogeneous system for a given application at runtime.
Submitted 16 September, 2018; v1 submitted 30 January, 2018;
originally announced January 2018.
-
D2.1 Models for energy consumption of data structures and algorithms
Authors:
Phuong Hoai Ha,
Vi Ngoc-Nha Tran,
Ibrahim Umar,
Philippas Tsigas,
Anders Gidenstam,
Paul Renaud-Goud,
Ivan Walulya,
Aras Atalar
Abstract:
This deliverable reports our early energy models for data structures and algorithms based on both micro-benchmarks and concurrent algorithms. It reports the early results of Task 2.1 on investigating and modeling the trade-off between energy and performance in concurrent data structures and algorithms, which forms the basis for the whole work package 2 (WP2). The work has been conducted on the two main EXCESS platforms: (1) Intel platform with recent Intel multi-core CPUs and (2) Movidius embedded platform.
Submitted 8 February, 2018; v1 submitted 29 January, 2018;
originally announced January 2018.
-
D2.2 White-box methodologies, programming abstractions and libraries
Authors:
Phuong Hoai Ha,
Vi Ngoc-Nha Tran,
Ibrahim Umar,
Aras Atalar,
Anders Gidenstam,
Paul Renaud-Goud,
Philippas Tsigas
Abstract:
This deliverable reports the results of white-box methodologies and early results of the first prototype of libraries and programming abstractions as available by project month 18 by Work Package 2 (WP2). It reports i) the latest results of Task 2.2 on white-box methodologies, programming abstractions and libraries for developing energy-efficient data structures and algorithms and ii) the improved results of Task 2.1 on investigating and modeling the trade-off between energy and performance of concurrent data structures and algorithms. The work has been conducted on two main EXCESS platforms: Intel platforms with recent Intel multicore CPUs and Movidius Myriad1 platform. Regarding white-box methodologies, we have devised new relaxed cache-oblivious models and proposed a new power model for Myriad1 platform and an energy model for lock-free queues on CPU platforms. For Myriad1 platform, the improved model now considers both computation and data movement cost as well as architecture and application properties. The model has been evaluated with a set of micro-benchmarks and application benchmarks. For Intel platforms, we have generalized the model for concurrent queues on CPU platforms to offer more flexibility according to the workers calling the data structure (parallel section sizes of enqueuers and dequeuers are decoupled). Regarding programming abstractions and libraries, we have continued investigating the trade-offs between energy consumption and performance of data structures such as concurrent queues and concurrent search trees based on the early results of Task 2.1. The preliminary results show that our concurrent trees are faster and more energy efficient than the state-of-the-art on commodity HPC and embedded platforms.
Submitted 8 February, 2018; v1 submitted 26 January, 2018;
originally announced January 2018.
-
ICE: A General and Validated Energy Complexity Model for Multithreaded Algorithms
Authors:
Vi Ngoc-Nha Tran,
Phuong Hoai Ha
Abstract:
Like time complexity models that have significantly contributed to the analysis and development of fast algorithms, energy complexity models for parallel algorithms are desired as crucial means to develop energy-efficient algorithms for ubiquitous multicore platforms. Ideal energy complexity models should be validated on real multicore platforms and applicable to a wide range of parallel algorithms. However, existing energy complexity models for parallel algorithms are either theoretical without model validation or algorithm-specific without the ability to analyze energy complexity for a wide range of parallel algorithms.
This paper presents a new general validated energy complexity model for parallel (multithreaded) algorithms. The new model abstracts away possible multicore platforms by their static and dynamic energy of computational operations and data access, and derives the energy complexity of a given algorithm from its work, span and I/O complexity. The new model is validated by different sparse matrix vector multiplication (SpMV) algorithms and dense matrix multiplication (matmul) algorithms running on high performance computing (HPC) platforms (e.g., Intel Xeon and Xeon Phi). The new energy complexity model is able to characterize and compare the energy consumption of SpMV and matmul kernels according to three aspects: different algorithms, different input matrix types and different platforms. The prediction of the new model regarding which algorithm consumes more energy with different inputs on different platforms is confirmed by the experimental results. In order to improve the usability and accuracy of the new model for a wide range of platforms, the platform parameters of the ICE model are provided for eleven platforms including HPC, accelerator and embedded platforms.
Submitted 4 October, 2016; v1 submitted 26 May, 2016;
originally announced May 2016.
-
NB-FEB: An Easy-to-Use and Scalable Universal Synchronization Primitive for Parallel Programming
Authors:
Phuong Hoai Ha,
Philippas Tsigas,
Otto J. Anshus
Abstract:
This paper addresses the problem of universal synchronization primitives that can support scalable thread synchronization for large-scale many-core architectures. The universal synchronization primitives that have been deployed widely in conventional architectures, such as CAS and LL/SC, are expected to reach their scalability limits in the evolution to many-core architectures with thousands of cores. We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallel programming on many-core architectures. We show that the NB-FEB primitive is universal, scalable, feasible and convenient to use. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes (universality). NB-FEB is combinable, namely its memory requests to the same memory location can be combined into only one memory request, which consequently mitigates performance degradation due to synchronization "hot spots" (scalability). Since NB-FEB is a variant of the original full/empty bit that always returns a value instead of waiting for a conditional flag, it is as feasible as the original full/empty bit, which has been implemented in many computer systems (feasibility). The original full/empty bit is well-known as a special-purpose primitive for fast producer-consumer synchronization and has been used extensively in that specific application domain. In this paper, we show that NB-FEB can be deployed easily as a general-purpose primitive. Using NB-FEB, we construct a non-blocking software transactional memory system called NBFEB-STM, which can be used to handle concurrent threads conveniently. NBFEB-STM is space efficient: the space complexity of each object updated by $N$ concurrent threads/transactions is $Θ(N)$, which is optimal.
Submitted 8 November, 2008;
originally announced November 2008.
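The abstract above describes NB-FEB as a full/empty bit variant that always returns a value instead of waiting on the flag. The following is a minimal software sketch of that idea only. The class name `FullEmptyWord`, the operation names `test_flag_and_set`, `store_and_clear` and `load`, and their exact semantics are illustrative assumptions, not definitions taken from the abstract, and a lock stands in for the atomicity that a hardware primitive would provide.

```python
import threading
from typing import Any, Tuple


class FullEmptyWord:
    """A memory word paired with a full/empty flag (hypothetical sketch).

    Every operation returns immediately with the previous (value, flag)
    pair; none of them waits for the flag to change.
    """

    def __init__(self, value: Any = None) -> None:
        self._value = value
        self._full = False  # the word starts out empty
        self._lock = threading.Lock()  # emulates hardware atomicity

    def test_flag_and_set(self, new_value: Any) -> Tuple[Any, bool]:
        """If the word is empty, store new_value and mark it full."""
        with self._lock:
            old = (self._value, self._full)
            if not self._full:
                self._value, self._full = new_value, True
            return old

    def store_and_clear(self, new_value: Any) -> Tuple[Any, bool]:
        """Unconditionally store new_value and mark the word empty."""
        with self._lock:
            old = (self._value, self._full)
            self._value, self._full = new_value, False
            return old

    def load(self) -> Tuple[Any, bool]:
        """Read the current (value, flag) pair without modifying it."""
        with self._lock:
            return (self._value, self._full)


word = FullEmptyWord()
first = word.test_flag_and_set("A")   # word was empty, so "A" is stored
second = word.test_flag_and_set("B")  # word is already full, so it is unchanged
print(first, second, word.load())
```

In this sketch a caller learns from the returned flag whether its write took effect, so competing threads can decide what to do next without blocking.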