-
Overview of Caching Mechanisms to Improve Hadoop Performance
Authors:
Rana Ghazali,
Douglas G. Down
Abstract:
In today's distributed computing environments, large amounts of data are generated from different resources at high velocity, making the data difficult to capture, manage, and process within existing relational databases. Hadoop is a tool for storing and processing large datasets in parallel across a cluster of machines in a distributed environment. Hadoop brings many benefits, such as flexibility, scalability, and high fault tolerance; however, it faces challenges in terms of data access time, I/O operations, and duplicate computations, which result in extra overhead, resource wastage, and poor performance. Many researchers have used caching mechanisms to tackle these challenges. For example, they have presented approaches to improve data access time, enhance the data locality rate, remove repetitive calculations, reduce the number of I/O operations, decrease job execution time, and increase resource efficiency. In the current study, we provide a comprehensive overview of caching strategies to improve Hadoop performance. Additionally, a novel classification is introduced based on cache utilization. Using this classification, we analyze the impact on Hadoop performance and discuss the advantages and disadvantages of each group. Finally, a novel hybrid approach called Hybrid Intelligent Cache (HIC), which combines the benefits of two methods from different groups, H-SVM-LRU and CLQLMRS, is presented. Experimental results show that our hybrid method achieves an average improvement of 31.2% in job execution time.
Submitted 23 October, 2023;
originally announced October 2023.
-
Hadoop-Oriented SVM-LRU (H-SVM-LRU): An Intelligent Cache Replacement Algorithm to Improve MapReduce Performance
Authors:
Rana Ghazali,
Sahar Adabi,
Ali Rezaee,
Douglas G. Down,
Ali Movaghar
Abstract:
Modern applications can generate a large amount of data from different sources with high velocity, a combination that is difficult to store and process via traditional tools. Hadoop is one framework that is used for the parallel processing of a large amount of data in a distributed environment; however, various challenges can lead to poor performance. Two particular issues that can limit performance are the high access time for I/O operations and the recomputation of intermediate data. The combination of these two issues can result in resource wastage. In recent years, there have been attempts to overcome these problems by using caching mechanisms. Due to cache space limitations, it is crucial to use this space efficiently and avoid cache pollution (the cache holding data that is not used in the future). We propose Hadoop-oriented SVM-LRU (H-SVM-LRU) to improve Hadoop performance. For this purpose, we use an intelligent cache replacement algorithm, SVM-LRU, that combines the well-known LRU mechanism with a machine learning algorithm, SVM, to classify cached data into two groups based on their future usage. Experimental results show a significant decrease in execution time as a result of an increased cache hit ratio, leading to a positive impact on Hadoop performance.
Submitted 28 September, 2023;
originally announced September 2023.
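The abstract above describes combining LRU with a classifier that splits cached data into two groups by expected future use. A minimal Python sketch of that general idea follows. It is not the authors' H-SVM-LRU implementation: the class name `ClassifiedLRUCache`, the `will_be_reused` predicate (standing in for a trained SVM), and the rule of evicting the least recently used entry predicted as not reused are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class ClassifiedLRUCache:
    """LRU cache whose eviction is guided by a reuse predictor (illustrative only)."""

    def __init__(self, capacity: int, will_be_reused: Callable[[Hashable], bool]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Hypothetical stand-in for a trained classifier such as an SVM.
        self.will_be_reused = will_be_reused
        # Keys are kept in recency order: least recently used first.
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None  # cache miss
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._evict()

    def _evict(self) -> None:
        # Assumed rule: first drop the least recently used entry that the
        # predictor labels as "not reused", to limit cache pollution.
        for key in self._items:
            if not self.will_be_reused(key):
                del self._items[key]
                return
        # Every entry is predicted to be reused: fall back to plain LRU.
        self._items.popitem(last=False)


# Usage: keys starting with "hot" are (by assumption) predicted to be reused.
cache = ClassifiedLRUCache(2, lambda k: str(k).startswith("hot"))
cache.put("hot1", 1)
cache.put("cold1", 2)
cache.put("hot2", 3)  # plain LRU would evict "hot1"; this sketch evicts "cold1"
```

In this sketch the predictor only influences which entry is evicted; recency ordering is otherwise unchanged, so the cache degrades to ordinary LRU when every entry is predicted to be reused.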
-
Demand Forecasting for Platelet Usage: from Univariate Time Series to Multivariate Models
Authors:
Maryam Motamedi,
Jessica Dawson,
Na Li,
Douglas G. Down,
Nancy M. Heddle
Abstract:
Platelet products are expensive and have very short shelf lives. As usage rates for platelets are highly variable, the effective management of platelet demand and supply is very important yet challenging. The primary goal of this paper is to present an efficient forecasting model for platelet demand at Canadian Blood Services (CBS). To accomplish this goal, four different demand forecasting methods, ARIMA (Autoregressive Integrated Moving Average), Prophet, lasso regression (least absolute shrinkage and selection operator), and LSTM (Long Short-Term Memory) networks, are utilized and evaluated. We use a large clinical dataset for a centralized blood distribution centre for four hospitals in Hamilton, Ontario, spanning from 2010 to 2018 and consisting of daily platelet transfusions along with information such as the product specifications, the recipients' characteristics, and the recipients' laboratory test results. This study is the first to apply a range of methods, from statistical time series models to data-driven regression and a machine learning technique, to platelet transfusion using clinical predictors and with different amounts of data. We find that the multivariate approaches have the highest accuracy in general; however, if sufficient data are available, a simpler time series approach such as ARIMA appears to be sufficient. We also comment on the approach to choosing clinical indicators (inputs) for the multivariate models.
Submitted 23 December, 2022; v1 submitted 6 January, 2021;
originally announced January 2021.
-
SEH: Size Estimate Hedging for Single-Server Queues
Authors:
Maryam Akbari-Moghaddam,
Douglas G. Down
Abstract:
For a single-server system, Shortest Remaining Processing Time (SRPT) is an optimal size-based policy. In this paper, we discuss scheduling a single-server system when exact information about the jobs' processing times is not available. When the SRPT policy uses estimated processing times, the underestimation of large jobs can significantly degrade performance. We propose a simple heuristic, Size Estimate Hedging (SEH), that only uses estimated processing times for scheduling decisions. A job's priority is increased dynamically according to an SRPT rule until it is determined that the job is underestimated, at which time the priority is frozen. Numerical results suggest that SEH has desirable performance for estimation error variance that is consistent with what is seen in practice.
Submitted 20 January, 2023; v1 submitted 29 December, 2020;
originally announced January 2021.
-
Delays at signalised intersections with exhaustive traffic control
Authors:
Marko Boon,
Ivo Adan,
Erik Winands,
Doug Down
Abstract:
In this paper we study a traffic intersection with vehicle-actuated traffic signal control. Traffic lights stay green until all lanes within a group are emptied. Assuming general renewal arrival processes, we derive exact limiting distributions of the delays under Heavy Traffic (HT) conditions. Furthermore, we derive the Light Traffic (LT) limit of the mean delays for intersections with Poisson arrivals, and develop a heuristic adaptation of this limit to capture the LT behaviour for other interarrival-time distributions. We combine the LT and HT results to develop closed-form approximations for the mean delays of vehicles in each lane. These closed-form approximations are quite accurate, very insightful and simple to implement.
Submitted 1 August, 2014;
originally announced August 2014.