-
Investigating Memory Failure Prediction Across CPU Architectures
Authors:
Qiao Yu,
Wengui Zhang,
Min Zhou,
Jialiang Yu,
Zhenli Sheng,
Jasmin Bogatinovski,
Jorge Cardoso,
Odej Kao
Abstract:
Large-scale datacenters often experience memory failures, where Uncorrectable Errors (UEs) highlight critical malfunction in Dual Inline Memory Modules (DIMMs). Existing approaches primarily utilize Correctable Errors (CEs) to predict UEs, yet they typically neglect how these errors vary between different CPU architectures, especially in terms of Error Correction Code (ECC) applicability. In this…
▽ More
Large-scale datacenters often experience memory failures, where Uncorrectable Errors (UEs) highlight critical malfunction in Dual Inline Memory Modules (DIMMs). Existing approaches primarily utilize Correctable Errors (CEs) to predict UEs, yet they typically neglect how these errors vary between different CPU architectures, especially in terms of Error Correction Code (ECC) applicability. In this paper, we investigate the correlation between CEs and UEs across different CPU architectures, including X86 and ARM. Our analysis identifies unique patterns of memory failure associated with each processor platform. Leveraging Machine Learning (ML) techniques on production datasets, we conduct the memory failure prediction in different processors' platforms, achieving up to 15% improvements in F1-score compared to the existing algorithm. Finally, an MLOps (Machine Learning Operations) framework is provided to consistently improve the failure prediction in the production environment.
△ Less
Submitted 8 June, 2024;
originally announced June 2024.
-
LogRCA: Log-based Root Cause Analysis for Distributed Services
Authors:
Thorsten Wittkopp,
Philipp Wiesner,
Odej Kao
Abstract:
To assist IT service developers and operators in managing their increasingly complex service landscapes, there is a growing effort to leverage artificial intelligence in operations. To speed up troubleshooting, log anomaly detection has received much attention in particular, dealing with the identification of log events that indicate the reasons for a system failure. However, faults often propagat…
▽ More
To assist IT service developers and operators in managing their increasingly complex service landscapes, there is a growing effort to leverage artificial intelligence in operations. To speed up troubleshooting, log anomaly detection has received much attention in particular, dealing with the identification of log events that indicate the reasons for a system failure. However, faults often propagate extensively within systems, which can result in a large number of anomalies being detected by existing approaches. In this case, it can remain very challenging for users to quickly identify the actual root cause of a failure.
We propose LogRCA, a novel method for identifying a minimal set of log lines that together describe a root cause. LogRCA uses a semi-supervised learning approach to deal with rare and unknown errors and is designed to handle noisy data. We evaluated our approach on a large-scale production log data set of 44.3 million log lines, which contains 80 failures, whose root causes were labeled by experts. LogRCA consistently outperforms baselines based on deep learning and statistical analysis in terms of precision and recall to detect candidate root causes. In addition, we investigated the impact of our deployed data balancing approach, demonstrating that it considerably improves performance on rare failures.
△ Less
Submitted 22 May, 2024;
originally announced May 2024.
-
On Software Ageing Indicators in OpenStack
Authors:
Yevhen Yazvinskyi,
Jasmin Bogatinovski,
Jorge Cardoso,
Odej Kao
Abstract:
Distributed systems in general and cloud systems in particular, are susceptible to failures that can lead to substantial economic and data losses, security breaches, and even potential threats to human safety. Software ageing is an example of one such vulnerability. It emerges due to routine re-usage of computational systems units which induce fatigue within the components, resulting in an increas…
▽ More
Distributed systems in general and cloud systems in particular, are susceptible to failures that can lead to substantial economic and data losses, security breaches, and even potential threats to human safety. Software ageing is an example of one such vulnerability. It emerges due to routine re-usage of computational systems units which induce fatigue within the components, resulting in an increased failure rate and potential system breakdown. Due to its stochastic nature, ageing cannot be directly measured, instead ageing indicators as proxies are used. While there are dozens of studies on different ageing indicators, their comprehensive comparison in different settings remains underexplored. In this paper, we compare two ageing indicators in OpenStack as a use case. Specifically, our evaluation compares memory usage (including swap memory) and request response time, as readily available indicators. By executing multiple OpenStack deployments with varying configurations, we conduct a series of experiments and analyze the ageing indicators. Comparative analysis through statistical tests provides valuable insights into the strengths and weaknesses of the utilised ageing indicators. Finally, through an in-depth analysis of other OpenStack failures, we identify underlying failure patterns and their impact on the studied ageing indicators.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
Demeter: Resource-Efficient Distributed Stream Processing under Dynamic Loads with Multi-Configuration Optimization
Authors:
Morgan Geldenhuys,
Dominik Scheinert,
Odej Kao,
Lauritz Thamsen
Abstract:
Distributed Stream Processing (DSP) focuses on the near real-time processing of large streams of unbounded data. To increase processing capacities, DSP systems are able to dynamically scale across a cluster of commodity nodes, ensuring a good Quality of Service despite variable workloads. However, selecting scaleout configurations which maximize resource utilization remains a challenge. This is es…
▽ More
Distributed Stream Processing (DSP) focuses on the near real-time processing of large streams of unbounded data. To increase processing capacities, DSP systems are able to dynamically scale across a cluster of commodity nodes, ensuring a good Quality of Service despite variable workloads. However, selecting scaleout configurations which maximize resource utilization remains a challenge. This is especially true in environments where workloads change over time and node failures are all but inevitable. Furthermore, configuration parameters such as memory allocation and checkpointing intervals impact performance and resource usage as well. Sub-optimal configurations easily lead to high operational costs, poor performance, or unacceptable loss of service.
In this paper, we present Demeter, a method for dynamically optimizing key DSP system configuration parameters for resource efficiency. Demeter uses Time Series Forecasting to predict future workloads and Multi-Objective Bayesian Optimization to model runtime behaviors in relation to parameter settings and workload rates. Together, these techniques allow us to determine whether or not enough is known about the predicted workload rate to proactively initiate short-lived parallel profiling runs for data gathering. Once trained, the models guide the adjustment of multiple, potentially dependent system configuration parameters ensuring optimized performance and resource usage in response to changing workload rates. Our experiments on a commodity cluster using Apache Flink demonstrate that Demeter significantly improves the operational efficiency of long-running benchmark jobs.
△ Less
Submitted 4 March, 2024;
originally announced March 2024.
-
Daedalus: Self-Adaptive Horizontal Autoscaling for Resource Efficiency of Distributed Stream Processing Systems
Authors:
Benjamin J. J. Pfister,
Dominik Scheinert,
Morgan K. Geldenhuys,
Odej Kao
Abstract:
Distributed Stream Processing (DSP) systems are capable of processing large streams of unbounded data, offering high throughput and low latencies. To maintain a stable Quality of Service (QoS), these systems require a sufficient allocation of resources. At the same time, over-provisioning can result in wasted energy and high operating costs. Therefore, to maximize resource utilization, autoscaling…
▽ More
Distributed Stream Processing (DSP) systems are capable of processing large streams of unbounded data, offering high throughput and low latencies. To maintain a stable Quality of Service (QoS), these systems require a sufficient allocation of resources. At the same time, over-provisioning can result in wasted energy and high operating costs. Therefore, to maximize resource utilization, autoscaling methods have been proposed that aim to efficiently match the resource allocation with the incoming workload. However, determining when and by how much to scale remains a significant challenge. Given the long-running nature of DSP jobs, scaling actions need to be executed at runtime, and to maintain a good QoS, they should be both accurate and infrequent. To address the challenges of autoscaling, the concept of self-adaptive systems is particularly fitting. These systems monitor themselves and their environment, adapting to changes with minimal need for expert involvement.
This paper introduces Daedalus, a self-adaptive manager for autoscaling in DSP systems, which draws on the principles of self-adaption to address the challenge of efficient autoscaling. Daedalus monitors a running DSP job and builds performance models, aiming to predict the maximum processing capacity at different scale-outs. When combined with time series forecasting to predict future workloads, Daedalus proactively scales DSP jobs, optimizing for maximum throughput and minimizing both latencies and resource usage. We conducted experiments using Apache Flink and Kafka Streams to evaluate the performance of Daedalus against two state-of-the-art approaches. Daedalus was able to achieve comparable latencies while reducing resource usage by up to 71%.
△ Less
Submitted 5 March, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
Progressing from Anomaly Detection to Automated Log Labeling and Pioneering Root Cause Analysis
Authors:
Thorsten Wittkopp,
Alexander Acker,
Odej Kao
Abstract:
The realm of AIOps is transforming IT landscapes with the power of AI and ML. Despite the challenge of limited labeled data, supervised models show promise, emphasizing the importance of leveraging labels for training, especially in deep learning contexts. This study enhances the field by introducing a taxonomy for log anomalies and exploring automated data labeling to mitigate labeling challenges…
▽ More
The realm of AIOps is transforming IT landscapes with the power of AI and ML. Despite the challenge of limited labeled data, supervised models show promise, emphasizing the importance of leveraging labels for training, especially in deep learning contexts. This study enhances the field by introducing a taxonomy for log anomalies and exploring automated data labeling to mitigate labeling challenges. It goes further by investigating the potential of diverse anomaly detection techniques and their alignment with specific anomaly types. However, the exploration doesn't stop at anomaly detection. The study envisions a future where root cause analysis follows anomaly detection, unraveling the underlying triggers of anomalies. This uncharted territory holds immense potential for revolutionizing IT systems management. In essence, this paper enriches our understanding of anomaly detection, and automated labeling, and sets the stage for transformative root cause analysis. Together, these advances promise more resilient IT systems, elevating operational efficiency and user satisfaction in an ever-evolving technological landscape.
△ Less
Submitted 22 December, 2023;
originally announced December 2023.
-
Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study
Authors:
Qiao Yu,
Wengui Zhang,
Jorge Cardoso,
Odej Kao
Abstract:
In large-scale datacenters, memory failure is a common cause of server crashes, with Uncorrectable Errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily focus on predicting UEs using Correctable Errors (CEs), without fully considering the information provided by error bits. However, error bit patterns have a strong correlation with the occu…
▽ More
In large-scale datacenters, memory failure is a common cause of server crashes, with Uncorrectable Errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily focus on predicting UEs using Correctable Errors (CEs), without fully considering the information provided by error bits. However, error bit patterns have a strong correlation with the occurrence of UEs. In this paper, we present a comprehensive study on the correlation between CEs and UEs, specifically emphasizing the importance of spatio-temporal error bit information. Our analysis reveals a strong correlation between spatio-temporal error bits and UE occurrence. Through evaluations using real-world datasets, we demonstrate that our approach significantly improves prediction performance by 15% in F1-score compared to the state-of-the-art algorithms. Overall, our approach effectively reduces the number of virtual machine interruptions caused by UEs by approximately 59%.
△ Less
Submitted 18 December, 2023; v1 submitted 5 December, 2023;
originally announced December 2023.
-
Predicting Dynamic Memory Requirements for Scientific Workflow Tasks
Authors:
Jonathan Bader,
Nils Diedrich,
Lauritz Thamsen,
Odej Kao
Abstract:
With the increasing amount of data available to scientists in disciplines as diverse as bioinformatics, physics, and remote sensing, scientific workflow systems are becoming increasingly important for composing and executing scalable data analysis pipelines. When writing such workflows, users need to specify the resources to be reserved for tasks so that sufficient resources are allocated on the t…
▽ More
With the increasing amount of data available to scientists in disciplines as diverse as bioinformatics, physics, and remote sensing, scientific workflow systems are becoming increasingly important for composing and executing scalable data analysis pipelines. When writing such workflows, users need to specify the resources to be reserved for tasks so that sufficient resources are allocated on the target cluster infrastructure. Crucially, underestimating a task's memory requirements can result in task failures. Therefore, users often resort to overprovisioning, resulting in significant resource wastage and decreased throughput. In this paper, we propose a novel online method that uses monitoring time series data to predict task memory usage in order to reduce the memory wastage of scientific workflow tasks. Our method predicts a task's runtime, divides it into k equally-sized segments, and learns the peak memory value for each segment depending on the total file input size. We evaluate the prototype implementation of our method using workflows from the publicly available nf-core repository, showing an average memory wastage reduction of 29.48% compared to the best state-of-the-art approach.
△ Less
Submitted 19 March, 2024; v1 submitted 14 November, 2023;
originally announced November 2023.
-
Offloading Real-Time Tasks in IIoT Environments under Consideration of Networking Uncertainties
Authors:
Ilja Behnke,
Philipp Wiesner,
Paul Voelker,
Odej Kao
Abstract:
Offloading is a popular way to overcome the resource and power constraints of networked embedded devices, which are increasingly found in industrial environments. It involves moving resource-intensive computational tasks to a more powerful device on the network, often in close proximity to enable wireless communication. However, many Industrial Internet of Things (IIoT) applications have real-time…
▽ More
Offloading is a popular way to overcome the resource and power constraints of networked embedded devices, which are increasingly found in industrial environments. It involves moving resource-intensive computational tasks to a more powerful device on the network, often in close proximity to enable wireless communication. However, many Industrial Internet of Things (IIoT) applications have real-time constraints. Offloading such tasks over a wireless network with latency uncertainties poses new challenges.
In this paper, we aim to better understand these challenges by proposing a system architecture and scheduler for real-time task offloading in wireless IIoT environments. Based on a prototype, we then evaluate different system configurations and discuss their trade-offs and implications. Our design showed to prevent deadline misses under high load and network uncertainties and was able to outperform a reference scheduler in terms of successful task throughput. Under heavy task load, where the reference scheduler had a success rate of 5%, our design achieved a success rate of 60%.
△ Less
Submitted 31 October, 2023;
originally announced October 2023.
-
Carbon-Awareness in CI/CD
Authors:
Henrik Claßen,
Jonas Thierfeldt,
Julian Tochman-Szewc,
Philipp Wiesner,
Odej Kao
Abstract:
While the environmental impact of digitalization is becoming more and more evident, the climate crisis has become a major issue for society. For instance, data centers alone account for 2.7% of Europe's energy consumption today. A considerable part of this load is accounted for by cloud-based services for automated software development, such as continuous integration and delivery (CI/CD) workflows…
▽ More
While the environmental impact of digitalization is becoming more and more evident, the climate crisis has become a major issue for society. For instance, data centers alone account for 2.7% of Europe's energy consumption today. A considerable part of this load is accounted for by cloud-based services for automated software development, such as continuous integration and delivery (CI/CD) workflows.
In this paper, we discuss opportunities and challenges for greening CI/CD services by better aligning their execution with the availability of low-carbon energy. We propose a system architecture for carbon-aware CI/CD services, which uses historical runtime information and, optionally, user-provided information. We examined the potential effectiveness of different scheduling strategies using real carbon intensity data and 7,392 workflow executions of Github Actions, a popular CI/CD service. Our results show, that user-provided information on workflow deadlines can effectively improve carbon-aware scheduling.
△ Less
Submitted 28 October, 2023;
originally announced October 2023.
-
OpenIncrement: A Unified Framework for Open Set Recognition and Deep Class-Incremental Learning
Authors:
Jiawen Xu,
Claas Grohnfeldt,
Odej Kao
Abstract:
In most works on deep incremental learning research, it is assumed that novel samples are pre-identified for neural network retraining. However, practical deep classifiers often misidentify these samples, leading to erroneous predictions. Such misclassifications can degrade model performance. Techniques like open set recognition offer a means to detect these novel samples, representing a significa…
▽ More
In most works on deep incremental learning research, it is assumed that novel samples are pre-identified for neural network retraining. However, practical deep classifiers often misidentify these samples, leading to erroneous predictions. Such misclassifications can degrade model performance. Techniques like open set recognition offer a means to detect these novel samples, representing a significant area in the machine learning domain.
In this paper, we introduce a deep class-incremental learning framework integrated with open set recognition. Our approach refines class-incrementally learned features to adapt them for distance-based open set recognition. Experimental results validate that our method outperforms state-of-the-art incremental learning techniques and exhibits superior performance in open set recognition compared to baseline methods.
△ Less
Submitted 5 October, 2023;
originally announced October 2023.
-
Lotaru: Locally Predicting Workflow Task Runtimes for Resource Management on Heterogeneous Infrastructures
Authors:
Jonathan Bader,
Fabian Lehmann,
Lauritz Thamsen,
Ulf Leser,
Odej Kao
Abstract:
Many resource management techniques for task scheduling, energy and carbon efficiency, and cost optimization in workflows rely on a-priori task runtime knowledge. Building runtime prediction models on historical data is often not feasible in practice as workflows, their input data, and the cluster infrastructure change. Online methods, on the other hand, which estimate task runtimes on specific ma…
▽ More
Many resource management techniques for task scheduling, energy and carbon efficiency, and cost optimization in workflows rely on a-priori task runtime knowledge. Building runtime prediction models on historical data is often not feasible in practice as workflows, their input data, and the cluster infrastructure change. Online methods, on the other hand, which estimate task runtimes on specific machines while the workflow is running, have to cope with a lack of measurements during start-up. Frequently, scientific workflows are executed on heterogeneous infrastructures consisting of machines with different CPU, I/O, and memory configurations, further complicating predicting runtimes due to different task runtimes on different machine types.
This paper presents Lotaru, a method for locally predicting the runtimes of scientific workflow tasks before they are executed on heterogeneous compute clusters. Crucially, our approach does not rely on historical data and copes with a lack of training data during the start-up. To this end, we use microbenchmarks, reduce the input data to quickly profile the workflow locally, and predict a task's runtime with a Bayesian linear regression based on the gathered data points from the local workflow execution and the microbenchmarks. Due to its Bayesian approach, Lotaru provides uncertainty estimates that can be used for advanced scheduling methods on distributed cluster infrastructures.
In our evaluation with five real-world scientific workflows, our method outperforms two state-of-the-art runtime prediction baselines and decreases the absolute prediction error by more than 12.5%. In a second set of experiments, the prediction performance of our method, using the predicted runtimes for state-of-the-art scheduling, carbon reduction, and cost prediction, enables results close to those achieved with perfect prior knowledge of runtimes.
△ Less
Submitted 13 September, 2023;
originally announced September 2023.
-
Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics
Authors:
Dominik Scheinert,
Philipp Wiesner,
Thorsten Wittkopp,
Lauritz Thamsen,
Jonathan Will,
Odej Kao
Abstract:
Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and energy usage, automated approaches are gaining popularity. Most existing methods rely on profiling recurring workloads to find near-optimal solutions over time. Due…
▽ More
Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and energy usage, automated approaches are gaining popularity. Most existing methods rely on profiling recurring workloads to find near-optimal solutions over time. Due to the cold-start problem, this often leads to lengthy and costly profiling phases. However, big data analytics jobs across users can share many common properties: they often operate on similar infrastructure, using similar algorithms implemented in similar frameworks. The potential in sharing aggregated profiling runs to collaboratively address the cold start problem is largely unexplored.
We present Karasu, an approach to more efficient resource configuration profiling that promotes data sharing among users working with similar infrastructures, frameworks, algorithms, or datasets. Karasu trains lightweight performance models using aggregated runtime information of collaborators and combines them into an ensemble method to exploit inherent knowledge of the configuration search space. Moreover, Karasu allows the optimization of multiple objectives simultaneously. Our evaluation is based on performance data from diverse workload executions in a public cloud environment. We show that Karasu is able to significantly boost existing methods in terms of performance, search time, and cost, even when few comparable profiling runs are available that share only partial common characteristics with the target job.
△ Less
Submitted 23 November, 2023; v1 submitted 22 August, 2023;
originally announced August 2023.
-
Towards Benchmarking Power-Performance Characteristics of Federated Learning Clients
Authors:
Pratik Agrawal,
Philipp Wiesner,
Odej Kao
Abstract:
Federated Learning (FL) is a decentralized machine learning approach where local models are trained on distributed clients, allowing privacy-preserving collaboration by sharing model updates instead of raw data. However, the added communication overhead and increased training time caused by heterogenous data distributions results in higher energy consumption and carbon emissions for achieving simi…
▽ More
Federated Learning (FL) is a decentralized machine learning approach where local models are trained on distributed clients, allowing privacy-preserving collaboration by sharing model updates instead of raw data. However, the added communication overhead and increased training time caused by heterogenous data distributions results in higher energy consumption and carbon emissions for achieving similar model performance than traditional machine learning. At the same time, efficient usage of available energy is an important requirement for battery constrained devices. Because of this, many different approaches on energy-efficient and carbon-efficient FL scheduling and client selection have been published in recent years. However, most of this research oversimplifies power performance characteristics of clients by assuming that they always require the same amount of energy per processed sample throughout training. This overlooks real-world effects arising from operating devices under different power modes or the side effects of running other workloads in parallel. In this work, we take a first look on the impact of such factors and discuss how better power-performance estimates can improve energy-efficient and carbon-efficient FL scheduling.
△ Less
Submitted 16 August, 2023;
originally announced August 2023.
-
Evaluation of Data Enrichment Methods for Distributed Stream Processing Systems
Authors:
Dominik Scheinert,
Fabian Casares,
Morgan K. Geldenhuys,
Kevin Styp-Rekowski,
Odej Kao
Abstract:
Stream processing has become a critical component in the architecture of modern applications. With the exponential growth of data generation from sources such as the Internet of Things, business intelligence, and telecommunications, real-time processing of unbounded data streams has become a necessity. DSP systems provide a solution to this challenge, offering high horizontal scalability, fault-to…
▽ More
Stream processing has become a critical component in the architecture of modern applications. With the exponential growth of data generation from sources such as the Internet of Things, business intelligence, and telecommunications, real-time processing of unbounded data streams has become a necessity. DSP systems provide a solution to this challenge, offering high horizontal scalability, fault-tolerant execution, and the ability to process data streams from multiple sources in a single DSP job. Often enough though, data streams need to be enriched with extra information for correct processing, which introduces additional dependencies and potential bottlenecks.
In this paper, we present an in-depth evaluation of data enrichment methods for DSP systems and identify the different use cases for stream processing in modern systems. Using a representative DSP system and conducting the evaluation in a realistic cloud environment, we found that outsourcing enrichment data to the DSP system can improve performance for specific use cases. However, this increased resource consumption highlights the need for stream processing solutions specifically designed for the performance-intensive workloads of cloud-based applications.
△ Less
Submitted 23 November, 2023; v1 submitted 26 July, 2023;
originally announced July 2023.
-
Vessim: A Testbed for Carbon-Aware Applications and Systems
Authors:
Philipp Wiesner,
Ilja Behnke,
Paul Kilian,
Marvin Steinke,
Odej Kao
Abstract:
To reduce the carbon footprint of computing and stabilize electricity grids, there is an increasing focus on approaches that align the power usage of IT infrastructure with the availability of clean energy. Unfortunately, research on energy-aware and carbon-aware applications, as well as the interfaces between computing and energy systems, remains complex due to the scarcity of available testing e…
▽ More
To reduce the carbon footprint of computing and stabilize electricity grids, there is an increasing focus on approaches that align the power usage of IT infrastructure with the availability of clean energy. Unfortunately, research on energy-aware and carbon-aware applications, as well as the interfaces between computing and energy systems, remains complex due to the scarcity of available testing environments. To this day, almost all new approaches are evaluated on custom simulation testbeds, which leads to repeated development efforts and limited comparability of results.
In this paper, we present Vessim, a co-simulation environment for testing applications and computing systems that interact with their energy systems. Our testbed connects domain-specific simulators for renewable power generation and energy storage, and enables users to implement interfaces to integrate real systems through software and hardware-in-the-loop simulation. Vessim offers an easy-to-use interface, is extendable to new simulators, and provides direct access to historical datasets. We aim to not only accelerate research in carbon-aware computing but also facilitate development and operation, as in continuous testing or digital twins. Vessim is publicly available: https://github.com/dos-group/vessim.
△ Less
Submitted 19 June, 2024; v1 submitted 16 June, 2023;
originally announced June 2023.
-
Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?
Authors:
Jonathan Will,
Lauritz Thamsen,
Dominik Scheinert,
Odej Kao
Abstract:
Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial consideration.
In this paper, we analyze the challenge of efficient resource allocation for distributed data processing, focusing on memory. We emphasize that in…
▽ More
Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial consideration.
In this paper, we analyze the challenge of efficient resource allocation for distributed data processing, focusing on memory. We emphasize that in-memory processing with in-memory data processing frameworks can undermine resource efficiency. Based on the findings of our trace data analysis, we compile requirements towards an automated solution for efficient cluster resource allocation.
△ Less
Submitted 7 June, 2023; v1 submitted 6 June, 2023;
originally announced June 2023.
-
FedZero: Leveraging Renewable Excess Energy in Federated Learning
Authors:
Philipp Wiesner,
Ramin Khalili,
Dennis Grinwald,
Pratik Agrawal,
Lauritz Thamsen,
Odej Kao
Abstract:
Federated Learning (FL) is an emerging machine learning technique that enables distributed model training across data silos or edge devices without data sharing. Yet, FL inevitably introduces inefficiencies compared to centralized model training, which will further increase the already high energy usage and associated carbon emissions of machine learning in the future. One idea to reduce FL's carb…
▽ More
Federated Learning (FL) is an emerging machine learning technique that enables distributed model training across data silos or edge devices without data sharing. Yet, FL inevitably introduces inefficiencies compared to centralized model training, which will further increase the already high energy usage and associated carbon emissions of machine learning in the future. One idea to reduce FL's carbon footprint is to schedule training jobs based on the availability of renewable excess energy that can occur at certain times and places in the grid. However, in the presence of such volatile and unreliable resources, existing FL schedulers cannot always ensure fast, efficient, and fair training.
We propose FedZero, an FL system that operates exclusively on renewable excess energy and spare capacity of compute infrastructure to effectively reduce a training's operational carbon emissions to zero. Using energy and load forecasts, FedZero leverages the spatio-temporal availability of excess resources by selecting clients for fast convergence and fair participation. Our evaluation, based on real solar and load traces, shows that FedZero converges significantly faster than existing approaches under the mentioned constraints while consuming less energy. Furthermore, it is robust to forecasting errors and scalable to tens of thousands of clients.
△ Less
Submitted 10 January, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Towards a Real-Time IoT: Approaches for Incoming Packet Processing in Cyber-Physical Systems
Authors:
Ilja Behnke,
Christoph Blumschein,
Robert Danicki,
Philipp Wiesner,
Lauritz Thamsen,
Odej Kao
Abstract:
Embedded real-time devices for monitoring, controlling, and collaboration purposes in cyber-physical systems are now commonly equipped with IP networking capabilities. However, the reception and processing of IP packets generates workloads in unpredictable frequencies as networks are outside of a developer's control and difficult to anticipate, especially when networks are connected to the interne…
▽ More
Embedded real-time devices for monitoring, controlling, and collaboration purposes in cyber-physical systems are now commonly equipped with IP networking capabilities. However, the reception and processing of IP packets generates workloads in unpredictable frequencies as networks are outside of a developer's control and difficult to anticipate, especially when networks are connected to the internet. As of now, embedded network controllers and IP stacks are not designed for real-time capabilities, even when used in real-time environments and operating systems.
Our work focuses on real-time aware packet reception from open network connections, without a real-time networking infrastructure. This article presents two experimentally evaluated modifications to the IP processing subsystem and embedded network interface controllers of constrained IoT devices. The first, our software approach, introduces early packet classification and priority-aware processing in the network driver. In our experiments this allowed the network subsystem to remain active at a seven-fold increase in network traffic load before disabling the receive interrupts as a last resort. The second, our hardware approach, makes changes to the network interface controller, applying interrupt moderation based on real-time priorities to minimize the number of network-generated interrupts. Furthermore, this article provides an outlook on how the software and hardware approaches can be combined in a co-designed packet receive architecture.
△ Less
Submitted 3 May, 2023;
originally announced May 2023.
-
PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning
Authors:
Thorsten Wittkopp,
Dominik Scheinert,
Philipp Wiesner,
Alexander Acker,
Odej Kao
Abstract:
Due to the complexity of modern IT services, failures can be manifold, occur at any stage, and are hard to detect. For this reason, anomaly detection applied to monitoring data such as logs allows gaining relevant insights to improve IT services steadily and eradicate failures. However, existing anomaly detection methods that provide high accuracy often rely on labeled training data, which are tim…
▽ More
Due to the complexity of modern IT services, failures can be manifold, occur at any stage, and are hard to detect. For this reason, anomaly detection applied to monitoring data such as logs allows gaining relevant insights to improve IT services steadily and eradicate failures. However, existing anomaly detection methods that provide high accuracy often rely on labeled training data, which are time-consuming to obtain in practice. Therefore, we propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows provided by monitoring systems instead of labeled data. Our attention-based model uses a novel objective function for weak supervision deep learning that accounts for imbalanced data and applies an iterative learning strategy for positive and unknown samples (PU learning) to identify anomalous logs. Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets and detects anomalous log messages with an F1-score of more than 0.99 even within imprecise failure time windows.
△ Less
Submitted 25 January, 2023;
originally announced January 2023.
-
First CE Matters: On the Importance of Long Term Properties on Memory Failure Prediction
Authors:
Jasmin Bogatinovski,
Qiao Yu,
Jorge Cardoso,
Odej Kao
Abstract:
Dynamic random access memory failures are a threat to the reliability of data centres as they lead to data loss and system crashes. Timely predictions of memory failures allow for taking preventive measures such as server migration and memory replacement. Thereby, memory failure prediction prevents failures from externalizing, and it is a vital task to improve system reliability. In this paper, we…
▽ More
Dynamic random access memory failures are a threat to the reliability of data centres as they lead to data loss and system crashes. Timely predictions of memory failures allow for taking preventive measures such as server migration and memory replacement. Thereby, memory failure prediction prevents failures from externalizing, and it is a vital task to improve system reliability. In this paper, we revisited the problem of memory failure prediction. We analyzed the correctable errors (CEs) from hardware logs as indicators for a degraded memory state. As memories do not always work with full occupancy, access to faulty memory parts is time distributed. Following this intuition, we observed that important properties for memory failure prediction are distributed through long time intervals. In contrast, related studies, to fit practical constraints, frequently only analyze the CEs from the last fixed-size time interval while ignoring the predating information. Motivated by the observed discrepancy, we study the impact of including the overall (long-range) CE evolution and propose novel features that are calculated incrementally to preserve long-range properties. By coupling the extracted features with machine learning methods, we learn a predictive model to anticipate upcoming failures three hours in advance while improving the average relative precision and recall for 21% and 19% accordingly. We evaluated our methodology on real-world memory failures from the server fleet of a large cloud provider, justifying its validity and practicality.
△ Less
Submitted 21 November, 2022;
originally announced December 2022.
-
Probabilistic Time Series Forecasting for Adaptive Monitoring in Edge Computing Environments
Authors:
Dominik Scheinert,
Babak Sistani Zadeh Aghdam,
Soeren Becker,
Odej Kao,
Lauritz Thamsen
Abstract:
With increasingly more computation being shifted to the edge of the network, monitoring of critical infrastructures, such as intermediate processing nodes in autonomous driving, is further complicated due to the typically resource-constrained environments. In order to reduce the resource overhead on the network link imposed by monitoring, various methods have been discussed that either follow a fi…
▽ More
With increasingly more computation being shifted to the edge of the network, monitoring of critical infrastructures, such as intermediate processing nodes in autonomous driving, is further complicated due to the typically resource-constrained environments. In order to reduce the resource overhead on the network link imposed by monitoring, various methods have been discussed that either follow a filtering approach for data-emitting devices or conduct dynamic sampling based on employed prediction models. Still, existing methods are mainly requiring adaptive monitoring on edge devices, which demands device reconfigurations, utilizes additional resources, and limits the sophistication of employed models.
In this paper, we propose a sampling-based and cloud-located approach that internally utilizes probabilistic forecasts and hence provides means of quantifying model uncertainties, which can be used for contextualized adaptations of sampling frequencies and consequently relieves constrained network resources. We evaluate our prototype implementation for the monitoring pipeline on a publicly available streaming dataset and demonstrate its positive impact on resource efficiency in a method comparison.
△ Less
Submitted 30 January, 2023; v1 submitted 24 November, 2022;
originally announced November 2022.
-
Towards Advanced Monitoring for Scientific Workflows
Authors:
Jonathan Bader,
Joel Witzke,
Soeren Becker,
Ansgar Lößer,
Fabian Lehmann,
Leon Doehler,
Anh Duc Vu,
Odej Kao
Abstract:
Scientific workflows consist of thousands of highly parallelized tasks executed in a distributed environment involving many components. Automatic tracing and investigation of the components' and tasks' performance metrics, traces, and behavior are necessary to support the end user with a level of abstraction since the large amount of data cannot be analyzed manually. The execution and monitoring o…
▽ More
Scientific workflows consist of thousands of highly parallelized tasks executed in a distributed environment involving many components. Automatic tracing and investigation of the components' and tasks' performance metrics, traces, and behavior are necessary to support the end user with a level of abstraction since the large amount of data cannot be analyzed manually. The execution and monitoring of scientific workflows involves many components, the cluster infrastructure, its resource manager, the workflow, and the workflow tasks. All components in such an execution environment access different monitoring metrics and provide metrics on different abstraction levels. The combination and analysis of observed metrics from different components and their interdependencies are still widely unregarded.
We specify four different monitoring layers that can serve as an architectural blueprint for the monitoring responsibilities and the interactions of components in the scientific workflow execution context. We describe the different monitoring metrics subject to the four layers and how the layers interact. Finally, we examine five state-of-the-art scientific workflow management systems (SWMS) in order to assess which steps are needed to enable our four-layer-based approach.
△ Less
Submitted 18 July, 2023; v1 submitted 23 November, 2022;
originally announced November 2022.
-
Leveraging Reinforcement Learning for Task Resource Allocation in Scientific Workflows
Authors:
Jonathan Bader,
Nicolas Zunker,
Soeren Becker,
Odej Kao
Abstract:
Scientific workflows are designed as directed acyclic graphs (DAGs) and consist of multiple dependent task definitions. They are executed over a large amount of data, often resulting in thousands of tasks with heterogeneous compute requirements and long runtimes, even on cluster infrastructures. In order to optimize the workflow performance, enough resources, e.g., CPU and memory, need to be provi…
▽ More
Scientific workflows are designed as directed acyclic graphs (DAGs) and consist of multiple dependent task definitions. They are executed over a large amount of data, often resulting in thousands of tasks with heterogeneous compute requirements and long runtimes, even on cluster infrastructures. In order to optimize the workflow performance, enough resources, e.g., CPU and memory, need to be provisioned for the respective tasks. Typically, workflow systems rely on user resource estimates which are known to be highly error-prone and can result in over- or underprovisioning. While resource overprovisioning leads to high resource wastage, underprovisioning can result in long runtimes or even failed tasks.
In this paper, we propose two different reinforcement learning approaches based on gradient bandits and Q-learning, respectively, in order to minimize resource wastage by selecting suitable CPU and memory allocations. We provide a prototypical implementation in the well-known scientific workflow management system Nextflow, evaluate our approaches with five workflows, and compare them against the default resource configurations and a state-of-the-art feedback loop baseline. The evaluation yields that our reinforcement learning approaches significantly reduce resource wastage compared to the default configuration. Further, our approaches also reduce the allocated CPU hours compared to the state-of-the-art feedback loop by 6.79% and 24.53%.
△ Less
Submitted 18 July, 2023; v1 submitted 22 November, 2022;
originally announced November 2022.
-
Perona: Robust Infrastructure Fingerprinting for Resource-Efficient Big Data Analytics
Authors:
Dominik Scheinert,
Soeren Becker,
Jonathan Bader,
Lauritz Thamsen,
Jonathan Will,
Odej Kao
Abstract:
Choosing a good resource configuration for big data analytics applications can be challenging, especially in cloud environments. Automated approaches are desirable as poor decisions can reduce performance and raise costs. The majority of existing automated approaches either build performance models from previous workload executions or conduct iterative resource configuration profiling until a near…
▽ More
Choosing a good resource configuration for big data analytics applications can be challenging, especially in cloud environments. Automated approaches are desirable as poor decisions can reduce performance and raise costs. The majority of existing automated approaches either build performance models from previous workload executions or conduct iterative resource configuration profiling until a near-optimal solution has been found. In doing so, they only obtain an implicit understanding of the underlying infrastructure, which is difficult to transfer to alternative infrastructures and, thus, profiling and modeling insights are not sustained beyond very specific situations.
We present Perona, a novel approach to robust infrastructure fingerprinting for usage in the context of big data analytics. Perona employs common sets and configurations of benchmarking tools for target resources, so that resulting benchmark metrics are directly comparable and ranking is enabled. Insignificant benchmark metrics are discarded by learning a low-dimensional representation of the input metric vector, and previous benchmark executions are taken into consideration for context-awareness as well, allowing to detect resource degradation. We evaluate our approach both on data gathered from our own experiments as well as within related works for resource configuration optimization, demonstrating that Perona captures the characteristics from benchmark runs in a compact manner and produces representations that can be used directly.
△ Less
Submitted 30 January, 2023; v1 submitted 15 November, 2022;
originally announced November 2022.
-
Federated Learning for Autoencoder-based Condition Monitoring in the Industrial Internet of Things
Authors:
Soeren Becker,
Kevin Styp-Rekowski,
Oliver Vincent Leon Stoll,
Odej Kao
Abstract:
Enabled by the increasing availability of sensor data monitored from production machinery, condition monitoring and predictive maintenance methods are key pillars for an efficient and robust manufacturing production cycle in the Industrial Internet of Things. The employment of machine learning models to detect and predict deteriorating behavior by analyzing a variety of data collected across sever…
▽ More
Enabled by the increasing availability of sensor data monitored from production machinery, condition monitoring and predictive maintenance methods are key pillars for an efficient and robust manufacturing production cycle in the Industrial Internet of Things. The employment of machine learning models to detect and predict deteriorating behavior by analyzing a variety of data collected across several industrial environments shows promising results in recent works, yet also often requires transferring the sensor data to centralized servers located in the cloud. Moreover, although collaborating and sharing knowledge between industry sites yields large benefits, especially in the area of condition monitoring, it is often prohibited due to data privacy issues. To tackle this situation, we propose an Autoencoder-based Federated Learning method utilizing vibration sensor data from rotating machines, that allows for a distributed training on edge devices, located on-premise and close to the monitored machines. Preserving data privacy and at the same time exonerating possibly unreliable network connections of remote sites, our approach enables knowledge transfer across organizational boundaries, without sharing the monitored data. We conducted an evaluation utilizing two real-world datasets as well as multiple testbeds and the results indicate that our method enables a competitive performance compared to previous results, while significantly reducing the resource and network utilization.
△ Less
Submitted 14 November, 2022;
originally announced November 2022.
-
Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing
Authors:
Jonathan Will,
Lauritz Thamsen,
Jonathan Bader,
Dominik Scheinert,
Odej Kao
Abstract:
Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving performance. One crucial aspect of selecting an efficient resource configuration is avoiding memory bottlenecks. By knowing the required memory of a job in advance,…
▽ More
Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving performance. One crucial aspect of selecting an efficient resource configuration is avoiding memory bottlenecks. By knowing the required memory of a job in advance, the search space for an optimal resource configuration can be greatly reduced.
Therefore, we present Ruya, a method for memory-aware optimization of data processing cluster configurations based on iteratively exploring a narrowed-down search space. First, we perform job profiling runs with small samples of the dataset on just a single machine to model the job's memory usage patterns. Second, we prioritize cluster configurations with a suitable amount of total memory and within this reduced search space, we iteratively search for the best cluster configuration with Bayesian optimization. This search process stops once it converges on a configuration that is believed to be optimal for the given job. In our evaluation on a dataset with 1031 Spark and Hadoop jobs, we see a reduction of search iterations to find an optimal configuration by around half, compared to the baseline.
△ Less
Submitted 3 February, 2023; v1 submitted 8 November, 2022;
originally announced November 2022.
-
Macaw: The Machine Learning Magnetometer Calibration Workflow
Authors:
Jonathan Bader,
Kevin Styp-Rekowski,
Leon Doehler,
Soeren Becker,
Odej Kao
Abstract:
In Earth Systems Science, many complex data pipelines combine different data sources and apply data filtering and analysis steps. Typically, such data analysis processes are historically grown and implemented with many sequentially executed scripts. Scientific workflow management systems (SWMS) allow scientists to use their existing scripts and provide support for parallelization, reusability, mon…
▽ More
In Earth Systems Science, many complex data pipelines combine different data sources and apply data filtering and analysis steps. Typically, such data analysis processes are historically grown and implemented with many sequentially executed scripts. Scientific workflow management systems (SWMS) allow scientists to use their existing scripts and provide support for parallelization, reusability, monitoring, or failure handling. However, many scientists still rely on their sequentially called scripts and do not profit from the out-of-the-box advantages a SWMS can provide. In this work, we transform the data analysis processes of a Machine Learning-based approach to calibrate the platform magnetometers of non-dedicated satellites utilizing neural networks into a workflow called Macaw (MAgnetometer CAlibration Workflow). We provide details on the workflow and the steps needed to port these scripts to a scientific workflow. Our experimental evaluation compares the original sequential script executions on the original HPC cluster with our workflow implementation on a commodity cluster. Our results show that through porting, our implementation decreased the allocated CPU hours by 50.2% and the memory hours by 59.5%, leading to significantly less resource wastage. Further, through parallelizing single tasks, we reduced the runtime by 17.5%.
△ Less
Submitted 18 July, 2023; v1 submitted 17 October, 2022;
originally announced October 2022.
-
IoTreeplay: Synchronous Distributed Traffic Replay in IoT Environments
Authors:
Markus Toll,
Ilja Behnke,
Odej Kao
Abstract:
Use-cases in the Internet of Things (IoT) typically involve a high number of interconnected, heterogeneous devices. Due to the criticality of many IoT scenarios, systems and applications need to be tested thoroughly before rollout. Existing staging environments and testing frameworks are able to emulate network properties but fail to deliver actual network-wide traffic control to test systems appl…
▽ More
Use-cases in the Internet of Things (IoT) typically involve a high number of interconnected, heterogeneous devices. Due to the criticality of many IoT scenarios, systems and applications need to be tested thoroughly before rollout. Existing staging environments and testing frameworks are able to emulate network properties but fail to deliver actual network-wide traffic control to test systems application independently. To extend existing frameworks, we present the distributed traffic replaying tool IoTreeplay.
The tool embeds TCPLivePlay into an environment that allows the synchronous replaying of network traffic with multiple endpoints and connections. Replaying takes place in a user-defined network or testbed containing IoT use-cases. Network traffic can be captured and compared to the original trace to evaluate accuracy and reliability. The resulting implementation is able to accurately replay connections within a maximum transmission rate but struggles with deviations from regular TCP connections, like packet loss or connection reset. An evaluation has been performed, measuring individual and aggregated delays between packets, based on the recorded timestamps.
△ Less
Submitted 19 August, 2022;
originally announced August 2022.
-
Reshi: Recommending Resources for Scientific Workflow Tasks on Heterogeneous Infrastructures
Authors:
Jonathan Bader,
Fabian Lehmann,
Alexander Groth,
Lauritz Thamsen,
Dominik Scheinert,
Jonathan Will,
Ulf Leser,
Odej Kao
Abstract:
Scientific workflows typically comprise a multitude of different processing steps which often are executed in parallel on different partitions of the input data. These executions, in turn, must be scheduled on the compute nodes of the computational infrastructure at hand. This assignment is complicated by the facts that (a) tasks typically have highly heterogeneous resource requirements and (b) in…
▽ More
Scientific workflows typically comprise a multitude of different processing steps which often are executed in parallel on different partitions of the input data. These executions, in turn, must be scheduled on the compute nodes of the computational infrastructure at hand. This assignment is complicated by the facts that (a) tasks typically have highly heterogeneous resource requirements and (b) in many infrastructures, compute nodes offer highly heterogeneous resources. In consequence, predictions of the runtime of a given task on a given node, as required by many scheduling algorithms, are often rather imprecise, which can lead to sub-optimal scheduling decisions.
We propose Reshi, a method for recommending task-node assignments during workflow execution that can cope with heterogeneous tasks and heterogeneous nodes. Reshi approaches the problem as a regression task, where task-node pairs are modeled as feature vectors over the results of dedicated micro benchmarks and past task executions. Based on these features, Reshi trains a regression tree model to rank and recommend nodes for each ready-to-run task, which can be used as input to a scheduler. For our evaluation, we benchmarked 27 AWS machine types using three representative workflows. We compare Reshi's recommendations with three state-of-the-art schedulers. Our evaluation shows that Reshi outperforms HEFT by a mean makespan reduction of 7.18% and 18.01% assuming a mean task runtime prediction error of 15%.
△ Less
Submitted 17 October, 2022; v1 submitted 16 August, 2022;
originally announced August 2022.
-
Network Emulation in Large-Scale Virtual Edge Testbeds: A Note of Caution and the Way Forward
Authors:
Soeren Becker,
Tobias Pfandzelter,
Nils Japke,
David Bermbach,
Odej Kao
Abstract:
The growing research and industry interest in the Internet of Things and the edge computing paradigm has increased the need for cost-efficient virtual testbeds for large-scale distributed applications. Researchers, students, and practitioners need to test and evaluate the interplay of hundreds or thousands of real software components and services connected with a realistic edge network without acc…
▽ More
The growing research and industry interest in the Internet of Things and the edge computing paradigm has increased the need for cost-efficient virtual testbeds for large-scale distributed applications. Researchers, students, and practitioners need to test and evaluate the interplay of hundreds or thousands of real software components and services connected with a realistic edge network without access to physical infrastructure.
While advances in virtualization technologies have enabled parts of this, network emulation as a crucial part in the development of edge testbeds is lagging behind: As we show in this paper, NetEm, the current state-of-the-art network emulation tooling included in the Linux kernel, imposes prohibitive scalability limits. We quantify these limits, investigate possible causes, and present a way forward for network emulation in large-scale virtual edge testbeds based on eBPFs.
△ Less
Submitted 11 August, 2022;
originally announced August 2022.
-
Magpie: Automatically Tuning Static Parameters for Distributed File Systems using Deep Reinforcement Learning
Authors:
Houkun Zhu,
Dominik Scheinert,
Lauritz Thamsen,
Kordian Gontarska,
Odej Kao
Abstract:
Distributed file systems are widely used nowadays, yet using their default configurations is often not optimal. At the same time, tuning configuration parameters is typically challenging and time-consuming. It demands expertise and tuning operations can also be expensive. This is especially the case for static parameters, where changes take effect only after a restart of the system or workloads. W…
▽ More
Distributed file systems are widely used nowadays, yet using their default configurations is often not optimal. At the same time, tuning configuration parameters is typically challenging and time-consuming. It demands expertise and tuning operations can also be expensive. This is especially the case for static parameters, where changes take effect only after a restart of the system or workloads. We propose a novel approach, Magpie, which utilizes deep reinforcement learning to tune static parameters by strategically exploring and exploiting configuration parameter spaces. To boost the tuning of the static parameters, our method employs both server and client metrics of distributed file systems to understand the relationship between static parameters and performance. Our empirical evaluation results show that Magpie can noticeably improve the performance of the distributed file system Lustre, where our approach on average achieves 91.8% throughput gains against default configuration after tuning towards single performance indicator optimization, while it reaches 39.7% more throughput gains against the baseline.
△ Less
Submitted 22 July, 2022; v1 submitted 19 July, 2022;
originally announced July 2022.
-
Leveraging Log Instructions in Log-based Anomaly Detection
Authors:
Jasmin Bogatinovski,
Gjorgji Madjarov,
Sasho Nedelkoski,
Jorge Cardoso,
Odej Kao
Abstract:
Artificial Intelligence for IT Operations (AIOps) describes the process of maintaining and operating large IT systems using diverse AI-enabled methods and tools for, e.g., anomaly detection and root cause analysis, to support the remediation, optimization, and automatic initiation of self-stabilizing IT activities. The core step of any AIOps workflow is anomaly detection, typically performed on hi…
▽ More
Artificial Intelligence for IT Operations (AIOps) describes the process of maintaining and operating large IT systems using diverse AI-enabled methods and tools for, e.g., anomaly detection and root cause analysis, to support the remediation, optimization, and automatic initiation of self-stabilizing IT activities. The core step of any AIOps workflow is anomaly detection, typically performed on high-volume heterogeneous data such as log messages (logs), metrics (e.g., CPU utilization), and distributed traces. In this paper, we propose a method for reliable and practical anomaly detection from system logs. It overcomes the common disadvantage of related works, i.e., the need for a large amount of manually labeled training data, by building an anomaly detection model with log instructions from the source code of 1000+ GitHub projects. The instructions from diverse systems contain rich and heterogenous information about many different normal and abnormal IT events and serve as a foundation for anomaly detection. The proposed method, named ADLILog, combines the log instructions and the data from the system of interest (target system) to learn a deep neural network model through a two-phase learning procedure. The experimental results show that ADLILog outperforms the related approaches by up to 60% on the F1 score while satisfying core non-functional requirements for industrial deployments such as unsupervised design, efficient model updates, and small model sizes.
△ Less
Submitted 7 July, 2022;
originally announced July 2022.
-
Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing
Authors:
Jonathan Will,
Lauritz Thamsen,
Jonathan Bader,
Dominik Scheinert,
Odej Kao
Abstract:
Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to low resource utilization -- is often challenging, even for expert users such as data engineers. Further, existing automated approaches to resource selection rel…
▽ More
Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to low resource utilization -- is often challenging, even for expert users such as data engineers. Further, existing automated approaches to resource selection rely on the assumption that a job is recurring to learn from previous runs or to warrant the cost of full test runs to learn from. However, this assumption often does not hold since many jobs are too unique.
Therefore, we present Crispy, a method for optimizing data processing cluster configurations based on job profiling runs with small samples of the dataset on just a single machine. Crispy attempts to extrapolate the memory usage for the full dataset to then choose a cluster configuration with enough total memory. In our evaluation on a dataset with 1031 Spark and Hadoop jobs, we see a reduction of job execution costs by 56% compared to the baseline, while on average spending less than ten minutes on profiling runs per job on a consumer-grade laptop.
△ Less
Submitted 10 January, 2023; v1 submitted 28 June, 2022;
originally announced June 2022.
-
Phoebe: QoS-Aware Distributed Stream Processing through Anticipating Dynamic Workloads
Authors:
Morgan K. Geldenhuys,
Dominik Scheinert,
Odej Kao,
Lauritz Thamsen
Abstract:
Distributed Stream Processing systems have become an essential part of big data processing platforms. They are characterized by the high-throughput processing of near to real-time event streams with the goal of delivering low-latency results and thus enabling time-sensitive decision making. At the same time, results are expected to be consistent even in the presence of partial failures where exact…
▽ More
Distributed Stream Processing systems have become an essential part of big data processing platforms. They are characterized by the high-throughput processing of near to real-time event streams with the goal of delivering low-latency results and thus enabling time-sensitive decision making. At the same time, results are expected to be consistent even in the presence of partial failures where exactly-once processing guarantees are required for correctness. Stream processing workloads are oftentimes dynamic in nature which makes static configurations highly inefficient as time goes by. Static resource allocations will almost certainly either negatively impact upon the Quality of Service and/or result in higher operational costs.
In this paper we present Phoebe, a proactive approach to system auto-tuning for Distributed Stream Processing jobs executing on dynamic workloads. Our approach makes use of parallel profiling runs, QoS modeling, and runtime optimization to provide a general solution whereby configuration parameters are automatically tuned to ensure a stable service as well as alignment with recovery time Quality of Service targets. Phoebe makes use of Time Series Forecasting to gain an insight into future workload requirements thereby delivering scaling decisions which are accurate, long-lived, and reliable. Our experiments demonstrate that Phoebe is able to deliver a stable service while at the same time reducing resource over-provisioning.
△ Less
Submitted 20 June, 2022;
originally announced June 2022.
-
Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview
Authors:
Lauritz Thamsen,
Dominik Scheinert,
Jonathan Will,
Jonathan Bader,
Odej Kao
Abstract:
Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires significant insights into expected job runtimes and scaling behavior, resource characteristics, input data distributions, and other factors. Unable to estimate pe…
▽ More
Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires significant insights into expected job runtimes and scaling behavior, resource characteristics, input data distributions, and other factors. Unable to estimate performance accurately, users frequently overprovision resources for their jobs, leading to low resource utilization and high costs. In this paper, we present major building blocks towards a collaborative approach for optimization of data processing cluster configurations based on runtime data and performance models. We believe that runtime data can be shared and used for performance models across different execution contexts, significantly reducing the reliance on the recurrence of individual processing jobs or, else, dedicated job profiling. For this, we describe how the similarity of processing jobs and cluster infrastructures can be employed to combine suitable data points from local and global job executions into accurate performance models. Furthermore, we outline approaches to performance prediction via more context-aware and reusable models. Finally, we lay out how metrics from previous executions can be combined with runtime monitoring to effectively re-configure models and clusters dynamically.
△ Less
Submitted 1 June, 2022;
originally announced June 2022.
-
Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters
Authors:
Jonathan Bader,
Fabian Lehmann,
Lauritz Thamsen,
Jonathan Will,
Ulf Leser,
Odej Kao
Abstract:
Many scientific workflow scheduling algorithms need to be informed about task runtimes a-priori to conduct efficient scheduling. In heterogeneous cluster infrastructures, this problem becomes aggravated because these runtimes are required for each task-node pair. Using historical data is often not feasible as logs are typically not retained indefinitely and workloads as well as infrastructure chan…
▽ More
Many scientific workflow scheduling algorithms need to be informed about task runtimes a-priori to conduct efficient scheduling. In heterogeneous cluster infrastructures, this problem becomes aggravated because these runtimes are required for each task-node pair. Using historical data is often not feasible as logs are typically not retained indefinitely and workloads as well as infrastructure changes. In contrast, online methods, which predict task runtimes on specific nodes while the workflow is running, have to cope with the lack of example runs, especially during the start-up.
In this paper, we present Lotaru, a novel online method for locally estimating task runtimes in scientific workflows on heterogeneous clusters. Lotaru first profiles all nodes of a cluster with a set of short-running and uniform microbenchmarks. Next, it runs the workflow to be scheduled on the user's local machine with drastically reduced data to determine important task characteristics. Based on these measurements, Lotaru learns a Bayesian linear regression model to predict a task's runtime given the input size and finally adjusts the predicted runtime specifically for each task-node pair in the cluster based on the micro-benchmark results. Due to its Bayesian approach, Lotaru can also compute robust uncertainty estimates and provides them as an input for advanced scheduling methods.
Our evaluation with five real-world scientific workflows and different datasets shows that Lotaru significantly outperforms the baselines in terms of prediction errors for homogeneous and heterogeneous clusters.
△ Less
Submitted 23 May, 2022;
originally announced May 2022.
-
Cucumber: Renewable-Aware Admission Control for Delay-Tolerant Cloud and Edge Workloads
Authors:
Philipp Wiesner,
Dominik Scheinert,
Thorsten Wittkopp,
Lauritz Thamsen,
Odej Kao
Abstract:
The growing electricity demand of cloud and edge computing increases operational costs and will soon have a considerable impact on the environment. A possible countermeasure is equip** IT infrastructure directly with on-site renewable energy sources. Yet, particularly smaller data centers may not be able to use all generated power directly at all times, while feeding it into the public grid or e…
▽ More
The growing electricity demand of cloud and edge computing increases operational costs and will soon have a considerable impact on the environment. A possible countermeasure is equip** IT infrastructure directly with on-site renewable energy sources. Yet, particularly smaller data centers may not be able to use all generated power directly at all times, while feeding it into the public grid or energy storage is often not an option. To maximize the usage of renewable excess energy, we propose Cucumber, an admission control policy that accepts delay-tolerant workloads only if they can be computed within their deadlines without the use of grid energy. Using probabilistic forecasting of computational load, energy consumption, and energy production, Cucumber can be configured towards more optimistic or conservative admission. We evaluate our approach on two scenarios using real solar production forecasts for Berlin, Mexico City, and Cape Town in a simulation environment. For scenarios where excess energy was actually available, our results show that Cucumber's default configuration achieves acceptance rates close to the optimal case and causes 97.0% of accepted workloads to be powered using excess energy, while more conservative admission results in 18.5% reduced acceptance at almost zero grid power usage.
△ Less
Submitted 27 August, 2022; v1 submitted 5 May, 2022;
originally announced May 2022.
-
Differentiating Network Flows for Priority-Aware Scheduling of Incoming Packets in Real-Time IoT Systems
Authors:
Christoph Blumschein,
Ilja Behnke,
Lauritz Thamsen,
Odej Kao
Abstract:
When IP-packet processing is unconditionally carried out on behalf of an operating system kernel thread, processing systems can experience overload in high incoming traffic scenarios. This is especially worrying for embedded real-time devices controlling their physical environment in industrial IoT scenarios and automotive systems. We propose an embedded real-time aware IP stack adaption with an e…
▽ More
When IP-packet processing is unconditionally carried out on behalf of an operating system kernel thread, processing systems can experience overload in high incoming traffic scenarios. This is especially worrying for embedded real-time devices controlling their physical environment in industrial IoT scenarios and automotive systems. We propose an embedded real-time aware IP stack adaption with an early demultiplexing scheme for incoming packets and subsequent per-flow aperiodic scheduling. By instrumenting existing embedded IP stacks, rigid prioritization with minimal latency is deployed without the need of further task resources. Simple mitigation techniques can be applied to individual flows, causing hardly measurable overhead while at the same time protecting the system from overload conditions. Our IP stack adaption is able to reduce the low-priority packet processing time by over 86% compared to an unmodified stack. The network subsystem can thereby remain active at a 7x higher general traffic load before disabling the receive IRQ as a last resort to assure deadlines.
△ Less
Submitted 19 April, 2022;
originally announced April 2022.
-
Failure Identification from Unstable Log Data using Deep Learning
Authors:
Jasmin Bogatinovski,
Sasho Nedelkoski,
Li Wu,
Jorge Cardoso,
Odej Kao
Abstract:
The reliability of cloud platforms is of significant relevance because society increasingly relies on complex software systems running on the cloud. To improve it, cloud providers are automating various maintenance tasks, with failure identification frequently being considered. The precondition for automation is the availability of observability tools, with system logs commonly being used. The foc…
▽ More
The reliability of cloud platforms is of significant relevance because society increasingly relies on complex software systems running on the cloud. To improve it, cloud providers are automating various maintenance tasks, with failure identification frequently being considered. The precondition for automation is the availability of observability tools, with system logs commonly being used. The focus of this paper is log-based failure identification. This problem is challenging because of the instability of the log data and the incompleteness of the explicit logging failure coverage within the code. To address the two challenges, we present CLog as a method for failure identification. The key idea presented herein based is on our observation that by representing the log data as sequences of subprocesses instead of sequences of log events, the effect of the unstable log data is reduced. CLog introduces a novel subprocess extraction method that uses context-aware neural network and clustering methods to extract meaningful subprocesses. The direct modeling of log event contexts allows the identification of failures with respect to the abrupt context changes, addressing the challenge of insufficient logging failure coverage. Our experimental results demonstrate that the learned subprocesses representations reduce the instability in the input, allowing CLog to outperform the baselines on the failure identification subproblems - 1) failure detection by 9-24% on F1 score and 2) failure type identification by 7% on the macro averaged F1 score. Further analysis shows the existent negative correlation between the instability in the input event sequences and the detection performance in a model-agnostic manner.
△ Less
Submitted 6 April, 2022;
originally announced April 2022.
-
Data-Driven Approach for Log Instruction Quality Assessment
Authors:
Jasmin Bogatinovski,
Sasho Nedelkoski,
Alexander Acker,
Jorge Cardoso,
Odej Kao
Abstract:
In the current IT world, developers write code while system operators run the code mostly as a black box. The connection between both worlds is typically established with log messages: the developer provides hints to the (unknown) operator, where the cause of an occurred issue is, and vice versa, the operator can report bugs during operation. To fulfil this purpose, developers write log instructio…
▽ More
In the current IT world, developers write code while system operators run the code mostly as a black box. The connection between both worlds is typically established with log messages: the developer provides hints to the (unknown) operator, where the cause of an occurred issue is, and vice versa, the operator can report bugs during operation. To fulfil this purpose, developers write log instructions that are structured text commonly composed of a log level (e.g., "info", "error"), static text ("IP {} cannot be reached"), and dynamic variables (e.g. IP {}). However, as opposed to well-adopted coding practices, there are no widely adopted guidelines on how to write log instructions with good quality properties. For example, a developer may assign a high log level (e.g., "error") for a trivial event that can confuse the operator and increase maintenance costs. Or the static text can be insufficient to hint at a specific issue. In this paper, we address the problem of log quality assessment and provide the first step towards its automation. We start with an in-depth analysis of quality log instruction properties in nine software systems and identify two quality properties: 1) correct log level assignment assessing the correctness of the log level, and 2) sufficient linguistic structure assessing the minimal richness of the static text necessary for verbose event description. Based on these findings, we developed a data-driven approach that adapts deep learning methods for each of the two properties. An extensive evaluation on large-scale open-source systems shows that our approach correctly assesses log level assignments with an accuracy of 0.88, and the sufficient linguistic structure with an F1 score of 0.99, outperforming the baselines. Our study shows the potential of the data-driven methods in assessing instructions quality and aid developers in comprehending and writing better code.
△ Less
Submitted 6 April, 2022;
originally announced April 2022.
-
Activity Report of the Second African Conference on Fundamental and Applied Physics, ACP2021
Authors:
Kétévi A. Assamagan,
Obinna Abah,
Amare Abebe,
Stephen Avery,
Diallo Boye,
Arame Boye-Faye,
Kenneth Cecire,
Mohamed Chabab,
Samuel Chigome,
Simon Connell,
Marie Chantal Cyulinyana,
Mark Macrae Dalton,
Christine Darve,
Lalla Btissam Drissi,
Farida Fassi,
Ulrich Goelach,
Mohamed Gouighri,
Paul Gueye,
Sonia Haddad,
Bjorn von der Heyden,
Oumar Ka,
Gihan Kamel,
Stéphane Kenmoe,
Diouma Kobor,
Tjaart Krüger
, et al. (16 additional authors not shown)
Abstract:
The African School of Fundamental Physics and Applications, also known as the African School of Physics (ASP), was initiated in 2010, as a three-week biennial event, to offer additional training in fundamental and applied physics to African students with a minimum of three-year university education. Since its inception, ASP has grown to be much more than a school. ASP has become a series of activi…
▽ More
The African School of Fundamental Physics and Applications, also known as the African School of Physics (ASP), was initiated in 2010, as a three-week biennial event, to offer additional training in fundamental and applied physics to African students with a minimum of three-year university education. Since its inception, ASP has grown to be much more than a school. ASP has become a series of activities and events with directed ethos towards physics as an engine for development in Africa. One such activity of ASP is the African Conference on Fundamental and Applied Physics (ACP). The first edition of ACP took place during the 2018 edition of ASP at the University of Namibia in Windhoek. In this paper, we report on the second edition of ACP, organized on March 7--11, 2022, as a virtual event.
△ Less
Submitted 6 April, 2022; v1 submitted 4 April, 2022;
originally announced April 2022.
-
SyncMesh: Improving Data Locality for Function-as-a-Service in Meshed Edge Networks
Authors:
Daniel Habenicht,
Kevin Kreutz,
Soeren Becker,
Jonathan Bader,
Lauritz Thamsen,
Odej Kao
Abstract:
The increasing use of Internet of Things devices coincides with more communication and data movement in networks, which can exceed existing network capabilities. These devices often process sensor or user information, where data privacy and latency are a major concern. Therefore, traditional approaches like cloud computing do not fit well, yet new architectures such as edge computing address this…
▽ More
The increasing use of Internet of Things devices coincides with more communication and data movement in networks, which can exceed existing network capabilities. These devices often process sensor or user information, where data privacy and latency are a major concern. Therefore, traditional approaches like cloud computing do not fit well, yet new architectures such as edge computing address this gap. In addition, the Function-as-a-Service (FaaS) paradigm gains in prevalence as a workload execution platform, however the decoupling of storage results in further challenges for highly distributed edge environments.
To address this, we propose SyncMesh, a system to manage, query, and transform data in a scalable and stateless manner by leveraging the capabilities of Function-as-a-Service and at the same time enabling data locality. Furthermore, we provide a prototypical implementation and evaluate it against established centralized and decentralized systems in regard to traffic usage and request times.
The preliminary results indicate that SyncMesh is able to exonerate the network layer and accelerate the transmission of data to clients, while simultaneously improving local data processing.
△ Less
Submitted 28 March, 2022;
originally announced March 2022.
-
Efficient Runtime Profiling for Black-box Machine Learning Services on Sensor Streams
Authors:
Soeren Becker,
Dominik Scheinert,
Florian Schmidt,
Odej Kao
Abstract:
In highly distributed environments such as cloud, edge and fog computing, the application of machine learning for automating and optimizing processes is on the rise. Machine learning jobs are frequently applied in streaming conditions, where models are used to analyze data streams originating from e.g. video streams or sensory data. Often the results for particular data samples need to be provided…
▽ More
In highly distributed environments such as cloud, edge and fog computing, the application of machine learning for automating and optimizing processes is on the rise. Machine learning jobs are frequently applied in streaming conditions, where models are used to analyze data streams originating from e.g. video streams or sensory data. Often the results for particular data samples need to be provided in time before the arrival of next data. Thus, enough resources must be provided to ensure the just-in-time processing for the specific data stream. This paper focuses on proposing a runtime modeling strategy for containerized machine learning jobs, which enables the optimization and adaptive adjustment of resources per job and component. Our black-box approach assembles multiple techniques into an efficient runtime profiling method, while making no assumptions about underlying hardware, data streams, or applied machine learning jobs. The results show that our method is able to capture the general runtime behaviour of different machine learning jobs already after a short profiling phase.
△ Less
Submitted 10 March, 2022;
originally announced March 2022.
-
A Taxonomy of Anomalies in Log Data
Authors:
Thorsten Wittkopp,
Philipp Wiesner,
Dominik Scheinert,
Odej Kao
Abstract:
Log data anomaly detection is a core component in the area of artificial intelligence for IT operations. However, the large amount of existing methods makes it hard to choose the right approach for a specific system. A better understanding of different kinds of anomalies, and which algorithms are suitable for detecting them, would support researchers and IT operators. Although a common taxonomy fo…
▽ More
Log data anomaly detection is a core component in the area of artificial intelligence for IT operations. However, the large amount of existing methods makes it hard to choose the right approach for a specific system. A better understanding of different kinds of anomalies, and which algorithms are suitable for detecting them, would support researchers and IT operators. Although a common taxonomy for anomalies already exists, it has not yet been applied specifically to log data, pointing out the characteristics and peculiarities in this domain.
In this paper, we present a taxonomy for different kinds of log data anomalies and introduce a method for analyzing such anomalies in labeled datasets. We applied our taxonomy to the three common benchmark datasets Thunderbird, Spirit, and BGL, and trained five state-of-the-art unsupervised anomaly detection algorithms to evaluate their performance in detecting different kinds of anomalies. Our results show, that the most common anomaly type is also the easiest to predict. Moreover, deep learning-based approaches outperform data mining-based approaches in all anomaly types, but especially when it comes to detecting contextual anomalies.
△ Less
Submitted 26 November, 2021;
originally announced November 2021.
-
Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters
Authors:
Jonathan Bader,
Lauritz Thamsen,
Svetlana Kulagina,
Jonathan Will,
Henning Meyerhenke,
Odej Kao
Abstract:
Scientific workflow management systems like Nextflow support large-scale data analysis by abstracting away the details of scientific workflows. In these systems, workflows consist of several abstract tasks, of which instances are run in parallel and transform input partitions into output partitions. Resource managers like Kubernetes execute such workflow tasks on cluster infrastructures. However,…
▽ More
Scientific workflow management systems like Nextflow support large-scale data analysis by abstracting away the details of scientific workflows. In these systems, workflows consist of several abstract tasks, of which instances are run in parallel and transform input partitions into output partitions. Resource managers like Kubernetes execute such workflow tasks on cluster infrastructures. However, these resource managers only consider the number of CPUs and the amount of available memory when assigning tasks to resources; they do not consider hardware differences beyond these numbers, while computational speed and memory access rates can differ significantly.
We propose Tarema, a system for allocating task instances to heterogeneous cluster resources during the execution of scalable scientific workflows. First, Tarema profiles the available infrastructure with a set of benchmark programs and groups cluster nodes with similar performance. Second, Tarema uses online monitoring data of tasks, assigning labels to tasks depending on their resource usage. Third, Tarema uses the node groups and task labels to dynamically assign task instances evenly to resources based on resource demand. Our evaluation of a prototype implementation for Kubernetes, using five real-world Nextflow workflows from the popular nf-core framework and two 15-node clusters consisting of different virtual machines, shows a mean reduction of isolated job runtimes by 19.8% compared to popular schedulers in widely-used resource managers and 4.54% compared to the heuristic SJFN, while providing a better cluster usage. Moreover, executing two long-running workflows in parallel and on restricted resources shows that Tarema is able to reduce the runtimes even more while providing a fair cluster usage.
△ Less
Submitted 19 January, 2022; v1 submitted 9 November, 2021;
originally announced November 2021.
-
LOS: Local-Optimistic Scheduling of Periodic Model Training For Anomaly Detection on Sensor Data Streams in Meshed Edge Networks
Authors:
Soeren Becker,
Florian Schmidt,
Lauritz Thamsen,
Ana Juan Ferrer,
Odej Kao
Abstract:
Anomaly detection is increasingly important to handle the amount of sensor data in Edge and Fog environments, Smart Cities, as well as in Industry 4.0. To ensure good results, the utilized ML models need to be updated periodically to adapt to seasonal changes and concept drifts in the sensor data. Although the increasing resource availability at the edge can allow for in-situ execution of model tr…
▽ More
Anomaly detection is increasingly important to handle the amount of sensor data in Edge and Fog environments, Smart Cities, as well as in Industry 4.0. To ensure good results, the utilized ML models need to be updated periodically to adapt to seasonal changes and concept drifts in the sensor data. Although the increasing resource availability at the edge can allow for in-situ execution of model training directly on the devices, it is still often offloaded to fog devices or the cloud.
In this paper, we propose Local-Optimistic Scheduling (LOS), a method for executing periodic ML model training jobs in close proximity to the data sources, without overloading lightweight edge devices. Training jobs are offloaded to nearby neighbor nodes as necessary and the resource consumption is optimized to meet the training period while still ensuring enough resources for further training executions. This scheduling is accomplished in a decentralized, collaborative and opportunistic manner, without full knowledge of the infrastructure and workload. We evaluated our method in an edge computing testbed on real-world datasets. The experimental results show that LOS places the training executions close to the input sensor streams, decreases the deviation between training time and training period by up to 40% and increases the amount of successfully scheduled training jobs compared to an in-situ execution.
△ Less
Submitted 27 September, 2021;
originally announced September 2021.
-
EdgePier: P2P-based Container Image Distribution in Edge Computing Environments
Authors:
Soeren Becker,
Florian Schmidt,
Odej Kao
Abstract:
Edge and fog computing architectures utilize container technologies in order to offer a lightweight application deployment. Container images are stored in registry services and operated by orchestration platforms to download and start the respective applications on nodes of the infrastructure. During large application rollouts, the connection to the registry is prone to become a bottleneck, which…
▽ More
Edge and fog computing architectures utilize container technologies in order to offer a lightweight application deployment. Container images are stored in registry services and operated by orchestration platforms to download and start the respective applications on nodes of the infrastructure. During large application rollouts, the connection to the registry is prone to become a bottleneck, which results in longer provisioning times and deployment latencies. Previous work has mainly addressed this problem by proposing scalable registries, leveraging the BitTorrent protocol or distributed storage to host container images. However, for lightweight and dynamic edge environments the overhead of several dedicated components is not feasible in regard to its interference of the actual workload and is subject to failures due to the introduced complexity.
In this paper we introduce a fully decentralized container registry called EdgePier, that can be deployed across edge sites and is able to decrease container deployment times by utilizing peer-to-peer connections between participating nodes. Image layers are shared without the need for further centralized orchestration entities. The conducted evaluation shows that the provisioning times are improved by up to 65% in comparison to a baseline registry, even with limited bandwidth to the cloud.
△ Less
Submitted 27 September, 2021;
originally announced September 2021.
-
A2Log: Attentive Augmented Log Anomaly Detection
Authors:
Thorsten Wittkopp,
Alexander Acker,
Sasho Nedelkoski,
Jasmin Bogatinovski,
Dominik Scheinert,
Wu Fan,
Odej Kao
Abstract:
Anomaly detection becomes increasingly important for the dependability and serviceability of IT services. As log lines record events during the execution of IT services, they are a primary source for diagnostics. Thereby, unsupervised methods provide a significant benefit since not all anomalies can be known at training time. Existing unsupervised methods need anomaly examples to obtain a suitable…
▽ More
Anomaly detection becomes increasingly important for the dependability and serviceability of IT services. As log lines record events during the execution of IT services, they are a primary source for diagnostics. Thereby, unsupervised methods provide a significant benefit since not all anomalies can be known at training time. Existing unsupervised methods need anomaly examples to obtain a suitable decision boundary required for the anomaly detection task. This requirement poses practical limitations. Therefore, we develop A2Log, which is an unsupervised anomaly detection method consisting of two steps: Anomaly scoring and anomaly decision. First, we utilize a self-attention neural network to perform the scoring for each log message. Second, we set the decision boundary based on data augmentation of the available normal training data. The method is evaluated on three publicly available datasets and one industry dataset. We show that our approach outperforms existing methods. Furthermore, we utilize available anomaly examples to set optimal decision boundaries to acquire strong baselines. We show that our approach, which determines decision boundaries without utilizing anomaly examples, can reach scores of the strong baselines.
△ Less
Submitted 20 September, 2021;
originally announced September 2021.
-
Khaos: Dynamically Optimizing Checkpointing for Dependable Distributed Stream Processing
Authors:
Morgan K. Geldenhuys,
Benjamin J. J. Pfister,
Dominik Scheinert,
Lauritz Thamsen,
Odej Kao
Abstract:
Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. As such, making timely decisions based on these results is dependent on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance and the ability to recover automatic…
▽ More
Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. As such, making timely decisions based on these results is dependent on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing checkpoint and rollback recovery. However, owing to the statistical probability of partial failures occurring in these distributed environments and the variability of workloads upon which jobs are expected to operate, static configurations will often not meet Quality of Service constraints with low overhead.
In this paper we present Khaos, a new approach which utilizes the parallel processing capabilities of virtual cloud automation technologies for the automatic runtime optimization of fault tolerance configurations in Distributed Stream Processing jobs. Our approach employs three subsequent phases which borrows from the principles of Chaos Engineering: establish the steady-state processing conditions, conduct experiments to better understand how the system performs under failure, and use this knowledge to continuously minimize Quality of Service violations. We implemented Khaos prototypically together with Apache Flink and demonstrate its usefulness experimentally.
△ Less
Submitted 26 January, 2023; v1 submitted 6 September, 2021;
originally announced September 2021.