-
Privacy-Preserving Sharing of Data Analytics Runtime Metrics for Performance Modeling
Authors:
Jonathan Will,
Dominik Scheinert,
Jan Bode,
Cedric Kring,
Seraphin Zunzer,
Lauritz Thamsen
Abstract:
Performance modeling for large-scale data analytics workloads can improve the efficiency of cluster resource allocations and job scheduling. However, the performance of these workloads is influenced by numerous factors, such as job inputs and the assigned cluster resources. As a result, performance models require significant amounts of training data. This data can be obtained by exchanging runtime…
▽ More
Performance modeling for large-scale data analytics workloads can improve the efficiency of cluster resource allocations and job scheduling. However, the performance of these workloads is influenced by numerous factors, such as job inputs and the assigned cluster resources. As a result, performance models require significant amounts of training data. This data can be obtained by exchanging runtime metrics between collaborating organizations. Yet, not all organizations may be inclined to publicly disclose such metadata.
We present a privacy-preserving approach for sharing runtime metrics based on differential privacy and data synthesis. Our evaluation on performance data from 736 Spark job executions indicates that fully anonymized training data largely maintains performance prediction accuracy, particularly when there is minimal original data available. With 30 or fewer available original data samples, the use of synthetic training data resulted only in a one percent reduction in performance model accuracy on average.
△ Less
Submitted 13 March, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Demeter: Resource-Efficient Distributed Stream Processing under Dynamic Loads with Multi-Configuration Optimization
Authors:
Morgan Geldenhuys,
Dominik Scheinert,
Odej Kao,
Lauritz Thamsen
Abstract:
Distributed Stream Processing (DSP) focuses on the near real-time processing of large streams of unbounded data. To increase processing capacities, DSP systems are able to dynamically scale across a cluster of commodity nodes, ensuring a good Quality of Service despite variable workloads. However, selecting scaleout configurations which maximize resource utilization remains a challenge. This is es…
▽ More
Distributed Stream Processing (DSP) focuses on the near real-time processing of large streams of unbounded data. To increase processing capacities, DSP systems are able to dynamically scale across a cluster of commodity nodes, ensuring a good Quality of Service despite variable workloads. However, selecting scaleout configurations which maximize resource utilization remains a challenge. This is especially true in environments where workloads change over time and node failures are all but inevitable. Furthermore, configuration parameters such as memory allocation and checkpointing intervals impact performance and resource usage as well. Sub-optimal configurations easily lead to high operational costs, poor performance, or unacceptable loss of service.
In this paper, we present Demeter, a method for dynamically optimizing key DSP system configuration parameters for resource efficiency. Demeter uses Time Series Forecasting to predict future workloads and Multi-Objective Bayesian Optimization to model runtime behaviors in relation to parameter settings and workload rates. Together, these techniques allow us to determine whether or not enough is known about the predicted workload rate to proactively initiate short-lived parallel profiling runs for data gathering. Once trained, the models guide the adjustment of multiple, potentially dependent system configuration parameters ensuring optimized performance and resource usage in response to changing workload rates. Our experiments on a commodity cluster using Apache Flink demonstrate that Demeter significantly improves the operational efficiency of long-running benchmark jobs.
△ Less
Submitted 4 March, 2024;
originally announced March 2024.
-
Daedalus: Self-Adaptive Horizontal Autoscaling for Resource Efficiency of Distributed Stream Processing Systems
Authors:
Benjamin J. J. Pfister,
Dominik Scheinert,
Morgan K. Geldenhuys,
Odej Kao
Abstract:
Distributed Stream Processing (DSP) systems are capable of processing large streams of unbounded data, offering high throughput and low latencies. To maintain a stable Quality of Service (QoS), these systems require a sufficient allocation of resources. At the same time, over-provisioning can result in wasted energy and high operating costs. Therefore, to maximize resource utilization, autoscaling…
▽ More
Distributed Stream Processing (DSP) systems are capable of processing large streams of unbounded data, offering high throughput and low latencies. To maintain a stable Quality of Service (QoS), these systems require a sufficient allocation of resources. At the same time, over-provisioning can result in wasted energy and high operating costs. Therefore, to maximize resource utilization, autoscaling methods have been proposed that aim to efficiently match the resource allocation with the incoming workload. However, determining when and by how much to scale remains a significant challenge. Given the long-running nature of DSP jobs, scaling actions need to be executed at runtime, and to maintain a good QoS, they should be both accurate and infrequent. To address the challenges of autoscaling, the concept of self-adaptive systems is particularly fitting. These systems monitor themselves and their environment, adapting to changes with minimal need for expert involvement.
This paper introduces Daedalus, a self-adaptive manager for autoscaling in DSP systems, which draws on the principles of self-adaption to address the challenge of efficient autoscaling. Daedalus monitors a running DSP job and builds performance models, aiming to predict the maximum processing capacity at different scale-outs. When combined with time series forecasting to predict future workloads, Daedalus proactively scales DSP jobs, optimizing for maximum throughput and minimizing both latencies and resource usage. We conducted experiments using Apache Flink and Kafka Streams to evaluate the performance of Daedalus against two state-of-the-art approaches. Daedalus was able to achieve comparable latencies while reducing resource usage by up to 71%.
△ Less
Submitted 5 March, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
Towards a Peer-to-Peer Data Distribution Layer for Efficient and Collaborative Resource Optimization of Distributed Dataflow Applications
Authors:
Dominik Scheinert,
Soeren Becker,
Jonathan Will,
Luis Englaender,
Lauritz Thamsen
Abstract:
Performance modeling can help to improve the resource efficiency of clusters and distributed dataflow applications, yet the available modeling data is often limited. Collaborative approaches to performance modeling, characterized by the sharing of performance data or models, have been shown to improve resource efficiency, but there has been little focus on actual data sharing strategies and implem…
▽ More
Performance modeling can help to improve the resource efficiency of clusters and distributed dataflow applications, yet the available modeling data is often limited. Collaborative approaches to performance modeling, characterized by the sharing of performance data or models, have been shown to improve resource efficiency, but there has been little focus on actual data sharing strategies and implementation in production environments. This missing building block holds back the realization of proposed collaborative solutions.
In this paper, we envision, design, and evaluate a peer-to-peer performance data sharing approach for collaborative performance modeling of distributed dataflow applications. Our proposed data distribution layer enables access to performance data in a decentralized manner, thereby facilitating collaborative modeling approaches and allowing for improved prediction capabilities and hence increased resource efficiency. In our evaluation, we assess our approach with regard to deployment, data replication, and data validation, through experiments with a prototype implementation and simulation, demonstrating feasibility and allowing discussion of potential limitations and next steps.
△ Less
Submitted 23 January, 2024; v1 submitted 24 November, 2023;
originally announced November 2023.
-
Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics
Authors:
Dominik Scheinert,
Philipp Wiesner,
Thorsten Wittkopp,
Lauritz Thamsen,
Jonathan Will,
Odej Kao
Abstract:
Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and energy usage, automated approaches are gaining popularity. Most existing methods rely on profiling recurring workloads to find near-optimal solutions over time. Due…
▽ More
Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and energy usage, automated approaches are gaining popularity. Most existing methods rely on profiling recurring workloads to find near-optimal solutions over time. Due to the cold-start problem, this often leads to lengthy and costly profiling phases. However, big data analytics jobs across users can share many common properties: they often operate on similar infrastructure, using similar algorithms implemented in similar frameworks. The potential in sharing aggregated profiling runs to collaboratively address the cold start problem is largely unexplored.
We present Karasu, an approach to more efficient resource configuration profiling that promotes data sharing among users working with similar infrastructures, frameworks, algorithms, or datasets. Karasu trains lightweight performance models using aggregated runtime information of collaborators and combines them into an ensemble method to exploit inherent knowledge of the configuration search space. Moreover, Karasu allows the optimization of multiple objectives simultaneously. Our evaluation is based on performance data from diverse workload executions in a public cloud environment. We show that Karasu is able to significantly boost existing methods in terms of performance, search time, and cost, even when few comparable profiling runs are available that share only partial common characteristics with the target job.
△ Less
Submitted 23 November, 2023; v1 submitted 22 August, 2023;
originally announced August 2023.
-
Evaluation of Data Enrichment Methods for Distributed Stream Processing Systems
Authors:
Dominik Scheinert,
Fabian Casares,
Morgan K. Geldenhuys,
Kevin Styp-Rekowski,
Odej Kao
Abstract:
Stream processing has become a critical component in the architecture of modern applications. With the exponential growth of data generation from sources such as the Internet of Things, business intelligence, and telecommunications, real-time processing of unbounded data streams has become a necessity. DSP systems provide a solution to this challenge, offering high horizontal scalability, fault-to…
▽ More
Stream processing has become a critical component in the architecture of modern applications. With the exponential growth of data generation from sources such as the Internet of Things, business intelligence, and telecommunications, real-time processing of unbounded data streams has become a necessity. DSP systems provide a solution to this challenge, offering high horizontal scalability, fault-tolerant execution, and the ability to process data streams from multiple sources in a single DSP job. Often enough though, data streams need to be enriched with extra information for correct processing, which introduces additional dependencies and potential bottlenecks.
In this paper, we present an in-depth evaluation of data enrichment methods for DSP systems and identify the different use cases for stream processing in modern systems. Using a representative DSP system and conducting the evaluation in a realistic cloud environment, we found that outsourcing enrichment data to the DSP system can improve performance for specific use cases. However, this increased resource consumption highlights the need for stream processing solutions specifically designed for the performance-intensive workloads of cloud-based applications.
△ Less
Submitted 23 November, 2023; v1 submitted 26 July, 2023;
originally announced July 2023.
-
Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?
Authors:
Jonathan Will,
Lauritz Thamsen,
Dominik Scheinert,
Odej Kao
Abstract:
Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial consideration.
In this paper, we analyze the challenge of efficient resource allocation for distributed data processing, focusing on memory. We emphasize that in…
▽ More
Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial consideration.
In this paper, we analyze the challenge of efficient resource allocation for distributed data processing, focusing on memory. We emphasize that in-memory processing with in-memory data processing frameworks can undermine resource efficiency. Based on the findings of our trace data analysis, we compile requirements towards an automated solution for efficient cluster resource allocation.
△ Less
Submitted 7 June, 2023; v1 submitted 6 June, 2023;
originally announced June 2023.
-
PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning
Authors:
Thorsten Wittkopp,
Dominik Scheinert,
Philipp Wiesner,
Alexander Acker,
Odej Kao
Abstract:
Due to the complexity of modern IT services, failures can be manifold, occur at any stage, and are hard to detect. For this reason, anomaly detection applied to monitoring data such as logs allows gaining relevant insights to improve IT services steadily and eradicate failures. However, existing anomaly detection methods that provide high accuracy often rely on labeled training data, which are tim…
▽ More
Due to the complexity of modern IT services, failures can be manifold, occur at any stage, and are hard to detect. For this reason, anomaly detection applied to monitoring data such as logs allows gaining relevant insights to improve IT services steadily and eradicate failures. However, existing anomaly detection methods that provide high accuracy often rely on labeled training data, which are time-consuming to obtain in practice. Therefore, we propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows provided by monitoring systems instead of labeled data. Our attention-based model uses a novel objective function for weak supervision deep learning that accounts for imbalanced data and applies an iterative learning strategy for positive and unknown samples (PU learning) to identify anomalous logs. Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets and detects anomalous log messages with an F1-score of more than 0.99 even within imprecise failure time windows.
△ Less
Submitted 25 January, 2023;
originally announced January 2023.
-
Probabilistic Time Series Forecasting for Adaptive Monitoring in Edge Computing Environments
Authors:
Dominik Scheinert,
Babak Sistani Zadeh Aghdam,
Soeren Becker,
Odej Kao,
Lauritz Thamsen
Abstract:
With increasingly more computation being shifted to the edge of the network, monitoring of critical infrastructures, such as intermediate processing nodes in autonomous driving, is further complicated due to the typically resource-constrained environments. In order to reduce the resource overhead on the network link imposed by monitoring, various methods have been discussed that either follow a fi…
▽ More
With increasingly more computation being shifted to the edge of the network, monitoring of critical infrastructures, such as intermediate processing nodes in autonomous driving, is further complicated due to the typically resource-constrained environments. In order to reduce the resource overhead on the network link imposed by monitoring, various methods have been discussed that either follow a filtering approach for data-emitting devices or conduct dynamic sampling based on employed prediction models. Still, existing methods are mainly requiring adaptive monitoring on edge devices, which demands device reconfigurations, utilizes additional resources, and limits the sophistication of employed models.
In this paper, we propose a sampling-based and cloud-located approach that internally utilizes probabilistic forecasts and hence provides means of quantifying model uncertainties, which can be used for contextualized adaptations of sampling frequencies and consequently relieves constrained network resources. We evaluate our prototype implementation for the monitoring pipeline on a publicly available streaming dataset and demonstrate its positive impact on resource efficiency in a method comparison.
△ Less
Submitted 30 January, 2023; v1 submitted 24 November, 2022;
originally announced November 2022.
-
Perona: Robust Infrastructure Fingerprinting for Resource-Efficient Big Data Analytics
Authors:
Dominik Scheinert,
Soeren Becker,
Jonathan Bader,
Lauritz Thamsen,
Jonathan Will,
Odej Kao
Abstract:
Choosing a good resource configuration for big data analytics applications can be challenging, especially in cloud environments. Automated approaches are desirable as poor decisions can reduce performance and raise costs. The majority of existing automated approaches either build performance models from previous workload executions or conduct iterative resource configuration profiling until a near…
▽ More
Choosing a good resource configuration for big data analytics applications can be challenging, especially in cloud environments. Automated approaches are desirable as poor decisions can reduce performance and raise costs. The majority of existing automated approaches either build performance models from previous workload executions or conduct iterative resource configuration profiling until a near-optimal solution has been found. In doing so, they only obtain an implicit understanding of the underlying infrastructure, which is difficult to transfer to alternative infrastructures and, thus, profiling and modeling insights are not sustained beyond very specific situations.
We present Perona, a novel approach to robust infrastructure fingerprinting for usage in the context of big data analytics. Perona employs common sets and configurations of benchmarking tools for target resources, so that resulting benchmark metrics are directly comparable and ranking is enabled. Insignificant benchmark metrics are discarded by learning a low-dimensional representation of the input metric vector, and previous benchmark executions are taken into consideration for context-awareness as well, allowing to detect resource degradation. We evaluate our approach both on data gathered from our own experiments as well as within related works for resource configuration optimization, demonstrating that Perona captures the characteristics from benchmark runs in a compact manner and produces representations that can be used directly.
△ Less
Submitted 30 January, 2023; v1 submitted 15 November, 2022;
originally announced November 2022.
-
Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing
Authors:
Jonathan Will,
Lauritz Thamsen,
Jonathan Bader,
Dominik Scheinert,
Odej Kao
Abstract:
Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving performance. One crucial aspect of selecting an efficient resource configuration is avoiding memory bottlenecks. By knowing the required memory of a job in advance,…
▽ More
Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving performance. One crucial aspect of selecting an efficient resource configuration is avoiding memory bottlenecks. By knowing the required memory of a job in advance, the search space for an optimal resource configuration can be greatly reduced.
Therefore, we present Ruya, a method for memory-aware optimization of data processing cluster configurations based on iteratively exploring a narrowed-down search space. First, we perform job profiling runs with small samples of the dataset on just a single machine to model the job's memory usage patterns. Second, we prioritize cluster configurations with a suitable amount of total memory and within this reduced search space, we iteratively search for the best cluster configuration with Bayesian optimization. This search process stops once it converges on a configuration that is believed to be optimal for the given job. In our evaluation on a dataset with 1031 Spark and Hadoop jobs, we see a reduction of search iterations to find an optimal configuration by around half, compared to the baseline.
△ Less
Submitted 3 February, 2023; v1 submitted 8 November, 2022;
originally announced November 2022.
-
Reshi: Recommending Resources for Scientific Workflow Tasks on Heterogeneous Infrastructures
Authors:
Jonathan Bader,
Fabian Lehmann,
Alexander Groth,
Lauritz Thamsen,
Dominik Scheinert,
Jonathan Will,
Ulf Leser,
Odej Kao
Abstract:
Scientific workflows typically comprise a multitude of different processing steps which often are executed in parallel on different partitions of the input data. These executions, in turn, must be scheduled on the compute nodes of the computational infrastructure at hand. This assignment is complicated by the facts that (a) tasks typically have highly heterogeneous resource requirements and (b) in…
▽ More
Scientific workflows typically comprise a multitude of different processing steps which often are executed in parallel on different partitions of the input data. These executions, in turn, must be scheduled on the compute nodes of the computational infrastructure at hand. This assignment is complicated by the facts that (a) tasks typically have highly heterogeneous resource requirements and (b) in many infrastructures, compute nodes offer highly heterogeneous resources. In consequence, predictions of the runtime of a given task on a given node, as required by many scheduling algorithms, are often rather imprecise, which can lead to sub-optimal scheduling decisions.
We propose Reshi, a method for recommending task-node assignments during workflow execution that can cope with heterogeneous tasks and heterogeneous nodes. Reshi approaches the problem as a regression task, where task-node pairs are modeled as feature vectors over the results of dedicated micro benchmarks and past task executions. Based on these features, Reshi trains a regression tree model to rank and recommend nodes for each ready-to-run task, which can be used as input to a scheduler. For our evaluation, we benchmarked 27 AWS machine types using three representative workflows. We compare Reshi's recommendations with three state-of-the-art schedulers. Our evaluation shows that Reshi outperforms HEFT by a mean makespan reduction of 7.18% and 18.01% assuming a mean task runtime prediction error of 15%.
△ Less
Submitted 17 October, 2022; v1 submitted 16 August, 2022;
originally announced August 2022.
-
Magpie: Automatically Tuning Static Parameters for Distributed File Systems using Deep Reinforcement Learning
Authors:
Houkun Zhu,
Dominik Scheinert,
Lauritz Thamsen,
Kordian Gontarska,
Odej Kao
Abstract:
Distributed file systems are widely used nowadays, yet using their default configurations is often not optimal. At the same time, tuning configuration parameters is typically challenging and time-consuming. It demands expertise and tuning operations can also be expensive. This is especially the case for static parameters, where changes take effect only after a restart of the system or workloads. W…
▽ More
Distributed file systems are widely used nowadays, yet using their default configurations is often not optimal. At the same time, tuning configuration parameters is typically challenging and time-consuming. It demands expertise and tuning operations can also be expensive. This is especially the case for static parameters, where changes take effect only after a restart of the system or workloads. We propose a novel approach, Magpie, which utilizes deep reinforcement learning to tune static parameters by strategically exploring and exploiting configuration parameter spaces. To boost the tuning of the static parameters, our method employs both server and client metrics of distributed file systems to understand the relationship between static parameters and performance. Our empirical evaluation results show that Magpie can noticeably improve the performance of the distributed file system Lustre, where our approach on average achieves 91.8% throughput gains against default configuration after tuning towards single performance indicator optimization, while it reaches 39.7% more throughput gains against the baseline.
△ Less
Submitted 22 July, 2022; v1 submitted 19 July, 2022;
originally announced July 2022.
-
Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing
Authors:
Jonathan Will,
Lauritz Thamsen,
Jonathan Bader,
Dominik Scheinert,
Odej Kao
Abstract:
Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to low resource utilization -- is often challenging, even for expert users such as data engineers. Further, existing automated approaches to resource selection rel…
▽ More
Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to low resource utilization -- is often challenging, even for expert users such as data engineers. Further, existing automated approaches to resource selection rely on the assumption that a job is recurring to learn from previous runs or to warrant the cost of full test runs to learn from. However, this assumption often does not hold since many jobs are too unique.
Therefore, we present Crispy, a method for optimizing data processing cluster configurations based on job profiling runs with small samples of the dataset on just a single machine. Crispy attempts to extrapolate the memory usage for the full dataset to then choose a cluster configuration with enough total memory. In our evaluation on a dataset with 1031 Spark and Hadoop jobs, we see a reduction of job execution costs by 56% compared to the baseline, while on average spending less than ten minutes on profiling runs per job on a consumer-grade laptop.
△ Less
Submitted 10 January, 2023; v1 submitted 28 June, 2022;
originally announced June 2022.
-
Phoebe: QoS-Aware Distributed Stream Processing through Anticipating Dynamic Workloads
Authors:
Morgan K. Geldenhuys,
Dominik Scheinert,
Odej Kao,
Lauritz Thamsen
Abstract:
Distributed Stream Processing systems have become an essential part of big data processing platforms. They are characterized by the high-throughput processing of near to real-time event streams with the goal of delivering low-latency results and thus enabling time-sensitive decision making. At the same time, results are expected to be consistent even in the presence of partial failures where exact…
▽ More
Distributed Stream Processing systems have become an essential part of big data processing platforms. They are characterized by the high-throughput processing of near to real-time event streams with the goal of delivering low-latency results and thus enabling time-sensitive decision making. At the same time, results are expected to be consistent even in the presence of partial failures where exactly-once processing guarantees are required for correctness. Stream processing workloads are oftentimes dynamic in nature which makes static configurations highly inefficient as time goes by. Static resource allocations will almost certainly either negatively impact upon the Quality of Service and/or result in higher operational costs.
In this paper we present Phoebe, a proactive approach to system auto-tuning for Distributed Stream Processing jobs executing on dynamic workloads. Our approach makes use of parallel profiling runs, QoS modeling, and runtime optimization to provide a general solution whereby configuration parameters are automatically tuned to ensure a stable service as well as alignment with recovery time Quality of Service targets. Phoebe makes use of Time Series Forecasting to gain an insight into future workload requirements thereby delivering scaling decisions which are accurate, long-lived, and reliable. Our experiments demonstrate that Phoebe is able to deliver a stable service while at the same time reducing resource over-provisioning.
△ Less
Submitted 20 June, 2022;
originally announced June 2022.
-
Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview
Authors:
Lauritz Thamsen,
Dominik Scheinert,
Jonathan Will,
Jonathan Bader,
Odej Kao
Abstract:
Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires significant insights into expected job runtimes and scaling behavior, resource characteristics, input data distributions, and other factors. Unable to estimate pe…
▽ More
Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires significant insights into expected job runtimes and scaling behavior, resource characteristics, input data distributions, and other factors. Unable to estimate performance accurately, users frequently overprovision resources for their jobs, leading to low resource utilization and high costs. In this paper, we present major building blocks towards a collaborative approach for optimization of data processing cluster configurations based on runtime data and performance models. We believe that runtime data can be shared and used for performance models across different execution contexts, significantly reducing the reliance on the recurrence of individual processing jobs or, else, dedicated job profiling. For this, we describe how the similarity of processing jobs and cluster infrastructures can be employed to combine suitable data points from local and global job executions into accurate performance models. Furthermore, we outline approaches to performance prediction via more context-aware and reusable models. Finally, we lay out how metrics from previous executions can be combined with runtime monitoring to effectively re-configure models and clusters dynamically.
△ Less
Submitted 1 June, 2022;
originally announced June 2022.
-
Cucumber: Renewable-Aware Admission Control for Delay-Tolerant Cloud and Edge Workloads
Authors:
Philipp Wiesner,
Dominik Scheinert,
Thorsten Wittkopp,
Lauritz Thamsen,
Odej Kao
Abstract:
The growing electricity demand of cloud and edge computing increases operational costs and will soon have a considerable impact on the environment. A possible countermeasure is equip** IT infrastructure directly with on-site renewable energy sources. Yet, particularly smaller data centers may not be able to use all generated power directly at all times, while feeding it into the public grid or e…
▽ More
The growing electricity demand of cloud and edge computing increases operational costs and will soon have a considerable impact on the environment. A possible countermeasure is equip** IT infrastructure directly with on-site renewable energy sources. Yet, particularly smaller data centers may not be able to use all generated power directly at all times, while feeding it into the public grid or energy storage is often not an option. To maximize the usage of renewable excess energy, we propose Cucumber, an admission control policy that accepts delay-tolerant workloads only if they can be computed within their deadlines without the use of grid energy. Using probabilistic forecasting of computational load, energy consumption, and energy production, Cucumber can be configured towards more optimistic or conservative admission. We evaluate our approach on two scenarios using real solar production forecasts for Berlin, Mexico City, and Cape Town in a simulation environment. For scenarios where excess energy was actually available, our results show that Cucumber's default configuration achieves acceptance rates close to the optimal case and causes 97.0% of accepted workloads to be powered using excess energy, while more conservative admission results in 18.5% reduced acceptance at almost zero grid power usage.
△ Less
Submitted 27 August, 2022; v1 submitted 5 May, 2022;
originally announced May 2022.
-
Efficient Runtime Profiling for Black-box Machine Learning Services on Sensor Streams
Authors:
Soeren Becker,
Dominik Scheinert,
Florian Schmidt,
Odej Kao
Abstract:
In highly distributed environments such as cloud, edge and fog computing, the application of machine learning for automating and optimizing processes is on the rise. Machine learning jobs are frequently applied in streaming conditions, where models are used to analyze data streams originating from e.g. video streams or sensory data. Often the results for particular data samples need to be provided…
▽ More
In highly distributed environments such as cloud, edge and fog computing, the application of machine learning for automating and optimizing processes is on the rise. Machine learning jobs are frequently applied in streaming conditions, where models are used to analyze data streams originating from e.g. video streams or sensory data. Often the results for particular data samples need to be provided in time before the arrival of next data. Thus, enough resources must be provided to ensure the just-in-time processing for the specific data stream. This paper focuses on proposing a runtime modeling strategy for containerized machine learning jobs, which enables the optimization and adaptive adjustment of resources per job and component. Our black-box approach assembles multiple techniques into an efficient runtime profiling method, while making no assumptions about underlying hardware, data streams, or applied machine learning jobs. The results show that our method is able to capture the general runtime behaviour of different machine learning jobs already after a short profiling phase.
△ Less
Submitted 10 March, 2022;
originally announced March 2022.
-
A Taxonomy of Anomalies in Log Data
Authors:
Thorsten Wittkopp,
Philipp Wiesner,
Dominik Scheinert,
Odej Kao
Abstract:
Log data anomaly detection is a core component in the area of artificial intelligence for IT operations. However, the large amount of existing methods makes it hard to choose the right approach for a specific system. A better understanding of different kinds of anomalies, and which algorithms are suitable for detecting them, would support researchers and IT operators. Although a common taxonomy fo…
▽ More
Log data anomaly detection is a core component in the area of artificial intelligence for IT operations. However, the large amount of existing methods makes it hard to choose the right approach for a specific system. A better understanding of different kinds of anomalies, and which algorithms are suitable for detecting them, would support researchers and IT operators. Although a common taxonomy for anomalies already exists, it has not yet been applied specifically to log data, pointing out the characteristics and peculiarities in this domain.
In this paper, we present a taxonomy for different kinds of log data anomalies and introduce a method for analyzing such anomalies in labeled datasets. We applied our taxonomy to the three common benchmark datasets Thunderbird, Spirit, and BGL, and trained five state-of-the-art unsupervised anomaly detection algorithms to evaluate their performance in detecting different kinds of anomalies. Our results show, that the most common anomaly type is also the easiest to predict. Moreover, deep learning-based approaches outperform data mining-based approaches in all anomaly types, but especially when it comes to detecting contextual anomalies.
△ Less
Submitted 26 November, 2021;
originally announced November 2021.
-
On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds
Authors:
Dominik Scheinert,
Alireza Alamgiralem,
Jonathan Bader,
Jonathan Will,
Thorsten Wittkopp,
Lauritz Thamsen
Abstract:
With the growing amount of data, data processing workloads and the management of their resource usage becomes increasingly important. Since managing a dedicated infrastructure is in many situations infeasible or uneconomical, users progressively execute their respective workloads in the cloud. As the configuration of workloads and resources is often challenging, various methods have been proposed…
▽ More
With the growing amount of data, data processing workloads and the management of their resource usage becomes increasingly important. Since managing a dedicated infrastructure is in many situations infeasible or uneconomical, users progressively execute their respective workloads in the cloud. As the configuration of workloads and resources is often challenging, various methods have been proposed that either quickly profile towards a good configuration or determine one based on data from previous runs. Still, performance data to train such methods is often lacking and must be costly collected.
In this paper, we propose a collaborative approach for sharing anonymized workload execution traces among users, mining them for general patterns, and exploiting clusters of historical workloads for future optimizations. We evaluate our prototype implementation for mining workload execution graphs on a publicly available trace dataset and demonstrate the predictive value of workload clusters determined through traces only.
△ Less
Submitted 16 January, 2022; v1 submitted 16 November, 2021;
originally announced November 2021.
-
Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud
Authors:
Jonathan Will,
Onur Arslan,
Jonathan Bader,
Dominik Scheinert,
Lauritz Thamsen
Abstract:
Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing suitable cluster resources for distributed dataflow jobs in both type and number is difficult, especially for users who do not have access to previous performance metrics. One approach to overcoming this issue is to have users share runt…
▽ More
Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing suitable cluster resources for distributed dataflow jobs in both type and number is difficult, especially for users who do not have access to previous performance metrics. One approach to overcoming this issue is to have users share runtime metrics to train context-aware performance models that help find a suitable configuration for the job at hand. A problem when sharing runtime data instead of trained models or model parameters is that the data size can grow substantially over time.
This paper examines several clustering techniques to minimize training data size while kee** the associated performance models accurate. Our results indicate that efficiency gains in data transfer, storage, and model training can be achieved through training data reduction. In the evaluation of our solution on a dataset of runtime data from 930 unique distributed dataflow jobs, we observed that, on average, a 75% data reduction only increases prediction errors by one percentage point.
△ Less
Submitted 11 March, 2022; v1 submitted 15 November, 2021;
originally announced November 2021.
-
LogLAB: Attention-Based Labeling of Log Data Anomalies via Weak Supervision
Authors:
Thorsten Wittkopp,
Philipp Wiesner,
Dominik Scheinert,
Alexander Acker
Abstract:
With increasing scale and complexity of cloud operations, automated detection of anomalies in monitoring data such as logs will be an essential part of managing future IT infrastructures. However, many methods based on artificial intelligence, such as supervised deep learning models, require large amounts of labeled training data to perform well. In practice, this data is rarely available because…
▽ More
With increasing scale and complexity of cloud operations, automated detection of anomalies in monitoring data such as logs will be an essential part of managing future IT infrastructures. However, many methods based on artificial intelligence, such as supervised deep learning models, require large amounts of labeled training data to perform well. In practice, this data is rarely available because labeling log data is expensive, time-consuming, and requires a deep understanding of the underlying system. We present LogLAB, a novel modeling approach for automated labeling of log messages without requiring manual work by experts. Our method relies on estimated failure time windows provided by monitoring systems to produce precise labeled datasets in retrospect. It is based on the attention mechanism and uses a custom objective function for weak supervision deep learning techniques that accounts for imbalanced data. Our evaluation shows that LogLAB consistently outperforms nine benchmark approaches across three different datasets and maintains an F1-score of more than 0.98 even at large failure time windows.
△ Less
Submitted 25 November, 2021; v1 submitted 2 November, 2021;
originally announced November 2021.
-
Let's Wait Awhile: How Temporal Workload Shifting Can Reduce Carbon Emissions in the Cloud
Authors:
Philipp Wiesner,
Ilja Behnke,
Dominik Scheinert,
Kordian Gontarska,
Lauritz Thamsen
Abstract:
Depending on energy sources and demand, the carbon intensity of the public power grid fluctuates over time. Exploiting this variability is an important factor in reducing the emissions caused by data centers. However, regional differences in the availability of low-carbon energy sources make it hard to provide general best practices for when to consume electricity. Moreover, existing research in t…
▽ More
Depending on energy sources and demand, the carbon intensity of the public power grid fluctuates over time. Exploiting this variability is an important factor in reducing the emissions caused by data centers. However, regional differences in the availability of low-carbon energy sources make it hard to provide general best practices for when to consume electricity. Moreover, existing research in this domain focuses mostly on carbon-aware workload migration across geo-distributed data centers, or addresses demand response purely from the perspective of power grid stability and costs.
In this paper, we examine the potential impact of shifting computational workloads towards times where the energy supply is expected to be less carbon-intensive. To this end, we identify characteristics of delay-tolerant workloads and analyze the potential for temporal workload shifting in Germany, Great Britain, France, and California over the year 2020. Furthermore, we experimentally evaluate two workload shifting scenarios in a simulation to investigate the influence of time constraints, scheduling strategies, and the accuracy of carbon intensity forecasts. To accelerate research in the domain of carbon-aware computing and to support the evaluation of novel scheduling algorithms, our simulation framework and datasets are publicly available.
△ Less
Submitted 25 October, 2021;
originally announced October 2021.
-
A2Log: Attentive Augmented Log Anomaly Detection
Authors:
Thorsten Wittkopp,
Alexander Acker,
Sasho Nedelkoski,
Jasmin Bogatinovski,
Dominik Scheinert,
Wu Fan,
Odej Kao
Abstract:
Anomaly detection becomes increasingly important for the dependability and serviceability of IT services. As log lines record events during the execution of IT services, they are a primary source for diagnostics. Thereby, unsupervised methods provide a significant benefit since not all anomalies can be known at training time. Existing unsupervised methods need anomaly examples to obtain a suitable…
▽ More
Anomaly detection becomes increasingly important for the dependability and serviceability of IT services. As log lines record events during the execution of IT services, they are a primary source for diagnostics. Thereby, unsupervised methods provide a significant benefit since not all anomalies can be known at training time. Existing unsupervised methods need anomaly examples to obtain a suitable decision boundary required for the anomaly detection task. This requirement poses practical limitations. Therefore, we develop A2Log, which is an unsupervised anomaly detection method consisting of two steps: Anomaly scoring and anomaly decision. First, we utilize a self-attention neural network to perform the scoring for each log message. Second, we set the decision boundary based on data augmentation of the available normal training data. The method is evaluated on three publicly available datasets and one industry dataset. We show that our approach outperforms existing methods. Furthermore, we utilize available anomaly examples to set optimal decision boundaries to acquire strong baselines. We show that our approach, which determines decision boundaries without utilizing anomaly examples, can reach scores of the strong baselines.
△ Less
Submitted 20 September, 2021;
originally announced September 2021.
-
Khaos: Dynamically Optimizing Checkpointing for Dependable Distributed Stream Processing
Authors:
Morgan K. Geldenhuys,
Benjamin J. J. Pfister,
Dominik Scheinert,
Lauritz Thamsen,
Odej Kao
Abstract:
Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. As such, making timely decisions based on these results is dependent on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance and the ability to recover automatic…
▽ More
Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. As such, making timely decisions based on these results is dependent on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing checkpoint and rollback recovery. However, owing to the statistical probability of partial failures occurring in these distributed environments and the variability of workloads upon which jobs are expected to operate, static configurations will often not meet Quality of Service constraints with low overhead.
In this paper we present Khaos, a new approach which utilizes the parallel processing capabilities of virtual cloud automation technologies for the automatic runtime optimization of fault tolerance configurations in Distributed Stream Processing jobs. Our approach employs three subsequent phases which borrows from the principles of Chaos Engineering: establish the steady-state processing conditions, conduct experiments to better understand how the system performs under failure, and use this knowledge to continuously minimize Quality of Service violations. We implemented Khaos prototypically together with Apache Flink and demonstrate its usefulness experimentally.
△ Less
Submitted 26 January, 2023; v1 submitted 6 September, 2021;
originally announced September 2021.
-
Enel: Context-Aware Dynamic Scaling of Distributed Dataflow Jobs using Graph Propagation
Authors:
Dominik Scheinert,
Houkun Zhu,
Lauritz Thamsen,
Morgan K. Geldenhuys,
Jonathan Will,
Alexander Acker,
Odej Kao
Abstract:
Distributed dataflow systems like Spark and Flink enable the use of clusters for scalable data analytics. While runtime prediction models can be used to initially select appropriate cluster resources given target runtimes, the actual runtime performance of dataflow jobs depends on several factors and varies over time. Yet, in many situations, dynamic scaling can be used to meet formulated runtime…
▽ More
Distributed dataflow systems like Spark and Flink enable the use of clusters for scalable data analytics. While runtime prediction models can be used to initially select appropriate cluster resources given target runtimes, the actual runtime performance of dataflow jobs depends on several factors and varies over time. Yet, in many situations, dynamic scaling can be used to meet formulated runtime targets despite significant performance variance.
This paper presents Enel, a novel dynamic scaling approach that uses message propagation on an attributed graph to model dataflow jobs and, thus, allows for deriving effective rescaling decisions. For this, Enel incorporates descriptive properties that capture the respective execution context, considers statistics from individual dataflow tasks, and propagates predictions through the job graph to eventually find an optimized new scale-out. Our evaluation of Enel with four iterative Spark jobs shows that our approach is able to identify effective rescaling actions, reacting for instance to node failures, and can be reused across different execution contexts.
△ Less
Submitted 26 January, 2022; v1 submitted 27 August, 2021;
originally announced August 2021.
-
Evaluation of Load Prediction Techniques for Distributed Stream Processing
Authors:
Kordian Gontarska,
Morgan Geldenhuys,
Dominik Scheinert,
Philipp Wiesner,
Andreas Polze,
Lauritz Thamsen
Abstract:
Distributed Stream Processing (DSP) systems enable processing large streams of continuous data to produce results in near to real time. They are an essential part of many data-intensive applications and analytics platforms. The rate at which events arrive at DSP systems can vary considerably over time, which may be due to trends, cyclic, and seasonal patterns within the data streams. A priori know…
▽ More
Distributed Stream Processing (DSP) systems enable processing large streams of continuous data to produce results in near to real time. They are an essential part of many data-intensive applications and analytics platforms. The rate at which events arrive at DSP systems can vary considerably over time, which may be due to trends, cyclic, and seasonal patterns within the data streams. A priori knowledge of incoming workloads enables proactive approaches to resource management and optimization tasks such as dynamic scaling, live migration of resources, and the tuning of configuration parameters during run-times, thus leading to a potentially better Quality of Service.
In this paper we conduct a comprehensive evaluation of different load prediction techniques for DSP jobs. We identify three use-cases and formulate requirements for making load predictions specific to DSP jobs. Automatically optimized classical and Deep Learning methods are being evaluated on nine different datasets from typical DSP domains, i.e. the IoT, Web 2.0, and cluster monitoring. We compare model performance with respect to overall accuracy and training duration. Our results show that the Deep Learning methods provide the most accurate load predictions for the majority of the evaluated datasets.
△ Less
Submitted 10 August, 2021;
originally announced August 2021.
-
Bellamy: Reusing Performance Models for Distributed Dataflow Jobs Across Contexts
Authors:
Dominik Scheinert,
Lauritz Thamsen,
Houkun Zhu,
Jonathan Will,
Alexander Acker,
Thorsten Wittkopp,
Odej Kao
Abstract:
Distributed dataflow systems enable the use of clusters for scalable data analytics. However, selecting appropriate cluster resources for a processing job is often not straightforward. Performance models trained on historical executions of a concrete job are helpful in such situations, yet they are usually bound to a specific job execution context (e.g. node type, software versions, job parameters…
▽ More
Distributed dataflow systems enable the use of clusters for scalable data analytics. However, selecting appropriate cluster resources for a processing job is often not straightforward. Performance models trained on historical executions of a concrete job are helpful in such situations, yet they are usually bound to a specific job execution context (e.g. node type, software versions, job parameters) due to the few considered input parameters. Even in case of slight context changes, such supportive models need to be retrained and cannot benefit from historical execution data from related contexts.
This paper presents Bellamy, a novel modeling approach that combines scale-outs, dataset sizes, and runtimes with additional descriptive properties of a dataflow job. It is thereby able to capture the context of a job execution. Moreover, Bellamy is realizing a two-step modeling approach. First, a general model is trained on all the available data for a specific scalable analytics algorithm, hereby incorporating data from different contexts. Subsequently, the general model is optimized for the specific situation at hand, based on the available data for the concrete context. We evaluate our approach on two publicly available datasets consisting of execution data from various dataflow jobs carried out in different environments, showing that Bellamy outperforms state-of-the-art methods.
△ Less
Submitted 17 October, 2021; v1 submitted 29 July, 2021;
originally announced July 2021.
-
C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds
Authors:
Jonathan Will,
Lauritz Thamsen,
Dominik Scheinert,
Jonathan Bader,
Odej Kao
Abstract:
Distributed dataflow systems enable data-parallel processing of large datasets on clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. Yet, selecting appropriate cloud resources for dataflow jobs - that neither lead to bottlenecks nor to low resource utilization - is often challenging, even for expert users such as data engineers.
W…
▽ More
Distributed dataflow systems enable data-parallel processing of large datasets on clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. Yet, selecting appropriate cloud resources for dataflow jobs - that neither lead to bottlenecks nor to low resource utilization - is often challenging, even for expert users such as data engineers.
We present C3O, a collaborative system for optimizing data processing cluster configurations in public clouds based on shared historical runtime data. The shared data is utilized for predicting the runtimes of data processing jobs on different possible cluster configurations, using specialized regression models. These models take the diverse execution contexts of different users into account and exhibit mean absolute errors below 3% in our experimental evaluation with 930 unique Spark jobs.
△ Less
Submitted 1 December, 2021; v1 submitted 28 July, 2021;
originally announced July 2021.
-
Learning Dependencies in Distributed Cloud Applications to Identify and Localize Anomalies
Authors:
Dominik Scheinert,
Alexander Acker,
Lauritz Thamsen,
Morgan K. Geldenhuys,
Odej Kao
Abstract:
Operation and maintenance of large distributed cloud applications can quickly become unmanageably complex, putting human operators under immense stress when problems occur. Utilizing machine learning for identification and localization of anomalies in such systems supports human experts and enables fast mitigation. However, due to the various inter-dependencies of system components, anomalies do n…
▽ More
Operation and maintenance of large distributed cloud applications can quickly become unmanageably complex, putting human operators under immense stress when problems occur. Utilizing machine learning for identification and localization of anomalies in such systems supports human experts and enables fast mitigation. However, due to the various inter-dependencies of system components, anomalies do not only affect their origin but propagate through the distributed system. Taking this into account, we present Arvalus and its variant D-Arvalus, a neural graph transformation method that models system components as nodes and their dependencies and placement as edges to improve the identification and localization of anomalies. Given a series of metric KPIs, our method predicts the most likely system state - either normal or an anomaly class - and performs localization when an anomaly is detected. During our experiments, we simulate a distributed cloud application deployment and synthetically inject anomalies. The evaluation shows the generally good prediction performance of Arvalus and reveals the advantage of D-Arvalus which incorporates information about system component dependencies.
△ Less
Submitted 9 September, 2021; v1 submitted 9 March, 2021;
originally announced March 2021.
-
TELESTO: A Graph Neural Network Model for Anomaly Classification in Cloud Services
Authors:
Dominik Scheinert,
Alexander Acker
Abstract:
Deployment, operation and maintenance of large IT systems becomes increasingly complex and puts human experts under extreme stress when problems occur. Therefore, utilization of machine learning (ML) and artificial intelligence (AI) is applied on IT system operation and maintenance - summarized in the term AIOps. One specific direction aims at the recognition of re-occurring anomaly types to enabl…
▽ More
Deployment, operation and maintenance of large IT systems becomes increasingly complex and puts human experts under extreme stress when problems occur. Therefore, utilization of machine learning (ML) and artificial intelligence (AI) is applied on IT system operation and maintenance - summarized in the term AIOps. One specific direction aims at the recognition of re-occurring anomaly types to enable remediation automation. However, due to IT system specific properties, especially their frequent changes (e.g. software updates, reconfiguration or hardware modernization), recognition of reoccurring anomaly types is challenging. Current methods mainly assume a static dimensionality of provided data. We propose a method that is invariant to dimensionality changes of given data. Resource metric data such as CPU utilization, allocated memory and others are modelled as multivariate time series. The extraction of temporal and spatial features together with the subsequent anomaly classification is realized by utilizing TELESTO, our novel graph convolutional neural network (GCNN) architecture. The experimental evaluation is conducted in a real-world cloud testbed deployment that is hosting two applications. Classification results of injected anomalies on a cassandra database node show that TELESTO outperforms the alternative GCNNs and achieves an overall classification accuracy of 85.1%. Classification results for the other nodes show accuracy values between 85% and 60%.
△ Less
Submitted 29 July, 2021; v1 submitted 25 February, 2021;
originally announced February 2021.