Privacy-Preserving Sharing of Data Analytics Runtime Metrics for Performance Modeling

Authors: Jonathan Will, Dominik Scheinert, Jan Bode, Cedric Kring, Seraphin Zunzer, Lauritz Thamsen

Abstract: Performance modeling for large-scale data analytics workloads can improve the efficiency of cluster resource allocations and job scheduling. However, the performance of these workloads is influenced by numerous factors, such as job inputs and the assigned cluster resources. As a result, performance models require significant amounts of training data. This data can be obtained by exchanging runtime metrics between collaborating organizations. Yet, not all organizations may be inclined to publicly disclose such metadata. We present a privacy-preserving approach for sharing runtime metrics based on differential privacy and data synthesis. Our evaluation on performance data from 736 Spark job executions indicates that fully anonymized training data largely maintains performance prediction accuracy, particularly when there is minimal original data available. With 30 or fewer available original data samples, the use of synthetic training data resulted only in a one percent reduction in performance model accuracy on average.

Submitted 13 March, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

Comments: 4 pages, 4 figures, presented at the WOSP-C workshop at ICPE 2024

arXiv:2403.02129 [pdf, other]

Demeter: Resource-Efficient Distributed Stream Processing under Dynamic Loads with Multi-Configuration Optimization

Authors: Morgan Geldenhuys, Dominik Scheinert, Odej Kao, Lauritz Thamsen

Abstract: Distributed Stream Processing (DSP) focuses on the near real-time processing of large streams of unbounded data. To increase processing capacities, DSP systems are able to dynamically scale across a cluster of commodity nodes, ensuring a good Quality of Service despite variable workloads. However, selecting scaleout configurations which maximize resource utilization remains a challenge. This is especially true in environments where workloads change over time and node failures are all but inevitable. Furthermore, configuration parameters such as memory allocation and checkpointing intervals impact performance and resource usage as well. Sub-optimal configurations easily lead to high operational costs, poor performance, or unacceptable loss of service. In this paper, we present Demeter, a method for dynamically optimizing key DSP system configuration parameters for resource efficiency. Demeter uses Time Series Forecasting to predict future workloads and Multi-Objective Bayesian Optimization to model runtime behaviors in relation to parameter settings and workload rates. Together, these techniques allow us to determine whether or not enough is known about the predicted workload rate to proactively initiate short-lived parallel profiling runs for data gathering. Once trained, the models guide the adjustment of multiple, potentially dependent system configuration parameters ensuring optimized performance and resource usage in response to changing workload rates. Our experiments on a commodity cluster using Apache Flink demonstrate that Demeter significantly improves the operational efficiency of long-running benchmark jobs.

Submitted 4 March, 2024; originally announced March 2024.

Comments: 12 pages, 14 figures, published at ICPE 2024

arXiv:2311.15929 [pdf, other]

doi 10.1145/3624062.3626283

The Common Workflow Scheduler Interface: Status Quo and Future Plans

Authors: Fabian Lehmann, Jonathan Bader, Lauritz Thamsen, Ulf Leser

Abstract: Nowadays, many scientific workflows from different domains, such as Remote Sensing, Astronomy, and Bioinformatics, are executed on large computing infrastructures managed by resource managers. Scientific workflow management systems (SWMS) support the workflow execution and communicate with the infrastructures' resource managers. However, the communication between SWMS and resource managers is complicated by a) inconsistent interfaces between SWMS and resource managers and b) the lack of support for workflow dependencies and workflow-specific properties. To tackle these issues, we developed the Common Workflow Scheduler Interface (CWSI), a simple yet powerful interface to exchange workflow-related information between a SWMS and a resource manager, making the resource manager workflow-aware. The first prototype implementations show that the CWSI can reduce the makespan by up to 25%, even with simple but workflow-aware strategies. In this paper, we show how existing workflow resource management research can be integrated into the CWSI.

Submitted 27 November, 2023; originally announced November 2023.

Journal ref: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W 2023)

arXiv:2311.14600 [pdf, other]

doi 10.1109/BigData59044.2023.10386195

Towards a Peer-to-Peer Data Distribution Layer for Efficient and Collaborative Resource Optimization of Distributed Dataflow Applications

Authors: Dominik Scheinert, Soeren Becker, Jonathan Will, Luis Englaender, Lauritz Thamsen

Abstract: Performance modeling can help to improve the resource efficiency of clusters and distributed dataflow applications, yet the available modeling data is often limited. Collaborative approaches to performance modeling, characterized by the sharing of performance data or models, have been shown to improve resource efficiency, but there has been little focus on actual data sharing strategies and implementation in production environments. This missing building block holds back the realization of proposed collaborative solutions. In this paper, we envision, design, and evaluate a peer-to-peer performance data sharing approach for collaborative performance modeling of distributed dataflow applications. Our proposed data distribution layer enables access to performance data in a decentralized manner, thereby facilitating collaborative modeling approaches and allowing for improved prediction capabilities and hence increased resource efficiency. In our evaluation, we assess our approach with regard to deployment, data replication, and data validation, through experiments with a prototype implementation and simulation, demonstrating feasibility and allowing discussion of potential limitations and next steps.

Submitted 23 January, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

Comments: 7 pages, 4 figures, 2 tables

Journal ref: IEEE BigData (2023) 2339-2345

arXiv:2311.08185 [pdf, other]

Predicting Dynamic Memory Requirements for Scientific Workflow Tasks

Authors: Jonathan Bader, Nils Diedrich, Lauritz Thamsen, Odej Kao

Abstract: With the increasing amount of data available to scientists in disciplines as diverse as bioinformatics, physics, and remote sensing, scientific workflow systems are becoming increasingly important for composing and executing scalable data analysis pipelines. When writing such workflows, users need to specify the resources to be reserved for tasks so that sufficient resources are allocated on the target cluster infrastructure. Crucially, underestimating a task's memory requirements can result in task failures. Therefore, users often resort to overprovisioning, resulting in significant resource wastage and decreased throughput. In this paper, we propose a novel online method that uses monitoring time series data to predict task memory usage in order to reduce the memory wastage of scientific workflow tasks. Our method predicts a task's runtime, divides it into k equally-sized segments, and learns the peak memory value for each segment depending on the total file input size. We evaluate the prototype implementation of our method using workflows from the publicly available nf-core repository, showing an average memory wastage reduction of 29.48% compared to the best state-of-the-art approach.

Submitted 19 March, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

Comments: Paper accepted in 2023 IEEE International Conference on Big Data

arXiv:2309.06918 [pdf, other]

doi 10.1016/j.future.2023.08.022

Lotaru: Locally Predicting Workflow Task Runtimes for Resource Management on Heterogeneous Infrastructures

Authors: Jonathan Bader, Fabian Lehmann, Lauritz Thamsen, Ulf Leser, Odej Kao

Abstract: Many resource management techniques for task scheduling, energy and carbon efficiency, and cost optimization in workflows rely on a-priori task runtime knowledge. Building runtime prediction models on historical data is often not feasible in practice as workflows, their input data, and the cluster infrastructure change. Online methods, on the other hand, which estimate task runtimes on specific machines while the workflow is running, have to cope with a lack of measurements during start-up. Frequently, scientific workflows are executed on heterogeneous infrastructures consisting of machines with different CPU, I/O, and memory configurations, further complicating predicting runtimes due to different task runtimes on different machine types. This paper presents Lotaru, a method for locally predicting the runtimes of scientific workflow tasks before they are executed on heterogeneous compute clusters. Crucially, our approach does not rely on historical data and copes with a lack of training data during the start-up. To this end, we use microbenchmarks, reduce the input data to quickly profile the workflow locally, and predict a task's runtime with a Bayesian linear regression based on the gathered data points from the local workflow execution and the microbenchmarks. Due to its Bayesian approach, Lotaru provides uncertainty estimates that can be used for advanced scheduling methods on distributed cluster infrastructures. In our evaluation with five real-world scientific workflows, our method outperforms two state-of-the-art runtime prediction baselines and decreases the absolute prediction error by more than 12.5%. In a second set of experiments, the prediction performance of our method, using the predicted runtimes for state-of-the-art scheduling, carbon reduction, and cost prediction, enables results close to those achieved with perfect prior knowledge of runtimes.

Submitted 13 September, 2023; originally announced September 2023.

Journal ref: Future Generation Computer Systems, Volume 150, January 2024, Pages 171-185

arXiv:2308.11792 [pdf, other]

doi 10.1109/IPCCC59175.2023.10253884

Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics

Authors: Dominik Scheinert, Philipp Wiesner, Thorsten Wittkopp, Lauritz Thamsen, Jonathan Will, Odej Kao

Abstract: Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and energy usage, automated approaches are gaining popularity. Most existing methods rely on profiling recurring workloads to find near-optimal solutions over time. Due to the cold-start problem, this often leads to lengthy and costly profiling phases. However, big data analytics jobs across users can share many common properties: they often operate on similar infrastructure, using similar algorithms implemented in similar frameworks. The potential in sharing aggregated profiling runs to collaboratively address the cold start problem is largely unexplored. We present Karasu, an approach to more efficient resource configuration profiling that promotes data sharing among users working with similar infrastructures, frameworks, algorithms, or datasets. Karasu trains lightweight performance models using aggregated runtime information of collaborators and combines them into an ensemble method to exploit inherent knowledge of the configuration search space. Moreover, Karasu allows the optimization of multiple objectives simultaneously. Our evaluation is based on performance data from diverse workload executions in a public cloud environment. We show that Karasu is able to significantly boost existing methods in terms of performance, search time, and cost, even when few comparable profiling runs are available that share only partial common characteristics with the target job.

Submitted 23 November, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

Comments: 10 pages, 9 figures

Journal ref: IEEE IPCCC (2023) 403-412

arXiv:2306.03672 [pdf, other]

doi 10.1145/3603719.3603733

Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?

Authors: Jonathan Will, Lauritz Thamsen, Dominik Scheinert, Odej Kao

Abstract: Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial consideration. In this paper, we analyze the challenge of efficient resource allocation for distributed data processing, focusing on memory. We emphasize that in-memory processing with in-memory data processing frameworks can undermine resource efficiency. Based on the findings of our trace data analysis, we compile requirements towards an automated solution for efficient cluster resource allocation.

Submitted 7 June, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

Comments: 4 pages, 3 Figures; ACM SSDBM 2023

ACM Class: C.2.4; C.4; I.2.8; H.2.8; H.2.4

arXiv:2305.15092 [pdf, other]

doi 10.1145/3632775.3639589

FedZero: Leveraging Renewable Excess Energy in Federated Learning

Authors: Philipp Wiesner, Ramin Khalili, Dennis Grinwald, Pratik Agrawal, Lauritz Thamsen, Odej Kao

Abstract: Federated Learning (FL) is an emerging machine learning technique that enables distributed model training across data silos or edge devices without data sharing. Yet, FL inevitably introduces inefficiencies compared to centralized model training, which will further increase the already high energy usage and associated carbon emissions of machine learning in the future. One idea to reduce FL's carbon footprint is to schedule training jobs based on the availability of renewable excess energy that can occur at certain times and places in the grid. However, in the presence of such volatile and unreliable resources, existing FL schedulers cannot always ensure fast, efficient, and fair training. We propose FedZero, an FL system that operates exclusively on renewable excess energy and spare capacity of compute infrastructure to effectively reduce a training's operational carbon emissions to zero. Using energy and load forecasts, FedZero leverages the spatio-temporal availability of excess resources by selecting clients for fast convergence and fair participation. Our evaluation, based on real solar and load traces, shows that FedZero converges significantly faster than existing approaches under the mentioned constraints while consuming less energy. Furthermore, it is robust to forecasting errors and scalable to tens of thousands of clients.

Submitted 10 January, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted for publication at ACM e-Energy '24

arXiv:2305.01985 [pdf, other]

doi 10.1016/j.sysarc.2023.102891

Towards a Real-Time IoT: Approaches for Incoming Packet Processing in Cyber-Physical Systems

Authors: Ilja Behnke, Christoph Blumschein, Robert Danicki, Philipp Wiesner, Lauritz Thamsen, Odej Kao

Abstract: Embedded real-time devices for monitoring, controlling, and collaboration purposes in cyber-physical systems are now commonly equipped with IP networking capabilities. However, the reception and processing of IP packets generates workloads in unpredictable frequencies as networks are outside of a developer's control and difficult to anticipate, especially when networks are connected to the internet. As of now, embedded network controllers and IP stacks are not designed for real-time capabilities, even when used in real-time environments and operating systems. Our work focuses on real-time aware packet reception from open network connections, without a real-time networking infrastructure. This article presents two experimentally evaluated modifications to the IP processing subsystem and embedded network interface controllers of constrained IoT devices. The first, our software approach, introduces early packet classification and priority-aware processing in the network driver. In our experiments this allowed the network subsystem to remain active at a seven-fold increase in network traffic load before disabling the receive interrupts as a last resort. The second, our hardware approach, makes changes to the network interface controller, applying interrupt moderation based on real-time priorities to minimize the number of network-generated interrupts. Furthermore, this article provides an outlook on how the software and hardware approaches can be combined in a co-designed packet receive architecture.

Submitted 3 May, 2023; originally announced May 2023.

Comments: arXiv admin note: text overlap with arXiv:2204.08846

Journal ref: Journal of Systems Architecture. 140 (2023)

arXiv:2304.06414 [pdf, other]

doi 10.1109/IC2E55432.2022.00011

Towards Energy Consumption and Carbon Footprint Testing for AI-driven IoT Services

Authors: Demetris Trihinas, Lauritz Thamsen, Jossekin Beilharz, Moysis Symeonides

Abstract: Energy consumption and carbon emissions are expected to be crucial factors for Internet of Things (IoT) applications. Both the scale and the geo-distribution keep increasing, while Artificial Intelligence (AI) further penetrates the "edge" in order to satisfy the need for highly-responsive and intelligent services. To date, several edge/fog emulators are catering for IoT testing by supporting the deployment and execution of AI-driven IoT services in consolidated test environments. These tools enable the configuration of infrastructures so that they closely resemble edge devices and IoT networks. However, energy consumption and carbon emissions estimations during the testing of AI services are still missing from the current state of IoT testing suites. This study highlights important questions for which developers of AI-driven IoT services need answers, along with a set of observations and challenges, aiming to help researchers designing IoT testing and benchmarking suites to cater to user needs.

Submitted 13 April, 2023; originally announced April 2023.

Comments: Presented at the 2nd International Workshop on Testing Distributed Internet of Things Systems (TDIS 2022)

Journal ref: 2022 IEEE International Conference on Cloud Engineering (IC2E 2022)

arXiv:2302.07652 [pdf, other]

doi 10.1109/CCGrid57682.2023.00025

How Workflow Engines Should Talk to Resource Managers: A Proposal for a Common Workflow Scheduling Interface

Authors: Fabian Lehmann, Jonathan Bader, Friedrich Tschirpke, Lauritz Thamsen, Ulf Leser

Abstract: Scientific workflow management systems (SWMSs) and resource managers together ensure that tasks are scheduled on provisioned resources so that all dependencies are obeyed, and some optimization goal, such as makespan minimization, is achieved. In practice, however, there is no clear separation of scheduling responsibilities between an SWMS and a resource manager because there exists no agreed-upon separation of concerns between their different components. This has two consequences. First, the lack of a standardized API to exchange scheduling information between SWMSs and resource managers hinders portability. It incurs costly adaptations when a component should be replaced by a different one (e.g., an SWMS with another SWMS on the same resource manager). Second, due to overlapping functionalities, current installations often actually have two schedulers, both making partial scheduling decisions under incomplete information, leading to suboptimal workflow scheduling. In this paper, we propose a simple REST interface between SWMSs and resource managers, which allows any SWMS to pass dynamic workflow information to a resource manager, enabling maximally informed scheduling decisions. We provide an implementation of this API as an example, using Nextflow as an SWMS and Kubernetes as a resource manager. Our experiments with nine real-world workflows show that this strategy reduces makespan by up to 25.1% and 10.8% on average compared to the standard Nextflow/Kubernetes configuration. Furthermore, a more widespread implementation of this API would enable leaner code bases, a simpler exchange of components of workflow systems, and a unified place to implement new scheduling algorithms.

Submitted 13 July, 2023; v1 submitted 15 February, 2023; originally announced February 2023.

Journal ref: 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)

arXiv:2211.13729 [pdf, other]

doi 10.1109/BigData55660.2022.10021129

Probabilistic Time Series Forecasting for Adaptive Monitoring in Edge Computing Environments

Authors: Dominik Scheinert, Babak Sistani Zadeh Aghdam, Soeren Becker, Odej Kao, Lauritz Thamsen

Abstract: With increasingly more computation being shifted to the edge of the network, monitoring of critical infrastructures, such as intermediate processing nodes in autonomous driving, is further complicated due to the typically resource-constrained environments. In order to reduce the resource overhead on the network link imposed by monitoring, various methods have been discussed that either follow a filtering approach for data-emitting devices or conduct dynamic sampling based on employed prediction models. Still, existing methods mainly require adaptive monitoring on edge devices, which demands device reconfigurations, utilizes additional resources, and limits the sophistication of employed models. In this paper, we propose a sampling-based and cloud-located approach that internally utilizes probabilistic forecasts and hence provides means of quantifying model uncertainties, which can be used for contextualized adaptations of sampling frequencies and consequently relieves constrained network resources. We evaluate our prototype implementation for the monitoring pipeline on a publicly available streaming dataset and demonstrate its positive impact on resource efficiency in a method comparison.

Submitted 30 January, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

Comments: 6 pages, 5 figures, 2 tables

Journal ref: IEEE BigData (2022) 4583-4588

arXiv:2211.08227 [pdf, other]

doi 10.1109/BigData55660.2022.10020860

Perona: Robust Infrastructure Fingerprinting for Resource-Efficient Big Data Analytics

Authors: Dominik Scheinert, Soeren Becker, Jonathan Bader, Lauritz Thamsen, Jonathan Will, Odej Kao

Abstract: Choosing a good resource configuration for big data analytics applications can be challenging, especially in cloud environments. Automated approaches are desirable as poor decisions can reduce performance and raise costs. The majority of existing automated approaches either build performance models from previous workload executions or conduct iterative resource configuration profiling until a near-optimal solution has been found. In doing so, they only obtain an implicit understanding of the underlying infrastructure, which is difficult to transfer to alternative infrastructures and, thus, profiling and modeling insights are not sustained beyond very specific situations. We present Perona, a novel approach to robust infrastructure fingerprinting for usage in the context of big data analytics. Perona employs common sets and configurations of benchmarking tools for target resources, so that resulting benchmark metrics are directly comparable and ranking is enabled. Insignificant benchmark metrics are discarded by learning a low-dimensional representation of the input metric vector, and previous benchmark executions are taken into consideration for context-awareness as well, allowing the detection of resource degradation. We evaluate our approach both on data gathered from our own experiments as well as within related works for resource configuration optimization, demonstrating that Perona captures the characteristics from benchmark runs in a compact manner and produces representations that can be used directly.

Submitted 30 January, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: 8 pages, 5 figures, 3 tables

Journal ref: IEEE BigData (2022) 209-216

arXiv:2211.04240 [pdf, other]

doi 10.1109/BigData55660.2022.10020295

Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing

Authors: Jonathan Will, Lauritz Thamsen, Jonathan Bader, Dominik Scheinert, Odej Kao

Abstract: Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving performance. One crucial aspect of selecting an efficient resource configuration is avoiding memory bottlenecks. By knowing the required memory of a job in advance, the search space for an optimal resource configuration can be greatly reduced. Therefore, we present Ruya, a method for memory-aware optimization of data processing cluster configurations based on iteratively exploring a narrowed-down search space. First, we perform job profiling runs with small samples of the dataset on just a single machine to model the job's memory usage patterns. Second, we prioritize cluster configurations with a suitable amount of total memory and within this reduced search space, we iteratively search for the best cluster configuration with Bayesian optimization. This search process stops once it converges on a configuration that is believed to be optimal for the given job. In our evaluation on a dataset with 1031 Spark and Hadoop jobs, we see a reduction of search iterations to find an optimal configuration by around half, compared to the baseline.

Submitted 3 February, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

Comments: 9 pages, 5 Figures, 3 Tables; IEEE BigData 2022. arXiv admin note: substantial text overlap with arXiv:2206.13852

ACM Class: C.2.4; I.2.8; I.2.6

Journal ref: 2022 IEEE International Conference on Big Data (Big Data) pp. 161-169

arXiv:2208.07905 [pdf, other]

doi 10.1109/IPCCC55026.2022.9894299

Reshi: Recommending Resources for Scientific Workflow Tasks on Heterogeneous Infrastructures

Authors: Jonathan Bader, Fabian Lehmann, Alexander Groth, Lauritz Thamsen, Dominik Scheinert, Jonathan Will, Ulf Leser, Odej Kao

Abstract: Scientific workflows typically comprise a multitude of different processing steps which often are executed in parallel on different partitions of the input data. These executions, in turn, must be scheduled on the compute nodes of the computational infrastructure at hand. This assignment is complicated by the facts that (a) tasks typically have highly heterogeneous resource requirements and (b) in many infrastructures, compute nodes offer highly heterogeneous resources. In consequence, predictions of the runtime of a given task on a given node, as required by many scheduling algorithms, are often rather imprecise, which can lead to sub-optimal scheduling decisions. We propose Reshi, a method for recommending task-node assignments during workflow execution that can cope with heterogeneous tasks and heterogeneous nodes. Reshi approaches the problem as a regression task, where task-node pairs are modeled as feature vectors over the results of dedicated micro benchmarks and past task executions. Based on these features, Reshi trains a regression tree model to rank and recommend nodes for each ready-to-run task, which can be used as input to a scheduler. For our evaluation, we benchmarked 27 AWS machine types using three representative workflows. We compare Reshi's recommendations with three state-of-the-art schedulers. Our evaluation shows that Reshi outperforms HEFT by a mean makespan reduction of 7.18% and 18.01% assuming a mean task runtime prediction error of 15%.

Submitted 17 October, 2022; v1 submitted 16 August, 2022; originally announced August 2022.

Comments: Paper accepted in 41st IEEE International Performance Computing and Communications Conference (IPCCC 2022)

arXiv:2207.09298 [pdf, other]

Magpie: Automatically Tuning Static Parameters for Distributed File Systems using Deep Reinforcement Learning

Authors: Houkun Zhu, Dominik Scheinert, Lauritz Thamsen, Kordian Gontarska, Odej Kao

Abstract: Distributed file systems are widely used nowadays, yet using their default configurations is often not optimal. At the same time, tuning configuration parameters is typically challenging and time-consuming. It demands expertise and tuning operations can also be expensive. This is especially the case for static parameters, where changes take effect only after a restart of the system or workloads. We propose a novel approach, Magpie, which utilizes deep reinforcement learning to tune static parameters by strategically exploring and exploiting configuration parameter spaces. To boost the tuning of the static parameters, our method employs both server and client metrics of distributed file systems to understand the relationship between static parameters and performance. Our empirical evaluation results show that Magpie can noticeably improve the performance of the distributed file system Lustre, where our approach on average achieves 91.8% throughput gains against default configuration after tuning towards single performance indicator optimization, while it reaches 39.7% more throughput gains against the baseline.

Submitted 22 July, 2022; v1 submitted 19 July, 2022; originally announced July 2022.

Comments: Accepted at The IEEE International Conference on Cloud Engineering (IC2E) conference 2022

arXiv:2206.13852 [pdf, other]

doi 10.1109/IC2E55432.2022.00014

Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

Authors: Jonathan Will, Lauritz Thamsen, Jonathan Bader, Dominik Scheinert, Odej Kao

Abstract: Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to low resource utilization -- is often challenging, even for expert users such as data engineers. Further, existing automated approaches to resource selection rely on the assumption that a job is recurring to learn from previous runs or to warrant the cost of full test runs to learn from. However, this assumption often does not hold since many jobs are too unique. Therefore, we present Crispy, a method for optimizing data processing cluster configurations based on job profiling runs with small samples of the dataset on just a single machine. Crispy attempts to extrapolate the memory usage for the full dataset to then choose a cluster configuration with enough total memory. In our evaluation on a dataset with 1031 Spark and Hadoop jobs, we see a reduction of job execution costs by 56% compared to the baseline, while on average spending less than ten minutes on profiling runs per job on a consumer-grade laptop.

Submitted 10 January, 2023; v1 submitted 28 June, 2022; originally announced June 2022.

Comments: 9 pages, 3 figures, 2 tables, IEEE IC2E 2022

ACM Class: C.2.4; I.2.8; I.2.6

Journal ref: 2022 IEEE International Conference on Cloud Engineering (IC2E), pp. 58-66

arXiv:2206.09679 [pdf, other]

Phoebe: QoS-Aware Distributed Stream Processing through Anticipating Dynamic Workloads

Authors: Morgan K. Geldenhuys, Dominik Scheinert, Odej Kao, Lauritz Thamsen

Abstract: Distributed Stream Processing systems have become an essential part of big data processing platforms. They are characterized by the high-throughput processing of near to real-time event streams with the goal of delivering low-latency results and thus enabling time-sensitive decision making. At the same time, results are expected to be consistent even in the presence of partial failures where exactly-once processing guarantees are required for correctness. Stream processing workloads are oftentimes dynamic in nature which makes static configurations highly inefficient as time goes by. Static resource allocations will almost certainly either negatively impact upon the Quality of Service and/or result in higher operational costs. In this paper we present Phoebe, a proactive approach to system auto-tuning for Distributed Stream Processing jobs executing on dynamic workloads. Our approach makes use of parallel profiling runs, QoS modeling, and runtime optimization to provide a general solution whereby configuration parameters are automatically tuned to ensure a stable service as well as alignment with recovery time Quality of Service targets. Phoebe makes use of Time Series Forecasting to gain an insight into future workload requirements thereby delivering scaling decisions which are accurate, long-lived, and reliable. Our experiments demonstrate that Phoebe is able to deliver a stable service while at the same time reducing resource over-provisioning.

Submitted 20 June, 2022; originally announced June 2022.

Comments: 10 pages, ICWS2022

arXiv:2206.00429 [pdf, other]

doi 10.1007/s13222-022-00416-z

Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

Authors: Lauritz Thamsen, Dominik Scheinert, Jonathan Will, Jonathan Bader, Odej Kao

Abstract: Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires significant insights into expected job runtimes and scaling behavior, resource characteristics, input data distributions, and other factors. Unable to estimate performance accurately, users frequently overprovision resources for their jobs, leading to low resource utilization and high costs. In this paper, we present major building blocks towards a collaborative approach for optimization of data processing cluster configurations based on runtime data and performance models. We believe that runtime data can be shared and used for performance models across different execution contexts, significantly reducing the reliance on the recurrence of individual processing jobs or, else, dedicated job profiling. For this, we describe how the similarity of processing jobs and cluster infrastructures can be employed to combine suitable data points from local and global job executions into accurate performance models. Furthermore, we outline approaches to performance prediction via more context-aware and reusable models. Finally, we lay out how metrics from previous executions can be combined with runtime monitoring to effectively re-configure models and clusters dynamically.

Submitted 1 June, 2022; originally announced June 2022.

arXiv:2205.11181 [pdf, other]

doi 10.1145/3538712.3538739

Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters

Authors: Jonathan Bader, Fabian Lehmann, Lauritz Thamsen, Jonathan Will, Ulf Leser, Odej Kao

Abstract: Many scientific workflow scheduling algorithms need to be informed about task runtimes a-priori to conduct efficient scheduling. In heterogeneous cluster infrastructures, this problem becomes aggravated because these runtimes are required for each task-node pair. Using historical data is often not feasible as logs are typically not retained indefinitely and workloads as well as infrastructure change. In contrast, online methods, which predict task runtimes on specific nodes while the workflow is running, have to cope with the lack of example runs, especially during the start-up. In this paper, we present Lotaru, a novel online method for locally estimating task runtimes in scientific workflows on heterogeneous clusters. Lotaru first profiles all nodes of a cluster with a set of short-running and uniform microbenchmarks. Next, it runs the workflow to be scheduled on the user's local machine with drastically reduced data to determine important task characteristics. Based on these measurements, Lotaru learns a Bayesian linear regression model to predict a task's runtime given the input size and finally adjusts the predicted runtime specifically for each task-node pair in the cluster based on the micro-benchmark results. Due to its Bayesian approach, Lotaru can also compute robust uncertainty estimates and provides them as an input for advanced scheduling methods. Our evaluation with five real-world scientific workflows and different datasets shows that Lotaru significantly outperforms the baselines in terms of prediction errors for homogeneous and heterogeneous clusters.

Submitted 23 May, 2022; originally announced May 2022.

Comments: paper accepted in 34th International Conference on Scientific and Statistical Database Management (SSDBM 2022)

arXiv:2205.02895 [pdf, other]

doi 10.1007/978-3-031-12597-3_14

Cucumber: Renewable-Aware Admission Control for Delay-Tolerant Cloud and Edge Workloads

Authors: Philipp Wiesner, Dominik Scheinert, Thorsten Wittkopp, Lauritz Thamsen, Odej Kao

Abstract: The growing electricity demand of cloud and edge computing increases operational costs and will soon have a considerable impact on the environment. A possible countermeasure is equipping IT infrastructure directly with on-site renewable energy sources. Yet, particularly smaller data centers may not be able to use all generated power directly at all times, while feeding it into the public grid or energy storage is often not an option. To maximize the usage of renewable excess energy, we propose Cucumber, an admission control policy that accepts delay-tolerant workloads only if they can be computed within their deadlines without the use of grid energy. Using probabilistic forecasting of computational load, energy consumption, and energy production, Cucumber can be configured towards more optimistic or conservative admission. We evaluate our approach on two scenarios using real solar production forecasts for Berlin, Mexico City, and Cape Town in a simulation environment. For scenarios where excess energy was actually available, our results show that Cucumber's default configuration achieves acceptance rates close to the optimal case and causes 97.0% of accepted workloads to be powered using excess energy, while more conservative admission results in 18.5% reduced acceptance at almost zero grid power usage.

Submitted 27 August, 2022; v1 submitted 5 May, 2022; originally announced May 2022.

Comments: Accepted at Euro-Par 2022. GitHub repository: https://github.com/dos-group/cucumber

arXiv:2204.08846 [pdf, other]

Differentiating Network Flows for Priority-Aware Scheduling of Incoming Packets in Real-Time IoT Systems

Authors: Christoph Blumschein, Ilja Behnke, Lauritz Thamsen, Odej Kao

Abstract: When IP-packet processing is unconditionally carried out on behalf of an operating system kernel thread, processing systems can experience overload in high incoming traffic scenarios. This is especially worrying for embedded real-time devices controlling their physical environment in industrial IoT scenarios and automotive systems. We propose an embedded real-time aware IP stack adaption with an early demultiplexing scheme for incoming packets and subsequent per-flow aperiodic scheduling. By instrumenting existing embedded IP stacks, rigid prioritization with minimal latency is deployed without the need of further task resources. Simple mitigation techniques can be applied to individual flows, causing hardly measurable overhead while at the same time protecting the system from overload conditions. Our IP stack adaption is able to reduce the low-priority packet processing time by over 86% compared to an unmodified stack. The network subsystem can thereby remain active at a 7x higher general traffic load before disabling the receive IRQ as a last resort to assure deadlines.

Submitted 19 April, 2022; originally announced April 2022.

Comments: 25th International Symposium on Real-Time Distributed Computing

arXiv:2203.14801 [pdf, other]

doi 10.1145/3517206.3526275

SyncMesh: Improving Data Locality for Function-as-a-Service in Meshed Edge Networks

Authors: Daniel Habenicht, Kevin Kreutz, Soeren Becker, Jonathan Bader, Lauritz Thamsen, Odej Kao

Abstract: The increasing use of Internet of Things devices coincides with more communication and data movement in networks, which can exceed existing network capabilities. These devices often process sensor or user information, where data privacy and latency are a major concern. Therefore, traditional approaches like cloud computing do not fit well, yet new architectures such as edge computing address this gap. In addition, the Function-as-a-Service (FaaS) paradigm gains in prevalence as a workload execution platform, however the decoupling of storage results in further challenges for highly distributed edge environments. To address this, we propose SyncMesh, a system to manage, query, and transform data in a scalable and stateless manner by leveraging the capabilities of Function-as-a-Service and at the same time enabling data locality. Furthermore, we provide a prototypical implementation and evaluate it against established centralized and decentralized systems in regard to traffic usage and request times. The preliminary results indicate that SyncMesh is able to exonerate the network layer and accelerate the transmission of data to clients, while simultaneously improving local data processing.

Submitted 28 March, 2022; originally announced March 2022.

arXiv:2201.00594 [pdf, other]

doi 10.1145/3477314.3507165

A Priority-Aware Multiqueue NIC Design

Authors: Ilja Behnke, Philipp Wiesner, Robert Danicki, Lauritz Thamsen

Abstract: Low-level embedded systems are used to control cyber-physical systems in industrial and autonomous applications. They need to meet hard real-time requirements as unanticipated controller delays on moving machines can have devastating effects. Modern developments such as the industrial Internet of Things and autonomous machines require these devices to connect to large IP networks. Since Network Interface Controllers (NICs) trigger interrupts for incoming packets, real-time embedded systems are subject to unpredictable preemptions when connected to such networks. In this work, we propose a priority-aware NIC design that moderates network-generated interrupts by mapping IP flows to processes and, based on that, consolidating their packets into different queues. These queues apply priority-dependent interrupt moderation. First experimental evaluations show that 93% of interrupts can be saved, leading to an 80% decrease of processing delay of critical tasks in the configurations investigated.

Submitted 3 January, 2022; originally announced January 2022.

Comments: The 37th ACM/SIGAPP Symposium on Applied Computing (SAC '22)

ACM Class: C.2.4; B.4.1; D.4.4

arXiv:2112.09580 [pdf, ps, other]

Continuously Testing Distributed IoT Systems: An Overview of the State of the Art

Authors: Jossekin Beilharz, Philipp Wiesner, Arne Boockmeyer, Lukas Pirl, Dirk Friedenberger, Florian Brokhausen, Ilja Behnke, Andreas Polze, Lauritz Thamsen

Abstract: The continuous testing of small changes to systems has proven to be useful and is widely adopted in the development of software systems. For this, software is tested in environments that are as close as possible to the production environments. When testing IoT systems, this approach is met with unique challenges that stem from the typically large scale of the deployments, heterogeneity of nodes, challenging network characteristics, and tight integration with the environment among others. IoT test environments present a possible solution to these challenges by emulating the nodes, networks, and possibly domain environments in which IoT applications can be executed. This paper gives an overview of the state of the art in IoT testing. We derive desirable characteristics of IoT test environments, compare 18 tools that can be used in this respect, and give a research outlook of future trends in this area.

Submitted 17 December, 2021; originally announced December 2021.

arXiv:2111.08759 [pdf, other]

doi 10.1109/BigData52589.2021.9671275

On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds

Authors: Dominik Scheinert, Alireza Alamgiralem, Jonathan Bader, Jonathan Will, Thorsten Wittkopp, Lauritz Thamsen

Abstract: With the growing amount of data, data processing workloads and the management of their resource usage become increasingly important. Since managing a dedicated infrastructure is in many situations infeasible or uneconomical, users progressively execute their respective workloads in the cloud. As the configuration of workloads and resources is often challenging, various methods have been proposed that either quickly profile towards a good configuration or determine one based on data from previous runs. Still, performance data to train such methods is often lacking and costly to collect. In this paper, we propose a collaborative approach for sharing anonymized workload execution traces among users, mining them for general patterns, and exploiting clusters of historical workloads for future optimizations. We evaluate our prototype implementation for mining workload execution graphs on a publicly available trace dataset and demonstrate the predictive value of workload clusters determined through traces only.

Submitted 16 January, 2022; v1 submitted 16 November, 2021; originally announced November 2021.

Comments: 6 pages, 5 figures, 1 table

Journal ref: IEEE BigData (2021) 3113-3118

arXiv:2111.07904 [pdf, other]

doi 10.1109/BigData52589.2021.9671742

Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud

Authors: Jonathan Will, Onur Arslan, Jonathan Bader, Dominik Scheinert, Lauritz Thamsen

Abstract: Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing suitable cluster resources for distributed dataflow jobs in both type and number is difficult, especially for users who do not have access to previous performance metrics. One approach to overcoming this issue is to have users share runtime metrics to train context-aware performance models that help find a suitable configuration for the job at hand. A problem when sharing runtime data instead of trained models or model parameters is that the data size can grow substantially over time. This paper examines several clustering techniques to minimize training data size while keeping the associated performance models accurate. Our results indicate that efficiency gains in data transfer, storage, and model training can be achieved through training data reduction. In the evaluation of our solution on a dataset of runtime data from 930 unique distributed dataflow jobs, we observed that, on average, a 75% data reduction only increases prediction errors by one percentage point.

Submitted 11 March, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

Comments: 6 pages, 5 figures, Accepted for the BPOD Workshop at IEEE Big Data 2021

ACM Class: C.2.4; I.2.8; I.2.6

Journal ref: IEEE Big Data (2021) 3141-3146

arXiv:2111.05167 [pdf, other]

doi 10.1109/BigData52589.2021.9671519

Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters

Authors: Jonathan Bader, Lauritz Thamsen, Svetlana Kulagina, Jonathan Will, Henning Meyerhenke, Odej Kao

Abstract: Scientific workflow management systems like Nextflow support large-scale data analysis by abstracting away the details of scientific workflows. In these systems, workflows consist of several abstract tasks, of which instances are run in parallel and transform input partitions into output partitions. Resource managers like Kubernetes execute such workflow tasks on cluster infrastructures. However, these resource managers only consider the number of CPUs and the amount of available memory when assigning tasks to resources; they do not consider hardware differences beyond these numbers, while computational speed and memory access rates can differ significantly. We propose Tarema, a system for allocating task instances to heterogeneous cluster resources during the execution of scalable scientific workflows. First, Tarema profiles the available infrastructure with a set of benchmark programs and groups cluster nodes with similar performance. Second, Tarema uses online monitoring data of tasks, assigning labels to tasks depending on their resource usage. Third, Tarema uses the node groups and task labels to dynamically assign task instances evenly to resources based on resource demand. Our evaluation of a prototype implementation for Kubernetes, using five real-world Nextflow workflows from the popular nf-core framework and two 15-node clusters consisting of different virtual machines, shows a mean reduction of isolated job runtimes by 19.8% compared to popular schedulers in widely-used resource managers and 4.54% compared to the heuristic SJFN, while providing a better cluster usage. Moreover, executing two long-running workflows in parallel and on restricted resources shows that Tarema is able to reduce the runtimes even more while providing a fair cluster usage.

Submitted 19 January, 2022; v1 submitted 9 November, 2021; originally announced November 2021.

Journal ref: IEEE Big Data (2021), 65-75

arXiv:2110.13234 [pdf, other]

doi 10.1145/3464298.3493399

Let's Wait Awhile: How Temporal Workload Shifting Can Reduce Carbon Emissions in the Cloud

Authors: Philipp Wiesner, Ilja Behnke, Dominik Scheinert, Kordian Gontarska, Lauritz Thamsen

Abstract: Depending on energy sources and demand, the carbon intensity of the public power grid fluctuates over time. Exploiting this variability is an important factor in reducing the emissions caused by data centers. However, regional differences in the availability of low-carbon energy sources make it hard to provide general best practices for when to consume electricity. Moreover, existing research in this domain focuses mostly on carbon-aware workload migration across geo-distributed data centers, or addresses demand response purely from the perspective of power grid stability and costs. In this paper, we examine the potential impact of shifting computational workloads towards times where the energy supply is expected to be less carbon-intensive. To this end, we identify characteristics of delay-tolerant workloads and analyze the potential for temporal workload shifting in Germany, Great Britain, France, and California over the year 2020. Furthermore, we experimentally evaluate two workload shifting scenarios in a simulation to investigate the influence of time constraints, scheduling strategies, and the accuracy of carbon intensity forecasts. To accelerate research in the domain of carbon-aware computing and to support the evaluation of novel scheduling algorithms, our simulation framework and datasets are publicly available.

Submitted 25 October, 2021; originally announced October 2021.

Comments: To be published in the proceedings of the 22nd International Middleware Conference (Middleware '21), December 6-10, 2021, Virtual Event, Canada

arXiv:2109.13009 [pdf, other]

LOS: Local-Optimistic Scheduling of Periodic Model Training For Anomaly Detection on Sensor Data Streams in Meshed Edge Networks

Authors: Soeren Becker, Florian Schmidt, Lauritz Thamsen, Ana Juan Ferrer, Odej Kao

Abstract: Anomaly detection is increasingly important to handle the amount of sensor data in Edge and Fog environments, Smart Cities, as well as in Industry 4.0. To ensure good results, the utilized ML models need to be updated periodically to adapt to seasonal changes and concept drifts in the sensor data. Although the increasing resource availability at the edge can allow for in-situ execution of model training directly on the devices, it is still often offloaded to fog devices or the cloud. In this paper, we propose Local-Optimistic Scheduling (LOS), a method for executing periodic ML model training jobs in close proximity to the data sources, without overloading lightweight edge devices. Training jobs are offloaded to nearby neighbor nodes as necessary and the resource consumption is optimized to meet the training period while still ensuring enough resources for further training executions. This scheduling is accomplished in a decentralized, collaborative and opportunistic manner, without full knowledge of the infrastructure and workload. We evaluated our method in an edge computing testbed on real-world datasets. The experimental results show that LOS places the training executions close to the input sensor streams, decreases the deviation between training time and training period by up to 40% and increases the amount of successfully scheduled training jobs compared to an in-situ execution.

Submitted 27 September, 2021; originally announced September 2021.

Comments: 2nd IEEE International Conference on Autonomic Computing and Self-Organizing Systems - ACSOS 2021

arXiv:2109.02340 [pdf, other]

Khaos: Dynamically Optimizing Checkpointing for Dependable Distributed Stream Processing

Authors: Morgan K. Geldenhuys, Benjamin J. J. Pfister, Dominik Scheinert, Lauritz Thamsen, Odej Kao

Abstract: Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. As such, making timely decisions based on these results is dependent on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing checkpoint and rollback recovery. However, owing to the statistical probability of partial failures occurring in these distributed environments and the variability of workloads upon which jobs are expected to operate, static configurations will often not meet Quality of Service constraints with low overhead. In this paper we present Khaos, a new approach which utilizes the parallel processing capabilities of virtual cloud automation technologies for the automatic runtime optimization of fault tolerance configurations in Distributed Stream Processing jobs. Our approach employs three subsequent phases which borrows from the principles of Chaos Engineering: establish the steady-state processing conditions, conduct experiments to better understand how the system performs under failure, and use this knowledge to continuously minimize Quality of Service violations. We implemented Khaos prototypically together with Apache Flink and demonstrate its usefulness experimentally.

Submitted 26 January, 2023; v1 submitted 6 September, 2021; originally announced September 2021.

arXiv:2109.00294 [pdf, other]

GRAL: Localization of Floating Wireless Sensors in Pipe Networks

Authors: Martin Haug, Felix Lorenz, Lauritz Thamsen

Abstract: Mobile wireless sensors are increasingly recognized as a valuable tool for monitoring critical infrastructures. An important use case is the discovery of leaks and inflows in pipe networks using a swarm of floating sensor nodes. While passively drifting along, the devices must track their individual positions so critical points can later be located. Since pipelines are often situated in inaccessible places, large portions of the network can be shielded from radio and satellite signals, rendering conventional positioning systems ineffective. In this paper, we propose a novel algorithm for assigning location estimates to recorded measurements once the sensor node leaves the inaccessible area and transmits them via a gateway. The solution is range-free and makes use of a priori information about the target pipeline network. We further describe two extended variants of our algorithm which use data of encounters with other sensor nodes to improve accuracy. Finally, we evaluate all variants with respect to various network topologies and different numbers of mobile nodes in a simulation. The results show that our algorithm localizes measurements with an average accuracy between 4.81% and 7.58%, depending on the variability of flow speed and the sparsity of reference points.

Submitted 1 September, 2021; originally announced September 2021.

Comments: to be presented at the 1st International Workshop on Testing Distributed Internet of Things Systems; associated implementation code can be found at https://github.com/reknih/gral/

arXiv:2108.13222 [pdf, other]

doi 10.1002/spe.3058

AuctionWhisk: Using an Auction-Inspired Approach for Function Placement in Serverless Fog Platforms

Authors: David Bermbach, Jonathan Bader, Jonathan Hasenburg, Tobias Pfandzelter, Lauritz Thamsen

Abstract: The Function-as-a-Service (FaaS) paradigm has a lot of potential as a computing model for fog environments comprising both cloud and edge nodes, as compute requests can be scheduled across the entire fog continuum in a fine-grained manner. When the request rate exceeds capacity limits at the resource-constrained edge, some functions need to be offloaded towards the cloud. In this paper, we present an auction-inspired approach in which application developers bid on resources while fog nodes decide locally which functions to execute and which to offload in order to maximize revenue. Unlike many current approaches to function placement in the fog, our approach can work in an online and decentralized manner. We also present our proof-of-concept prototype AuctionWhisk that illustrates how such an approach can be implemented in a real FaaS platform. Through a number of simulation runs and system experiments, we show that revenue for overloaded nodes can be maximized without dropping function requests.

Submitted 23 November, 2021; v1 submitted 30 August, 2021; originally announced August 2021.

Comments: Wiley - Software: Practice and Experience

arXiv:2108.12211 [pdf, other]

doi 10.1109/IPCCC51483.2021.9679361

Enel: Context-Aware Dynamic Scaling of Distributed Dataflow Jobs using Graph Propagation

Authors: Dominik Scheinert, Houkun Zhu, Lauritz Thamsen, Morgan K. Geldenhuys, Jonathan Will, Alexander Acker, Odej Kao

Abstract: Distributed dataflow systems like Spark and Flink enable the use of clusters for scalable data analytics. While runtime prediction models can be used to initially select appropriate cluster resources given target runtimes, the actual runtime performance of dataflow jobs depends on several factors and varies over time. Yet, in many situations, dynamic scaling can be used to meet formulated runtime targets despite significant performance variance. This paper presents Enel, a novel dynamic scaling approach that uses message propagation on an attributed graph to model dataflow jobs and, thus, allows for deriving effective rescaling decisions. For this, Enel incorporates descriptive properties that capture the respective execution context, considers statistics from individual dataflow tasks, and propagates predictions through the job graph to eventually find an optimized new scale-out. Our evaluation of Enel with four iterative Spark jobs shows that our approach is able to identify effective rescaling actions, reacting for instance to node failures, and can be reused across different execution contexts.

Submitted 26 January, 2022; v1 submitted 27 August, 2021; originally announced August 2021.

Comments: 8 pages, 5 figures, 3 tables

Journal ref: IEEE IPCCC (2021) 1-8

arXiv:2108.10721 [pdf, other]

Dependable IoT Data Stream Processing for Monitoring and Control of Urban Infrastructures

Authors: Morgan K. Geldenhuys, Jonathan Will, Benjamin J. J. Pfister, Martin Haug, Alexander Scharmann, Lauritz Thamsen

Abstract: The Internet of Things describes a network of physical devices interacting and producing vast streams of sensor data. At present there are a number of general challenges which exist while developing solutions for use cases involving the monitoring and control of urban infrastructures. These include the need for a dependable method for extracting value from these high volume streams of time sensitive data which is adaptive to changing workloads. Low-latency access to the current state for live monitoring is a necessity as well as the ability to perform queries on historical data. At the same time, many design choices need to be made and the number of possible technology options available further adds to the complexity. In this paper we present a dependable IoT data processing platform for the monitoring and control of urban infrastructures. We define requirements in terms of dependability and then select a number of mature open-source technologies to match these requirements. We examine the disparate parts necessary for delivering a holistic overall architecture and describe the dataflows between each of these components. We likewise present generalizable methods for the enrichment and analysis of sensor data applicable across various application areas. We demonstrate the usefulness of this approach by providing an exemplary prototype platform executing on top of Kubernetes and evaluate the effectiveness of jobs processing sensor data in this environment.

Submitted 24 August, 2021; originally announced August 2021.

arXiv:2108.08685 [pdf, other]

On the Future of Cloud Engineering

Authors: David Bermbach, Abhishek Chandra, Chandra Krintz, Aniruddha Gokhale, Aleksander Slominski, Lauritz Thamsen, Everton Cavalcante, Tian Guo, Ivona Brandic, Rich Wolski

Abstract: Ever since the commercial offerings of the Cloud started appearing in 2006, the landscape of cloud computing has been undergoing remarkable changes with the emergence of many different types of service offerings, developer productivity enhancement tools, and new application classes as well as the manifestation of cloud functionality closer to the user at the edge. The notion of utility computing, however, has remained constant throughout its evolution, which means that cloud users always seek to save costs of leasing cloud resources while maximizing their use. On the other hand, cloud providers try to maximize their profits while assuring service-level objectives of the cloud-hosted applications and keeping operational costs low. All these outcomes require systematic and sound cloud engineering principles. The aim of this paper is to highlight the importance of cloud engineering, survey the landscape of best practices in cloud engineering and its evolution, discuss many of the existing cloud engineering advances, and identify both the inherent technical challenges and research opportunities for the future of cloud computing in general and cloud engineering in particular.

Submitted 19 August, 2021; originally announced August 2021.

Comments: author copy/preprint of a paper published in the IEEE International Conference on Cloud Engineering (IC2E 2021)

arXiv:2108.04749 [pdf, other]

Evaluation of Load Prediction Techniques for Distributed Stream Processing

Authors: Kordian Gontarska, Morgan Geldenhuys, Dominik Scheinert, Philipp Wiesner, Andreas Polze, Lauritz Thamsen

Abstract: Distributed Stream Processing (DSP) systems enable processing large streams of continuous data to produce results in near to real time. They are an essential part of many data-intensive applications and analytics platforms. The rate at which events arrive at DSP systems can vary considerably over time, which may be due to trends, cyclic, and seasonal patterns within the data streams. A priori knowledge of incoming workloads enables proactive approaches to resource management and optimization tasks such as dynamic scaling, live migration of resources, and the tuning of configuration parameters during run-times, thus leading to a potentially better Quality of Service. In this paper we conduct a comprehensive evaluation of different load prediction techniques for DSP jobs. We identify three use-cases and formulate requirements for making load predictions specific to DSP jobs. Automatically optimized classical and Deep Learning methods are being evaluated on nine different datasets from typical DSP domains, i.e. the IoT, Web 2.0, and cluster monitoring. We compare model performance with respect to overall accuracy and training duration. Our results show that the Deep Learning methods provide the most accurate load predictions for the majority of the evaluated datasets.

Submitted 10 August, 2021; originally announced August 2021.

arXiv:2107.13921 [pdf, other]

doi 10.1109/Cluster48925.2021.00052

Bellamy: Reusing Performance Models for Distributed Dataflow Jobs Across Contexts

Authors: Dominik Scheinert, Lauritz Thamsen, Houkun Zhu, Jonathan Will, Alexander Acker, Thorsten Wittkopp, Odej Kao

Abstract: Distributed dataflow systems enable the use of clusters for scalable data analytics. However, selecting appropriate cluster resources for a processing job is often not straightforward. Performance models trained on historical executions of a concrete job are helpful in such situations, yet they are usually bound to a specific job execution context (e.g. node type, software versions, job parameters) due to the few considered input parameters. Even in case of slight context changes, such supportive models need to be retrained and cannot benefit from historical execution data from related contexts. This paper presents Bellamy, a novel modeling approach that combines scale-outs, dataset sizes, and runtimes with additional descriptive properties of a dataflow job. It is thereby able to capture the context of a job execution. Moreover, Bellamy is realizing a two-step modeling approach. First, a general model is trained on all the available data for a specific scalable analytics algorithm, hereby incorporating data from different contexts. Subsequently, the general model is optimized for the specific situation at hand, based on the available data for the concrete context. We evaluate our approach on two publicly available datasets consisting of execution data from various dataflow jobs carried out in different environments, showing that Bellamy outperforms state-of-the-art methods.

Submitted 17 October, 2021; v1 submitted 29 July, 2021; originally announced July 2021.

Comments: 10 pages, 8 figures, 2 tables

Journal ref: IEEE CLUSTER (2021) 261-270

arXiv:2107.13317 [pdf, other]

doi 10.1109/IC2E52221.2021.00018

C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds

Authors: Jonathan Will, Lauritz Thamsen, Dominik Scheinert, Jonathan Bader, Odej Kao

Abstract: Distributed dataflow systems enable data-parallel processing of large datasets on clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. Yet, selecting appropriate cloud resources for dataflow jobs - that neither lead to bottlenecks nor to low resource utilization - is often challenging, even for expert users such as data engineers. We present C3O, a collaborative system for optimizing data processing cluster configurations in public clouds based on shared historical runtime data. The shared data is utilized for predicting the runtimes of data processing jobs on different possible cluster configurations, using specialized regression models. These models take the diverse execution contexts of different users into account and exhibit mean absolute errors below 3% in our experimental evaluation with 930 unique Spark jobs.

Submitted 1 December, 2021; v1 submitted 28 July, 2021; originally announced July 2021.

Comments: 10 pages, 5 figures, IEEE IC2E 2021. arXiv admin note: text overlap with arXiv:2011.07965

ACM Class: C.2.4; I.2.8; I.2.6

Journal ref: IEEE IC2E (2021) 43-52

arXiv:2104.10085 [pdf, other]

Predicting Medical Interventions from Vital Parameters: Towards a Decision Support System for Remote Patient Monitoring

Authors: Kordian Gontarska, Weronika Wrazen, Jossekin Beilharz, Robert Schmid, Lauritz Thamsen, Andreas Polze

Abstract: Cardiovascular diseases and heart failures in particular are the main cause of non-communicable disease mortality in the world. Constant patient monitoring enables better medical treatment as it allows practitioners to react on time and provide the appropriate treatment. Telemedicine can provide constant remote monitoring so patients can stay in their homes, only requiring medical sensing equipment and network connections. A limiting factor for telemedical centers is the amount of patients that can be monitored simultaneously. We aim to increase this amount by implementing a decision support system. This paper investigates a machine learning model to estimate a risk score based on patient vital parameters that allows sorting all cases every day to help practitioners focus their limited capacities on the most severe cases. The model we propose reaches an AUCROC of 0.84, whereas the baseline rule-based model reaches an AUCROC of 0.73. Our results indicate that the usage of deep learning to improve the efficiency of telemedical centers is feasible. This way more patients could benefit from better health-care through remote monitoring.

Submitted 20 April, 2021; originally announced April 2021.

arXiv:2104.02393 [pdf, other]

doi 10.1145/3434770.3459733

Detecting and Mitigating Network Packet Overloads on Real-Time Devices in IoT Systems

Authors: Robert Danicki, Martin Haug, Ilja Behnke, Laurenz Mädje, Lauritz Thamsen

Abstract: Manufacturing, automotive, and aerospace environments use embedded systems for control and automation and need to fulfill strict real-time guarantees. To facilitate more efficient business processes and remote control, such devices are being connected to IP networks. Due to the difficulty in predicting network packets and the interrelated workloads of interrupt handlers and drivers, devices controlling time critical processes stand under the risk of missing process deadlines when under high network loads. Additionally, devices at the edge of large networks and the internet are subject to a high risk of load spikes and network packet overloads. In this paper, we investigate strategies to detect network packet overloads in real-time and present four approaches to adaptively mitigate local deadline misses. In addition to two strategies mitigating network bursts with and without hysteresis, we present and discuss two novel mitigation algorithms, called Budget and Queue Mitigation. In an experimental evaluation, all algorithms showed mitigating effects, with the Queue Mitigation strategy enabling most packet processing while preventing lateness of critical tasks.

Submitted 6 April, 2021; originally announced April 2021.

Comments: EdgeSys '21

arXiv:2103.06026 [pdf, other]

Towards a Cognitive Compute Continuum: An Architecture for Ad-Hoc Self-Managed Swarms

Authors: Ana Juan Ferrer, Soeren Becker, Florian Schmidt, Lauritz Thamsen, Odej Kao

Abstract: In this paper we introduce our vision of a Cognitive Computing Continuum to address the changing IT service provisioning towards a distributed, opportunistic, self-managed collaboration between heterogeneous devices outside the traditional data center boundaries. The focal point of this continuum are cognitive devices, which have to make decisions autonomously using their on-board computation and storage capacity based on information sensed from their environment. Such devices are moving and cannot rely on fixed infrastructure elements, but instead realise on-the-fly networking and thus frequently join and leave temporal swarms. All this creates novel demands for the underlying architecture and resource management, which must bridge the gap from edge to cloud environments, while keeping the QoS parameters within required boundaries. The paper presents an initial architecture and a resource management framework for the implementation of this type of IT service provisioning.

Submitted 10 March, 2021; originally announced March 2021.

Comments: 8 pages, CCGrid 2021 Cloud2Things Workshop

arXiv:2103.05245 [pdf, other]

doi 10.1109/CloudIntelligence52565.2021.00011

Learning Dependencies in Distributed Cloud Applications to Identify and Localize Anomalies

Authors: Dominik Scheinert, Alexander Acker, Lauritz Thamsen, Morgan K. Geldenhuys, Odej Kao

Abstract: Operation and maintenance of large distributed cloud applications can quickly become unmanageably complex, putting human operators under immense stress when problems occur. Utilizing machine learning for identification and localization of anomalies in such systems supports human experts and enables fast mitigation. However, due to the various inter-dependencies of system components, anomalies do not only affect their origin but propagate through the distributed system. Taking this into account, we present Arvalus and its variant D-Arvalus, a neural graph transformation method that models system components as nodes and their dependencies and placement as edges to improve the identification and localization of anomalies. Given a series of metric KPIs, our method predicts the most likely system state - either normal or an anomaly class - and performs localization when an anomaly is detected. During our experiments, we simulate a distributed cloud application deployment and synthetically inject anomalies. The evaluation shows the generally good prediction performance of Arvalus and reveals the advantage of D-Arvalus which incorporates information about system component dependencies.

Submitted 9 September, 2021; v1 submitted 9 March, 2021; originally announced March 2021.

Comments: 6 pages, 5 figures, 3 tables

Journal ref: IEEE/ACM CloudIntelligence (2021) 7-12

arXiv:2103.01170 [pdf, other]

LEAF: Simulating Large Energy-Aware Fog Computing Environments

Authors: Philipp Wiesner, Lauritz Thamsen

Abstract: Despite constant improvements in efficiency, today's data centers and networks consume enormous amounts of energy and this demand is expected to rise even further. An important research question is whether and how fog computing can curb this trend. As real-life deployments of fog infrastructure are still rare, a significant part of research relies on simulations. However, existing power models usually only target particular components such as compute nodes or battery-constrained edge devices. Combining analytical and discrete-event modeling, we develop a holistic but granular energy consumption model that can determine the power usage of compute nodes as well as network traffic and applications over time. Simulations can incorporate thousands of devices that execute complex application graphs on a distributed, heterogeneous, and resource-constrained infrastructure. We evaluated our publicly available prototype LEAF within a smart city traffic scenario, demonstrating that it enables research on energy-conserving fog computing architectures and can be used to assess dynamic task placement strategies and other energy-saving mechanisms.

Submitted 1 March, 2021; originally announced March 2021.

Comments: To appear in the Proceedings of the 5th IEEE International Conference on Fog and Edge Computing 2021

arXiv:2102.11623 [pdf, other]

doi 10.1145/3447545.3451189

PIERES: A Playground for Network Interrupt Experiments on Real-Time Embedded Systems in the IoT

Authors: Franz Bender, Jan Jonas Brune, Nick Lauritz Keutel, Ilja Behnke, Lauritz Thamsen

Abstract: IoT devices have become an integral part of our lives and the industry. Many of these devices run real-time systems or are used as part of them. As these devices receive network packets over IP networks, the network interface informs the CPU about their arrival using interrupts that might preempt critical processes. Therefore, the question arises whether network interrupts pose a threat to the real-timeness of these devices. However, there are few tools to investigate this issue. We present a playground which enables researchers to conduct experiments in the context of network interrupt simulation. The playground comprises different network interface controller implementations, load generators and timing utilities. It forms a flexible and easy to use foundation for future network interrupt research. We conduct two verification experiments and two real world examples. The latter give insight into the impact of the interrupt handling strategy parameters and the influence of different load types on the execution time with respect to these parameters.

Submitted 23 February, 2021; originally announced February 2021.

Comments: The Ninth International Workshop on Load Testing and Benchmarking of Software Systems (LTB 2021)

ACM Class: I.6.7; B.8.1

Journal ref: 2021 Companion of the ACM/SPEC International Conference on Performance Engineering (ICPE '21), 81-84

arXiv:2102.07199 [pdf, other]

doi 10.1007/978-3-030-48340-1_40

Hugo: A Cluster Scheduler that Efficiently Learns to Select Complementary Data-Parallel Jobs

Authors: Lauritz Thamsen, Ilya Verbitskiy, Sasho Nedelkoski, Vinh Thuy Tran, Vinicius Meyer, Miguel G. Xavier, Odej Kao, Cesar A. F. De Rose

Abstract: Distributed data processing systems like MapReduce, Spark, and Flink are popular tools for analysis of large datasets with cluster resources. Yet, users often overprovision resources for their data processing jobs, while the resource usage of these jobs also typically fluctuates considerably. Therefore, multiple jobs usually get scheduled onto the same shared resources to increase the resource utilization and throughput of clusters. However, job runtimes and the utilization of shared resources can vary significantly depending on the specific combinations of co-located jobs. This paper presents Hugo, a cluster scheduler that continuously learns how efficiently jobs share resources, considering metrics for the resource utilization and interference among co-located jobs. The scheduler combines offline grouping of jobs with online reinforcement learning to provide a scheduling mechanism that efficiently generalizes from specific monitored job combinations yet also adapts to changes in workloads. Our evaluation of a prototype shows that the approach can reduce the runtimes of exemplary Spark jobs on a YARN cluster by up to 12.5%, while resource utilization is increased and waiting times can be bounded.

Submitted 14 February, 2021; originally announced February 2021.

arXiv:2102.06170 [pdf, other]

Chiron: Optimizing Fault Tolerance in QoS-aware Distributed Stream Processing Jobs

Authors: Morgan Geldenhuys, Lauritz Thamsen, Odej Kao

Abstract: Fault tolerance is a property which needs deeper consideration when dealing with streaming jobs requiring high levels of availability and low-latency processing even in case of failures where Quality-of-Service constraints must be adhered to. Typically, systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing Checkpoint and Rollback Recovery. However, this is an expensive operation which impacts negatively on the overall performance of the system and manually optimizing fault tolerance for specific jobs is a difficult and time consuming task. In this paper we introduce Chiron, an approach for automatically optimizing the frequency with which checkpoints are performed in streaming jobs. For any chosen job, parallel profiling runs are performed, each containing a variant of the configurations, with the resulting metrics used to model the impact of checkpoint-based fault tolerance on performance and availability. Understanding these relationships is key to minimizing performance objectives and meeting strict Quality-of-Service constraints. We implemented Chiron prototypically together with Apache Flink and demonstrate its usefulness experimentally.

Submitted 11 February, 2021; originally announced February 2021.

arXiv:2102.06094 [pdf, other]

Effectively Testing System Configurations of Critical IoT Analytics Pipelines

Authors: Morgan Geldenhuys, Lauritz Thamsen, Kain Kordian Gontarska, Felix Lorenz, Odej Kao

Abstract: The emergence of the Internet of Things has seen the introduction of numerous connected devices used for the monitoring and control of even Critical Infrastructures. Distributed stream processing has become key to analyzing data generated by these connected devices and improving our ability to make decisions. However, optimizing these systems towards specific Quality of Service targets is a difficult and time-consuming task, due to the large-scale distributed systems involved, the existence of so many configuration parameters, and the inability to easily determine the impact of tuning these parameters. In this paper we present an approach for the effective testing of system configurations for critical IoT analytics pipelines. We demonstrate our approach with a prototype that we called Timon which is integrated with Kubernetes. This tool allows pipelines to be easily replicated in parallel and evaluated to determine the optimal configuration for specific applications. We demonstrate the usefulness of our approach by investigating different configurations of an exemplary geographically-based traffic monitoring application implemented in Apache Flink.

Submitted 25 February, 2021; v1 submitted 11 February, 2021; originally announced February 2021.

arXiv:2102.05559 [pdf, other]

doi 10.1109/IPCCC50635.2020.9391536

Interrupting Real-Time IoT Tasks: How Bad Can It Be to Connect Your Critical Embedded System to the Internet?

Authors: Ilja Behnke, Lukas Pirl, Lauritz Thamsen, Robert Danicki, Andreas Polze, Odej Kao

Abstract: Embedded systems have been used to control physical environments for decades. Usually, such use cases require low latencies between commands and actions as well as a high predictability of the expected worst-case delay. To achieve this on small, low-powered microcontrollers, Real-Time Operating Systems (RTOSs) are used to manage the different tasks on these machines as deterministically as possible. However, with the advent of the Internet of Things (IoT) in industrial applications, the same embedded systems are now equipped with networking capabilities, possibly endangering critical real-time systems through an open gate to interrupts. This paper presents our initial study of the impact network connections can have on real-time embedded systems. Specifically, we look at three aspects: The impact of network-generated interrupts, the overhead of the related networking tasks, and the feasibility of sharing computing resources between networking and real-time tasks. We conducted experiments on two setups: One treating NICs and drivers as black boxes and one simulating network interrupts on the machines. The preliminary results show that a critical task performance loss of up to 6.67% per received packet per second could be induced where lateness impacts of 1% per packet per second can be attributed exclusively to ISR-generated delays.

Submitted 13 April, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

Comments: IPCCC 2020: 39th International Performance Computing and Communications Conference

Journal ref: 39th International Performance Computing and Communications Conference (IPCCC), IEEE, 2020, pp. 1-6

Showing 1–50 of 54 results for author: Thamsen, L