
Showing 1–50 of 70 results for author: Kao, O

  1. arXiv:2406.05354

    cs.AR cs.AI cs.DC

    Investigating Memory Failure Prediction Across CPU Architectures

    Authors: Qiao Yu, Wengui Zhang, Min Zhou, Jialiang Yu, Zhenli Sheng, Jasmin Bogatinovski, Jorge Cardoso, Odej Kao

    Abstract: Large-scale datacenters often experience memory failures, where Uncorrectable Errors (UEs) highlight critical malfunction in Dual Inline Memory Modules (DIMMs). Existing approaches primarily utilize Correctable Errors (CEs) to predict UEs, yet they typically neglect how these errors vary between different CPU architectures, especially in terms of Error Correction Code (ECC) applicability. In this…

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: Accepted by 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Industry Track

  2. arXiv:2405.13599

    cs.LG

    LogRCA: Log-based Root Cause Analysis for Distributed Services

    Authors: Thorsten Wittkopp, Philipp Wiesner, Odej Kao

    Abstract: To assist IT service developers and operators in managing their increasingly complex service landscapes, there is a growing effort to leverage artificial intelligence in operations. To speed up troubleshooting, log anomaly detection has received much attention in particular, dealing with the identification of log events that indicate the reasons for a system failure. However, faults often propagat…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Accepted at Euro-Par 2024 as a full paper

  3. arXiv:2404.16446

    cs.DC

    On Software Ageing Indicators in OpenStack

    Authors: Yevhen Yazvinskyi, Jasmin Bogatinovski, Jorge Cardoso, Odej Kao

    Abstract: Distributed systems in general and cloud systems in particular, are susceptible to failures that can lead to substantial economic and data losses, security breaches, and even potential threats to human safety. Software ageing is an example of one such vulnerability. It emerges due to routine re-usage of computational systems units which induce fatigue within the components, resulting in an increas…

    Submitted 25 April, 2024; originally announced April 2024.

  4. arXiv:2403.02129

    cs.DC

    Demeter: Resource-Efficient Distributed Stream Processing under Dynamic Loads with Multi-Configuration Optimization

    Authors: Morgan Geldenhuys, Dominik Scheinert, Odej Kao, Lauritz Thamsen

    Abstract: Distributed Stream Processing (DSP) focuses on the near real-time processing of large streams of unbounded data. To increase processing capacities, DSP systems are able to dynamically scale across a cluster of commodity nodes, ensuring a good Quality of Service despite variable workloads. However, selecting scaleout configurations which maximize resource utilization remains a challenge. This is es…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: 12 pages, 14 figures, published at ICPE 2024

  5. arXiv:2403.02093

    cs.DC

    Daedalus: Self-Adaptive Horizontal Autoscaling for Resource Efficiency of Distributed Stream Processing Systems

    Authors: Benjamin J. J. Pfister, Dominik Scheinert, Morgan K. Geldenhuys, Odej Kao

    Abstract: Distributed Stream Processing (DSP) systems are capable of processing large streams of unbounded data, offering high throughput and low latencies. To maintain a stable Quality of Service (QoS), these systems require a sufficient allocation of resources. At the same time, over-provisioning can result in wasted energy and high operating costs. Therefore, to maximize resource utilization, autoscaling…

    Submitted 5 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

    Comments: 12 pages, 11 figures, 1 table

  6. arXiv:2312.14748

    cs.LG cs.SE

    Progressing from Anomaly Detection to Automated Log Labeling and Pioneering Root Cause Analysis

    Authors: Thorsten Wittkopp, Alexander Acker, Odej Kao

    Abstract: The realm of AIOps is transforming IT landscapes with the power of AI and ML. Despite the challenge of limited labeled data, supervised models show promise, emphasizing the importance of leveraging labels for training, especially in deep learning contexts. This study enhances the field by introducing a taxonomy for log anomalies and exploring automated data labeling to mitigate labeling challenges…

    Submitted 22 December, 2023; originally announced December 2023.

    Comments: accepted at AIOPS workshop @ICDM 2023

  7. arXiv:2312.02855

    cs.AR cs.AI cs.DC cs.LG

    Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study

    Authors: Qiao Yu, Wengui Zhang, Jorge Cardoso, Odej Kao

    Abstract: In large-scale datacenters, memory failure is a common cause of server crashes, with Uncorrectable Errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily focus on predicting UEs using Correctable Errors (CEs), without fully considering the information provided by error bits. However, error bit patterns have a strong correlation with the occu…

    Submitted 18 December, 2023; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: Published at ICCAD 2023

    Journal ref: 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), San Francisco, CA, USA, 2023, pp. 01-09

  8. arXiv:2311.08185

    cs.DC

    Predicting Dynamic Memory Requirements for Scientific Workflow Tasks

    Authors: Jonathan Bader, Nils Diedrich, Lauritz Thamsen, Odej Kao

    Abstract: With the increasing amount of data available to scientists in disciplines as diverse as bioinformatics, physics, and remote sensing, scientific workflow systems are becoming increasingly important for composing and executing scalable data analysis pipelines. When writing such workflows, users need to specify the resources to be reserved for tasks so that sufficient resources are allocated on the t…

    Submitted 19 March, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

    Comments: Paper accepted in 2023 IEEE International Conference on Big Data

  9. Offloading Real-Time Tasks in IIoT Environments under Consideration of Networking Uncertainties

    Authors: Ilja Behnke, Philipp Wiesner, Paul Voelker, Odej Kao

    Abstract: Offloading is a popular way to overcome the resource and power constraints of networked embedded devices, which are increasingly found in industrial environments. It involves moving resource-intensive computational tasks to a more powerful device on the network, often in close proximity to enable wireless communication. However, many Industrial Internet of Things (IIoT) applications have real-time…

    Submitted 31 October, 2023; originally announced October 2023.

    Comments: 2nd International Workshop on Middleware for the Edge (MiddleWEdge '23). 2023. ACM

    ACM Class: C.2.4; C.3

  10. arXiv:2310.18718

    cs.DC

    Carbon-Awareness in CI/CD

    Authors: Henrik Claßen, Jonas Thierfeldt, Julian Tochman-Szewc, Philipp Wiesner, Odej Kao

    Abstract: While the environmental impact of digitalization is becoming more and more evident, the climate crisis has become a major issue for society. For instance, data centers alone account for 2.7% of Europe's energy consumption today. A considerable part of this load is accounted for by cloud-based services for automated software development, such as continuous integration and delivery (CI/CD) workflows…

    Submitted 28 October, 2023; originally announced October 2023.

    Comments: 21st International Conference on Service-Oriented Computing (ICSOC '24) Workshops

  11. arXiv:2310.03848

    cs.CV cs.LG

    OpenIncrement: A Unified Framework for Open Set Recognition and Deep Class-Incremental Learning

    Authors: Jiawen Xu, Claas Grohnfeldt, Odej Kao

    Abstract: In most works on deep incremental learning research, it is assumed that novel samples are pre-identified for neural network retraining. However, practical deep classifiers often misidentify these samples, leading to erroneous predictions. Such misclassifications can degrade model performance. Techniques like open set recognition offer a means to detect these novel samples, representing a significa…

    Submitted 5 October, 2023; originally announced October 2023.

    Journal ref: 1st Workshop on Visual Continual Learning in conjunction with ICCV 2023

  12. Lotaru: Locally Predicting Workflow Task Runtimes for Resource Management on Heterogeneous Infrastructures

    Authors: Jonathan Bader, Fabian Lehmann, Lauritz Thamsen, Ulf Leser, Odej Kao

    Abstract: Many resource management techniques for task scheduling, energy and carbon efficiency, and cost optimization in workflows rely on a-priori task runtime knowledge. Building runtime prediction models on historical data is often not feasible in practice as workflows, their input data, and the cluster infrastructure change. Online methods, on the other hand, which estimate task runtimes on specific ma…

    Submitted 13 September, 2023; originally announced September 2023.

    Journal ref: Future Generation Computer Systems, Volume 150, January 2024, Pages 171-185

  13. Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics

    Authors: Dominik Scheinert, Philipp Wiesner, Thorsten Wittkopp, Lauritz Thamsen, Jonathan Will, Odej Kao

    Abstract: Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and energy usage, automated approaches are gaining popularity. Most existing methods rely on profiling recurring workloads to find near-optimal solutions over time. Due…

    Submitted 23 November, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

    Comments: 10 pages, 9 figures

    Journal ref: IEEE IPCCC (2023) 403-412

  14. arXiv:2308.08270

    cs.DC cs.PF

    Towards Benchmarking Power-Performance Characteristics of Federated Learning Clients

    Authors: Pratik Agrawal, Philipp Wiesner, Odej Kao

    Abstract: Federated Learning (FL) is a decentralized machine learning approach where local models are trained on distributed clients, allowing privacy-preserving collaboration by sharing model updates instead of raw data. However, the added communication overhead and increased training time caused by heterogeneous data distributions results in higher energy consumption and carbon emissions for achieving simi…

    Submitted 16 August, 2023; originally announced August 2023.

    Comments: Machine Learning and Networking Workshop, NetSys 2023

  15. Evaluation of Data Enrichment Methods for Distributed Stream Processing Systems

    Authors: Dominik Scheinert, Fabian Casares, Morgan K. Geldenhuys, Kevin Styp-Rekowski, Odej Kao

    Abstract: Stream processing has become a critical component in the architecture of modern applications. With the exponential growth of data generation from sources such as the Internet of Things, business intelligence, and telecommunications, real-time processing of unbounded data streams has become a necessity. DSP systems provide a solution to this challenge, offering high horizontal scalability, fault-to…

    Submitted 23 November, 2023; v1 submitted 26 July, 2023; originally announced July 2023.

    Comments: 10 pages, 13 figures, 2 tables

    Journal ref: IEEE IC2E (2023) 202-211

  16. arXiv:2306.09774

    cs.DC eess.SY

    Vessim: A Testbed for Carbon-Aware Applications and Systems

    Authors: Philipp Wiesner, Ilja Behnke, Paul Kilian, Marvin Steinke, Odej Kao

    Abstract: To reduce the carbon footprint of computing and stabilize electricity grids, there is an increasing focus on approaches that align the power usage of IT infrastructure with the availability of clean energy. Unfortunately, research on energy-aware and carbon-aware applications, as well as the interfaces between computing and energy systems, remains complex due to the scarcity of available testing e…

    Submitted 19 June, 2024; v1 submitted 16 June, 2023; originally announced June 2023.

    Comments: HotCarbon'24

  17. Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?

    Authors: Jonathan Will, Lauritz Thamsen, Dominik Scheinert, Odej Kao

    Abstract: Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial consideration. In this paper, we analyze the challenge of efficient resource allocation for distributed data processing, focusing on memory. We emphasize that in…

    Submitted 7 June, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

    Comments: 4 pages, 3 Figures; ACM SSDBM 2023

    ACM Class: C.2.4; C.4; I.2.8; H.2.8; H.2.4

  18. FedZero: Leveraging Renewable Excess Energy in Federated Learning

    Authors: Philipp Wiesner, Ramin Khalili, Dennis Grinwald, Pratik Agrawal, Lauritz Thamsen, Odej Kao

    Abstract: Federated Learning (FL) is an emerging machine learning technique that enables distributed model training across data silos or edge devices without data sharing. Yet, FL inevitably introduces inefficiencies compared to centralized model training, which will further increase the already high energy usage and associated carbon emissions of machine learning in the future. One idea to reduce FL's carb…

    Submitted 10 January, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted for publication at ACM e-Energy '24

  19. Towards a Real-Time IoT: Approaches for Incoming Packet Processing in Cyber-Physical Systems

    Authors: Ilja Behnke, Christoph Blumschein, Robert Danicki, Philipp Wiesner, Lauritz Thamsen, Odej Kao

    Abstract: Embedded real-time devices for monitoring, controlling, and collaboration purposes in cyber-physical systems are now commonly equipped with IP networking capabilities. However, the reception and processing of IP packets generates workloads in unpredictable frequencies as networks are outside of a developer's control and difficult to anticipate, especially when networks are connected to the interne…

    Submitted 3 May, 2023; originally announced May 2023.

    Comments: arXiv admin note: text overlap with arXiv:2204.08846

    Journal ref: Journal of Systems Architecture. 140 (2023)

  20. arXiv:2301.10681

    cs.LG

    PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning

    Authors: Thorsten Wittkopp, Dominik Scheinert, Philipp Wiesner, Alexander Acker, Odej Kao

    Abstract: Due to the complexity of modern IT services, failures can be manifold, occur at any stage, and are hard to detect. For this reason, anomaly detection applied to monitoring data such as logs allows gaining relevant insights to improve IT services steadily and eradicate failures. However, existing anomaly detection methods that provide high accuracy often rely on labeled training data, which are tim…

    Submitted 25 January, 2023; originally announced January 2023.

    Comments: published in the proceedings of the 56th Hawaii International Conference on System Sciences (HICSS 2023)

  21. arXiv:2212.10441

    cs.DC

    First CE Matters: On the Importance of Long Term Properties on Memory Failure Prediction

    Authors: Jasmin Bogatinovski, Qiao Yu, Jorge Cardoso, Odej Kao

    Abstract: Dynamic random access memory failures are a threat to the reliability of data centres as they lead to data loss and system crashes. Timely predictions of memory failures allow for taking preventive measures such as server migration and memory replacement. Thereby, memory failure prediction prevents failures from externalizing, and it is a vital task to improve system reliability. In this paper, we…

    Submitted 21 November, 2022; originally announced December 2022.

    Comments: This paper is accepted to appear in the proceedings of IEEE Big Data 2022. All publishing licenses belong to IEEE

  22. Probabilistic Time Series Forecasting for Adaptive Monitoring in Edge Computing Environments

    Authors: Dominik Scheinert, Babak Sistani Zadeh Aghdam, Soeren Becker, Odej Kao, Lauritz Thamsen

    Abstract: With increasingly more computation being shifted to the edge of the network, monitoring of critical infrastructures, such as intermediate processing nodes in autonomous driving, is further complicated due to the typically resource-constrained environments. In order to reduce the resource overhead on the network link imposed by monitoring, various methods have been discussed that either follow a fi…

    Submitted 30 January, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

    Comments: 6 pages, 5 figures, 2 tables

    Journal ref: IEEE BigData (2022) 4583-4588

  23. Towards Advanced Monitoring for Scientific Workflows

    Authors: Jonathan Bader, Joel Witzke, Soeren Becker, Ansgar Lößer, Fabian Lehmann, Leon Doehler, Anh Duc Vu, Odej Kao

    Abstract: Scientific workflows consist of thousands of highly parallelized tasks executed in a distributed environment involving many components. Automatic tracing and investigation of the components' and tasks' performance metrics, traces, and behavior are necessary to support the end user with a level of abstraction since the large amount of data cannot be analyzed manually. The execution and monitoring o…

    Submitted 18 July, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: Paper accepted in 2022 IEEE International Conference on Big Data Workshop SCDM 2022

  24. Leveraging Reinforcement Learning for Task Resource Allocation in Scientific Workflows

    Authors: Jonathan Bader, Nicolas Zunker, Soeren Becker, Odej Kao

    Abstract: Scientific workflows are designed as directed acyclic graphs (DAGs) and consist of multiple dependent task definitions. They are executed over a large amount of data, often resulting in thousands of tasks with heterogeneous compute requirements and long runtimes, even on cluster infrastructures. In order to optimize the workflow performance, enough resources, e.g., CPU and memory, need to be provi…

    Submitted 18 July, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

    Comments: Paper accepted in 2022 IEEE International Conference on Big Data Workshop BPOD 2022

  25. Perona: Robust Infrastructure Fingerprinting for Resource-Efficient Big Data Analytics

    Authors: Dominik Scheinert, Soeren Becker, Jonathan Bader, Lauritz Thamsen, Jonathan Will, Odej Kao

    Abstract: Choosing a good resource configuration for big data analytics applications can be challenging, especially in cloud environments. Automated approaches are desirable as poor decisions can reduce performance and raise costs. The majority of existing automated approaches either build performance models from previous workload executions or conduct iterative resource configuration profiling until a near…

    Submitted 30 January, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

    Comments: 8 pages, 5 figures, 3 tables

    Journal ref: IEEE BigData (2022) 209-216

  26. arXiv:2211.07619

    cs.LG

    Federated Learning for Autoencoder-based Condition Monitoring in the Industrial Internet of Things

    Authors: Soeren Becker, Kevin Styp-Rekowski, Oliver Vincent Leon Stoll, Odej Kao

    Abstract: Enabled by the increasing availability of sensor data monitored from production machinery, condition monitoring and predictive maintenance methods are key pillars for an efficient and robust manufacturing production cycle in the Industrial Internet of Things. The employment of machine learning models to detect and predict deteriorating behavior by analyzing a variety of data collected across sever…

    Submitted 14 November, 2022; originally announced November 2022.

    Comments: Accepted for 2022 IEEE International Conference on Big Data (IEEE BigData 2022)

  27. Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing

    Authors: Jonathan Will, Lauritz Thamsen, Jonathan Bader, Dominik Scheinert, Odej Kao

    Abstract: Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving performance. One crucial aspect of selecting an efficient resource configuration is avoiding memory bottlenecks. By knowing the required memory of a job in advance,…

    Submitted 3 February, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

    Comments: 9 pages, 5 Figures, 3 Tables; IEEE BigData 2022. arXiv admin note: substantial text overlap with arXiv:2206.13852

    ACM Class: C.2.4; I.2.8; I.2.6

    Journal ref: 2022 IEEE International Conference on Big Data (Big Data) pp. 161-169

  28. Macaw: The Machine Learning Magnetometer Calibration Workflow

    Authors: Jonathan Bader, Kevin Styp-Rekowski, Leon Doehler, Soeren Becker, Odej Kao

    Abstract: In Earth Systems Science, many complex data pipelines combine different data sources and apply data filtering and analysis steps. Typically, such data analysis processes are historically grown and implemented with many sequentially executed scripts. Scientific workflow management systems (SWMS) allow scientists to use their existing scripts and provide support for parallelization, reusability, mon…

    Submitted 18 July, 2023; v1 submitted 17 October, 2022; originally announced October 2022.

    Comments: Paper accepted in 2022 IEEE International Conference on Data Mining Workshops (ICDMW)

  29. arXiv:2208.09270

    cs.NI cs.DC

    IoTreeplay: Synchronous Distributed Traffic Replay in IoT Environments

    Authors: Markus Toll, Ilja Behnke, Odej Kao

    Abstract: Use-cases in the Internet of Things (IoT) typically involve a high number of interconnected, heterogeneous devices. Due to the criticality of many IoT scenarios, systems and applications need to be tested thoroughly before rollout. Existing staging environments and testing frameworks are able to emulate network properties but fail to deliver actual network-wide traffic control to test systems appl…

    Submitted 19 August, 2022; originally announced August 2022.

    Comments: 2nd International Workshop on Testing Distributed Internet of Things Systems

  30. Reshi: Recommending Resources for Scientific Workflow Tasks on Heterogeneous Infrastructures

    Authors: Jonathan Bader, Fabian Lehmann, Alexander Groth, Lauritz Thamsen, Dominik Scheinert, Jonathan Will, Ulf Leser, Odej Kao

    Abstract: Scientific workflows typically comprise a multitude of different processing steps which often are executed in parallel on different partitions of the input data. These executions, in turn, must be scheduled on the compute nodes of the computational infrastructure at hand. This assignment is complicated by the facts that (a) tasks typically have highly heterogeneous resource requirements and (b) in…

    Submitted 17 October, 2022; v1 submitted 16 August, 2022; originally announced August 2022.

    Comments: Paper accepted in 41st IEEE International Performance Computing and Communications Conference (IPCCC 2022)

  31. arXiv:2208.05862

    cs.DC

    Network Emulation in Large-Scale Virtual Edge Testbeds: A Note of Caution and the Way Forward

    Authors: Soeren Becker, Tobias Pfandzelter, Nils Japke, David Bermbach, Odej Kao

    Abstract: The growing research and industry interest in the Internet of Things and the edge computing paradigm has increased the need for cost-efficient virtual testbeds for large-scale distributed applications. Researchers, students, and practitioners need to test and evaluate the interplay of hundreds or thousands of real software components and services connected with a realistic edge network without acc…

    Submitted 11 August, 2022; originally announced August 2022.

    Comments: Accepted for 2nd International Workshop on Testing Distributed Internet of Things Systems (TDIS 2022)

  32. arXiv:2207.09298

    cs.DC cs.AI

    Magpie: Automatically Tuning Static Parameters for Distributed File Systems using Deep Reinforcement Learning

    Authors: Houkun Zhu, Dominik Scheinert, Lauritz Thamsen, Kordian Gontarska, Odej Kao

    Abstract: Distributed file systems are widely used nowadays, yet using their default configurations is often not optimal. At the same time, tuning configuration parameters is typically challenging and time-consuming. It demands expertise and tuning operations can also be expensive. This is especially the case for static parameters, where changes take effect only after a restart of the system or workloads. W…

    Submitted 22 July, 2022; v1 submitted 19 July, 2022; originally announced July 2022.

    Comments: Accepted at The IEEE International Conference on Cloud Engineering (IC2E) conference 2022

  33. arXiv:2207.03206

    cs.AI

    Leveraging Log Instructions in Log-based Anomaly Detection

    Authors: Jasmin Bogatinovski, Gjorgji Madjarov, Sasho Nedelkoski, Jorge Cardoso, Odej Kao

    Abstract: Artificial Intelligence for IT Operations (AIOps) describes the process of maintaining and operating large IT systems using diverse AI-enabled methods and tools for, e.g., anomaly detection and root cause analysis, to support the remediation, optimization, and automatic initiation of self-stabilizing IT activities. The core step of any AIOps workflow is anomaly detection, typically performed on hi…

    Submitted 7 July, 2022; originally announced July 2022.

    Comments: This paper has been accepted for publication in IEEE Service Computing Conference, 2022, Barcelona

  34. Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

    Authors: Jonathan Will, Lauritz Thamsen, Jonathan Bader, Dominik Scheinert, Odej Kao

    Abstract: Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to low resource utilization -- is often challenging, even for expert users such as data engineers. Further, existing automated approaches to resource selection rel…

    Submitted 10 January, 2023; v1 submitted 28 June, 2022; originally announced June 2022.

    Comments: 9 pages, 3 figures, 2 tables, IEEE IC2E 2022

    ACM Class: C.2.4; I.2.8; I.2.6

    Journal ref: 2022 IEEE International Conference on Cloud Engineering (IC2E), pp. 58-66

  35. arXiv:2206.09679

    cs.DC

    Phoebe: QoS-Aware Distributed Stream Processing through Anticipating Dynamic Workloads

    Authors: Morgan K. Geldenhuys, Dominik Scheinert, Odej Kao, Lauritz Thamsen

    Abstract: Distributed Stream Processing systems have become an essential part of big data processing platforms. They are characterized by the high-throughput processing of near to real-time event streams with the goal of delivering low-latency results and thus enabling time-sensitive decision making. At the same time, results are expected to be consistent even in the presence of partial failures where exact…

    Submitted 20 June, 2022; originally announced June 2022.

    Comments: 10 pages, ICWS2022

  36. Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

    Authors: Lauritz Thamsen, Dominik Scheinert, Jonathan Will, Jonathan Bader, Odej Kao

    Abstract: Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires significant insights into expected job runtimes and scaling behavior, resource characteristics, input data distributions, and other factors. Unable to estimate pe…

    Submitted 1 June, 2022; originally announced June 2022.

  37. Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters

    Authors: Jonathan Bader, Fabian Lehmann, Lauritz Thamsen, Jonathan Will, Ulf Leser, Odej Kao

    Abstract: Many scientific workflow scheduling algorithms need to be informed about task runtimes a-priori to conduct efficient scheduling. In heterogeneous cluster infrastructures, this problem becomes aggravated because these runtimes are required for each task-node pair. Using historical data is often not feasible as logs are typically not retained indefinitely and workloads as well as infrastructure chan…

    Submitted 23 May, 2022; originally announced May 2022.

    Comments: paper accepted in 34th International Conference on Scientific and Statistical Database Management (SSDBM 2022)

  38. Cucumber: Renewable-Aware Admission Control for Delay-Tolerant Cloud and Edge Workloads

    Authors: Philipp Wiesner, Dominik Scheinert, Thorsten Wittkopp, Lauritz Thamsen, Odej Kao

    Abstract: The growing electricity demand of cloud and edge computing increases operational costs and will soon have a considerable impact on the environment. A possible countermeasure is equipping IT infrastructure directly with on-site renewable energy sources. Yet, particularly smaller data centers may not be able to use all generated power directly at all times, while feeding it into the public grid or e…

    Submitted 27 August, 2022; v1 submitted 5 May, 2022; originally announced May 2022.

    Comments: Accepted at Euro-Par 2022. GitHub repository: https://github.com/dos-group/cucumber

  39. arXiv:2204.08846

    cs.NI cs.OS

    Differentiating Network Flows for Priority-Aware Scheduling of Incoming Packets in Real-Time IoT Systems

    Authors: Christoph Blumschein, Ilja Behnke, Lauritz Thamsen, Odej Kao

    Abstract: When IP-packet processing is unconditionally carried out on behalf of an operating system kernel thread, processing systems can experience overload in high incoming traffic scenarios. This is especially worrying for embedded real-time devices controlling their physical environment in industrial IoT scenarios and automotive systems. We propose an embedded real-time aware IP stack adaption with an e…

    Submitted 19 April, 2022; originally announced April 2022.

    Comments: 25th International Symposium on Real-Time Distributed Computing

  40. arXiv:2204.02636

    cs.SE cs.LG

    Failure Identification from Unstable Log Data using Deep Learning

    Authors: Jasmin Bogatinovski, Sasho Nedelkoski, Li Wu, Jorge Cardoso, Odej Kao

    Abstract: The reliability of cloud platforms is of significant relevance because society increasingly relies on complex software systems running on the cloud. To improve it, cloud providers are automating various maintenance tasks, with failure identification frequently being considered. The precondition for automation is the availability of observability tools, with system logs commonly being used. The foc…

    Submitted 6 April, 2022; originally announced April 2022.

    Comments: This paper is accepted for publication at IEEE CCGrid 2022. For fairest citation, please use the original proceedings credentials

  41. Data-Driven Approach for Log Instruction Quality Assessment

    Authors: Jasmin Bogatinovski, Sasho Nedelkoski, Alexander Acker, Jorge Cardoso, Odej Kao

    Abstract: In the current IT world, developers write code while system operators run the code mostly as a black box. The connection between both worlds is typically established with log messages: the developer provides hints to the (unknown) operator, where the cause of an occurred issue is, and vice versa, the operator can report bugs during operation. To fulfil this purpose, developers write log instructio…

    Submitted 6 April, 2022; originally announced April 2022.

    Comments: This paper is accepted for publication at the 30th International Conference on Program Comprehension under doi: 10.1145/3524610.3527906. The copyrights are handled following the corresponding agreement between the author and publisher

  42. SyncMesh: Improving Data Locality for Function-as-a-Service in Meshed Edge Networks

    Authors: Daniel Habenicht, Kevin Kreutz, Soeren Becker, Jonathan Bader, Lauritz Thamsen, Odej Kao

    Abstract: The increasing use of Internet of Things devices coincides with more communication and data movement in networks, which can exceed existing network capabilities. These devices often process sensor or user information, where data privacy and latency are a major concern. Therefore, traditional approaches like cloud computing do not fit well, yet new architectures such as edge computing address this…

    Submitted 28 March, 2022; originally announced March 2022.

  43. arXiv:2203.05362  [pdf, other]

    cs.DC

    Efficient Runtime Profiling for Black-box Machine Learning Services on Sensor Streams

    Authors: Soeren Becker, Dominik Scheinert, Florian Schmidt, Odej Kao

    Abstract: In highly distributed environments such as cloud, edge and fog computing, the application of machine learning for automating and optimizing processes is on the rise. Machine learning jobs are frequently applied in streaming conditions, where models are used to analyze data streams originating from e.g. video streams or sensory data. Often the results for particular data samples need to be provided…

    Submitted 10 March, 2022; originally announced March 2022.

    Comments: Accepted as a short paper at the 6th IEEE International Conference on Fog and Edge Computing 2022

  44. arXiv:2111.13462  [pdf, other]

    cs.DB cs.GL cs.LG

    A Taxonomy of Anomalies in Log Data

    Authors: Thorsten Wittkopp, Philipp Wiesner, Dominik Scheinert, Odej Kao

    Abstract: Log data anomaly detection is a core component in the area of artificial intelligence for IT operations. However, the large amount of existing methods makes it hard to choose the right approach for a specific system. A better understanding of different kinds of anomalies, and which algorithms are suitable for detecting them, would support researchers and IT operators. Although a common taxonomy fo…

    Submitted 26 November, 2021; originally announced November 2021.

    Comments: Paper accepted and presented at AIOPS workshop 2021 co-located with ICSOC 2021

  45. Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters

    Authors: Jonathan Bader, Lauritz Thamsen, Svetlana Kulagina, Jonathan Will, Henning Meyerhenke, Odej Kao

    Abstract: Scientific workflow management systems like Nextflow support large-scale data analysis by abstracting away the details of scientific workflows. In these systems, workflows consist of several abstract tasks, of which instances are run in parallel and transform input partitions into output partitions. Resource managers like Kubernetes execute such workflow tasks on cluster infrastructures. However,…

    Submitted 19 January, 2022; v1 submitted 9 November, 2021; originally announced November 2021.

    Journal ref: IEEE Big Data (2021), 65-75

  46. arXiv:2109.13009  [pdf, other]

    cs.DC

    LOS: Local-Optimistic Scheduling of Periodic Model Training For Anomaly Detection on Sensor Data Streams in Meshed Edge Networks

    Authors: Soeren Becker, Florian Schmidt, Lauritz Thamsen, Ana Juan Ferrer, Odej Kao

    Abstract: Anomaly detection is increasingly important to handle the amount of sensor data in Edge and Fog environments, Smart Cities, as well as in Industry 4.0. To ensure good results, the utilized ML models need to be updated periodically to adapt to seasonal changes and concept drifts in the sensor data. Although the increasing resource availability at the edge can allow for in-situ execution of model tr…

    Submitted 27 September, 2021; originally announced September 2021.

    Comments: 2nd IEEE International Conference on Autonomic Computing and Self-Organizing Systems - ACSOS 2021

  47. EdgePier: P2P-based Container Image Distribution in Edge Computing Environments

    Authors: Soeren Becker, Florian Schmidt, Odej Kao

    Abstract: Edge and fog computing architectures utilize container technologies in order to offer a lightweight application deployment. Container images are stored in registry services and operated by orchestration platforms to download and start the respective applications on nodes of the infrastructure. During large application rollouts, the connection to the registry is prone to become a bottleneck, which…

    Submitted 27 September, 2021; originally announced September 2021.

    Comments: 40th IEEE International Performance Computing and Communications Conference 2021

  48. arXiv:2109.09537  [pdf, other]

    cs.LG

    A2Log: Attentive Augmented Log Anomaly Detection

    Authors: Thorsten Wittkopp, Alexander Acker, Sasho Nedelkoski, Jasmin Bogatinovski, Dominik Scheinert, Wu Fan, Odej Kao

    Abstract: Anomaly detection becomes increasingly important for the dependability and serviceability of IT services. As log lines record events during the execution of IT services, they are a primary source for diagnostics. Thereby, unsupervised methods provide a significant benefit since not all anomalies can be known at training time. Existing unsupervised methods need anomaly examples to obtain a suitable…

    Submitted 20 September, 2021; originally announced September 2021.

    Comments: This paper has been accepted for HICSS 2022 and will appear in the conference proceedings

  49. arXiv:2109.02340  [pdf, other]

    cs.DC

    Khaos: Dynamically Optimizing Checkpointing for Dependable Distributed Stream Processing

    Authors: Morgan K. Geldenhuys, Benjamin J. J. Pfister, Dominik Scheinert, Lauritz Thamsen, Odej Kao

    Abstract: Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. As such, making timely decisions based on these results is dependent on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance and the ability to recover automatic…

    Submitted 26 January, 2023; v1 submitted 6 September, 2021; originally announced September 2021.

  50. Enel: Context-Aware Dynamic Scaling of Distributed Dataflow Jobs using Graph Propagation

    Authors: Dominik Scheinert, Houkun Zhu, Lauritz Thamsen, Morgan K. Geldenhuys, Jonathan Will, Alexander Acker, Odej Kao

    Abstract: Distributed dataflow systems like Spark and Flink enable the use of clusters for scalable data analytics. While runtime prediction models can be used to initially select appropriate cluster resources given target runtimes, the actual runtime performance of dataflow jobs depends on several factors and varies over time. Yet, in many situations, dynamic scaling can be used to meet formulated runtime…

    Submitted 26 January, 2022; v1 submitted 27 August, 2021; originally announced August 2021.

    Comments: 8 pages, 5 figures, 3 tables

    Journal ref: IEEE IPCCC (2021) 1-8