Showing 1–20 of 20 results for author: Will, J

  1. arXiv:2403.05692  [pdf, other]

    cs.DC

    Privacy-Preserving Sharing of Data Analytics Runtime Metrics for Performance Modeling

    Authors: Jonathan Will, Dominik Scheinert, Jan Bode, Cedric Kring, Seraphin Zunzer, Lauritz Thamsen

    Abstract: Performance modeling for large-scale data analytics workloads can improve the efficiency of cluster resource allocations and job scheduling. However, the performance of these workloads is influenced by numerous factors, such as job inputs and the assigned cluster resources. As a result, performance models require significant amounts of training data. This data can be obtained by exchanging runtime…

    Submitted 13 March, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

    Comments: 4 pages, 4 figures, presented at the WOSP-C workshop at ICPE 2024

  2. Towards a Peer-to-Peer Data Distribution Layer for Efficient and Collaborative Resource Optimization of Distributed Dataflow Applications

    Authors: Dominik Scheinert, Soeren Becker, Jonathan Will, Luis Englaender, Lauritz Thamsen

    Abstract: Performance modeling can help to improve the resource efficiency of clusters and distributed dataflow applications, yet the available modeling data is often limited. Collaborative approaches to performance modeling, characterized by the sharing of performance data or models, have been shown to improve resource efficiency, but there has been little focus on actual data sharing strategies and implem…

    Submitted 23 January, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

    Comments: 7 pages, 4 figures, 2 tables

    Journal ref: IEEE BigData (2023) 2339-2345

  3. arXiv:2310.20168  [pdf, other]

    cs.LG physics.ao-ph physics.flu-dyn

    Understanding and Visualizing Droplet Distributions in Simulations of Shallow Clouds

    Authors: Justus C. Will, Andrea M. Jenney, Kara D. Lamb, Michael S. Pritchard, Colleen Kaul, Po-Lun Ma, Kyle Pressel, Jacob Shpund, Marcus van Lier-Walqui, Stephan Mandt

    Abstract: Thorough analysis of local droplet-level interactions is crucial to better understand the microphysical processes in clouds and their effect on the global climate. High-accuracy simulations of relevant droplet size distributions from Large Eddy Simulations (LES) of bin microphysics challenge current analysis techniques due to their high dimensionality involving three spatial dimensions, time, and…

    Submitted 31 October, 2023; originally announced October 2023.

    Comments: 4 pages, 3 figures, accepted at NeurIPS 2023 (Machine Learning and the Physical Sciences Workshop)

  4. Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics

    Authors: Dominik Scheinert, Philipp Wiesner, Thorsten Wittkopp, Lauritz Thamsen, Jonathan Will, Odej Kao

    Abstract: Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and energy usage, automated approaches are gaining popularity. Most existing methods rely on profiling recurring workloads to find near-optimal solutions over time. Due…

    Submitted 23 November, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

    Comments: 10 pages, 9 figures

    Journal ref: IEEE IPCCC (2023) 403-412

  5. arXiv:2306.08754  [pdf, other]

    cs.LG physics.ao-ph

    ClimSim: A large multi-scale dataset for hybrid physics-ML climate emulation

    Authors: Sungduk Yu, Walter Hannah, Liran Peng, Jerry Lin, Mohamed Aziz Bhouri, Ritwik Gupta, Björn Lütjens, Justus Christopher Will, Gunnar Behrens, Julius Busecke, Nora Loose, Charles I Stern, Tom Beucler, Bryce Harrop, Benjamin R Hillman, Andrea Jenney, Savannah Ferretti, Nana Liu, Anima Anandkumar, Noah D Brenowitz, Veronika Eyring, Nicholas Geneva, Pierre Gentine, Stephan Mandt, Jaideep Pathak, et al. (31 additional authors not shown)

    Abstract: Modern climate projections lack adequate spatial and temporal resolution due to computational constraints. A consequence is inaccurate and imprecise predictions of critical processes such as storms. Hybrid methods that combine physics with machine learning (ML) have introduced a new generation of higher fidelity climate simulators that can sidestep Moore's Law by outsourcing compute-hungry, short,…

    Submitted 6 February, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023 Outstanding Datasets and Benchmarks Track Paper

  6. Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?

    Authors: Jonathan Will, Lauritz Thamsen, Dominik Scheinert, Odej Kao

    Abstract: Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial consideration. In this paper, we analyze the challenge of efficient resource allocation for distributed data processing, focusing on memory. We emphasize that in…

    Submitted 7 June, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

    Comments: 4 pages, 3 Figures; ACM SSDBM 2023

    ACM Class: C.2.4; C.4; I.2.8; H.2.8; H.2.4

  7. Perona: Robust Infrastructure Fingerprinting for Resource-Efficient Big Data Analytics

    Authors: Dominik Scheinert, Soeren Becker, Jonathan Bader, Lauritz Thamsen, Jonathan Will, Odej Kao

    Abstract: Choosing a good resource configuration for big data analytics applications can be challenging, especially in cloud environments. Automated approaches are desirable as poor decisions can reduce performance and raise costs. The majority of existing automated approaches either build performance models from previous workload executions or conduct iterative resource configuration profiling until a near…

    Submitted 30 January, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

    Comments: 8 pages, 5 figures, 3 tables

    Journal ref: IEEE BigData (2022) 209-216

  8. Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing

    Authors: Jonathan Will, Lauritz Thamsen, Jonathan Bader, Dominik Scheinert, Odej Kao

    Abstract: Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving performance. One crucial aspect of selecting an efficient resource configuration is avoiding memory bottlenecks. By knowing the required memory of a job in advance,…

    Submitted 3 February, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

    Comments: 9 pages, 5 Figures, 3 Tables; IEEE BigData 2022. arXiv admin note: substantial text overlap with arXiv:2206.13852

    ACM Class: C.2.4; I.2.8; I.2.6

    Journal ref: 2022 IEEE International Conference on Big Data (Big Data) pp. 161-169

  9. Reshi: Recommending Resources for Scientific Workflow Tasks on Heterogeneous Infrastructures

    Authors: Jonathan Bader, Fabian Lehmann, Alexander Groth, Lauritz Thamsen, Dominik Scheinert, Jonathan Will, Ulf Leser, Odej Kao

    Abstract: Scientific workflows typically comprise a multitude of different processing steps which often are executed in parallel on different partitions of the input data. These executions, in turn, must be scheduled on the compute nodes of the computational infrastructure at hand. This assignment is complicated by the facts that (a) tasks typically have highly heterogeneous resource requirements and (b) in…

    Submitted 17 October, 2022; v1 submitted 16 August, 2022; originally announced August 2022.

    Comments: Paper accepted in 41st IEEE International Performance Computing and Communications Conference (IPCCC 2022)

  10. Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

    Authors: Jonathan Will, Lauritz Thamsen, Jonathan Bader, Dominik Scheinert, Odej Kao

    Abstract: Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to low resource utilization -- is often challenging, even for expert users such as data engineers. Further, existing automated approaches to resource selection rel…

    Submitted 10 January, 2023; v1 submitted 28 June, 2022; originally announced June 2022.

    Comments: 9 pages, 3 figures, 2 tables, IEEE IC2E 2022

    ACM Class: C.2.4; I.2.8; I.2.6

    Journal ref: 2022 IEEE International Conference on Cloud Engineering (IC2E), pp. 58-66

  11. Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

    Authors: Lauritz Thamsen, Dominik Scheinert, Jonathan Will, Jonathan Bader, Odej Kao

    Abstract: Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires significant insights into expected job runtimes and scaling behavior, resource characteristics, input data distributions, and other factors. Unable to estimate pe…

    Submitted 1 June, 2022; originally announced June 2022.

  12. Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters

    Authors: Jonathan Bader, Fabian Lehmann, Lauritz Thamsen, Jonathan Will, Ulf Leser, Odej Kao

    Abstract: Many scientific workflow scheduling algorithms need to be informed about task runtimes a-priori to conduct efficient scheduling. In heterogeneous cluster infrastructures, this problem becomes aggravated because these runtimes are required for each task-node pair. Using historical data is often not feasible as logs are typically not retained indefinitely and workloads as well as infrastructure chan…

    Submitted 23 May, 2022; originally announced May 2022.

    Comments: paper accepted in 34th International Conference on Scientific and Statistical Database Management (SSDBM 2022)

  13. On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds

    Authors: Dominik Scheinert, Alireza Alamgiralem, Jonathan Bader, Jonathan Will, Thorsten Wittkopp, Lauritz Thamsen

    Abstract: With the growing amount of data, data processing workloads and the management of their resource usage becomes increasingly important. Since managing a dedicated infrastructure is in many situations infeasible or uneconomical, users progressively execute their respective workloads in the cloud. As the configuration of workloads and resources is often challenging, various methods have been proposed…

    Submitted 16 January, 2022; v1 submitted 16 November, 2021; originally announced November 2021.

    Comments: 6 pages, 5 figures, 1 table

    Journal ref: IEEE BigData (2021) 3113-3118

  14. Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud

    Authors: Jonathan Will, Onur Arslan, Jonathan Bader, Dominik Scheinert, Lauritz Thamsen

    Abstract: Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing suitable cluster resources for distributed dataflow jobs in both type and number is difficult, especially for users who do not have access to previous performance metrics. One approach to overcoming this issue is to have users share runt…

    Submitted 11 March, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

    Comments: 6 pages, 5 figures, Accepted for the BPOD Workshop at IEEE Big Data 2021

    ACM Class: C.2.4; I.2.8; I.2.6

    Journal ref: IEEE Big Data (2021) 3141-3146

  15. Tarema: Adaptive Resource Allocation for Scalable Scientific Workflows in Heterogeneous Clusters

    Authors: Jonathan Bader, Lauritz Thamsen, Svetlana Kulagina, Jonathan Will, Henning Meyerhenke, Odej Kao

    Abstract: Scientific workflow management systems like Nextflow support large-scale data analysis by abstracting away the details of scientific workflows. In these systems, workflows consist of several abstract tasks, of which instances are run in parallel and transform input partitions into output partitions. Resource managers like Kubernetes execute such workflow tasks on cluster infrastructures. However,…

    Submitted 19 January, 2022; v1 submitted 9 November, 2021; originally announced November 2021.

    Journal ref: IEEE Big Data (2021), 65-75

  16. Enel: Context-Aware Dynamic Scaling of Distributed Dataflow Jobs using Graph Propagation

    Authors: Dominik Scheinert, Houkun Zhu, Lauritz Thamsen, Morgan K. Geldenhuys, Jonathan Will, Alexander Acker, Odej Kao

    Abstract: Distributed dataflow systems like Spark and Flink enable the use of clusters for scalable data analytics. While runtime prediction models can be used to initially select appropriate cluster resources given target runtimes, the actual runtime performance of dataflow jobs depends on several factors and varies over time. Yet, in many situations, dynamic scaling can be used to meet formulated runtime…

    Submitted 26 January, 2022; v1 submitted 27 August, 2021; originally announced August 2021.

    Comments: 8 pages, 5 figures, 3 tables

    Journal ref: IEEE IPCCC (2021) 1-8

  17. arXiv:2108.10721  [pdf, other]

    cs.DC

    Dependable IoT Data Stream Processing for Monitoring and Control of Urban Infrastructures

    Authors: Morgan K. Geldenhuys, Jonathan Will, Benjamin J. J. Pfister, Martin Haug, Alexander Scharmann, Lauritz Thamsen

    Abstract: The Internet of Things describes a network of physical devices interacting and producing vast streams of sensor data. At present there are a number of general challenges which exist while developing solutions for use cases involving the monitoring and control of urban infrastructures. These include the need for a dependable method for extracting value from these high volume streams of time sensiti…

    Submitted 24 August, 2021; originally announced August 2021.

  18. Bellamy: Reusing Performance Models for Distributed Dataflow Jobs Across Contexts

    Authors: Dominik Scheinert, Lauritz Thamsen, Houkun Zhu, Jonathan Will, Alexander Acker, Thorsten Wittkopp, Odej Kao

    Abstract: Distributed dataflow systems enable the use of clusters for scalable data analytics. However, selecting appropriate cluster resources for a processing job is often not straightforward. Performance models trained on historical executions of a concrete job are helpful in such situations, yet they are usually bound to a specific job execution context (e.g. node type, software versions, job parameters…

    Submitted 17 October, 2021; v1 submitted 29 July, 2021; originally announced July 2021.

    Comments: 10 pages, 8 figures, 2 tables

    Journal ref: IEEE CLUSTER (2021) 261-270

  19. C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds

    Authors: Jonathan Will, Lauritz Thamsen, Dominik Scheinert, Jonathan Bader, Odej Kao

    Abstract: Distributed dataflow systems enable data-parallel processing of large datasets on clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. Yet, selecting appropriate cloud resources for dataflow jobs - that neither lead to bottlenecks nor to low resource utilization - is often challenging, even for expert users such as data engineers. W…

    Submitted 1 December, 2021; v1 submitted 28 July, 2021; originally announced July 2021.

    Comments: 10 pages, 5 figures, IEEE IC2E 2021. arXiv admin note: text overlap with arXiv:2011.07965

    ACM Class: C.2.4; I.2.8; I.2.6

    Journal ref: IEEE IC2E (2021) 43-52

  20. Towards Collaborative Optimization of Cluster Configurations for Distributed Dataflow Jobs

    Authors: Jonathan Will, Jonathan Bader, Lauritz Thamsen

    Abstract: Analyzing large datasets with distributed dataflow systems requires the use of clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. However, picking the appropriate resources in both type and number can often be challenging, as the selected configuration needs to match a distributed dataflow job's resource demands and access patterns.…

    Submitted 27 April, 2021; v1 submitted 16 November, 2020; originally announced November 2020.

    Comments: 6 pages, 7 figures, 1 table; Associated experiment results: https://github.com/dos-group/c3o-experiments ; Appearance in the Proceedings of the 2020 IEEE International Conference on Big Data (Big Data); Presentation at the 4th International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD). IEEE. 2020

    ACM Class: C.2.4; I.2.8; I.2.6

    Journal ref: IEEE BigData (2020) 2851-2856