-
Artifact Evaluation for Distributed Systems: Current Practices and Beyond
Authors:
Mohammad Reza Saleh Sedghpour,
Alessandro Vittorio Papadopoulos,
Cristian Klein,
Johan Tordsson
Abstract:
Although repeatability and reproducibility are essential in science, failed attempts to replicate results across diverse fields made some scientists argue for a reproducibility crisis. In response, several high-profile venues within computing established artifact evaluation tracks, a systematic procedure for evaluating and badging research artifacts, with an increasing number of artifacts submitted. This study compiles recent artifact evaluation procedures and guidelines to show how artifact evaluation in distributed systems research lags behind other computing disciplines and/or is less unified and more complex. We further argue that current artifact assessment criteria are uncoordinated and insufficient for the unique challenges of distributed systems research. We examine the current state of the practice for artifacts and their evaluation to provide recommendations to assist artifact authors, reviewers, and track chairs. We summarize the recommendations and best practices as checklists for artifact authors and evaluation committees. Although our recommendations alone will not resolve the repeatability and reproducibility crisis, we want to start a discussion in our community to increase the number of submitted artifacts and their quality over time.
Submitted 18 June, 2024;
originally announced June 2024.
-
Robust Online Epistemic Replanning of Multi-Robot Missions
Authors:
Lauren Bramblett,
Branko Miloradovic,
Patrick Sherman,
Alessandro V. Papadopoulos,
Nicola Bezzo
Abstract:
As Multi-Robot Systems (MRS) become more affordable and computing capabilities grow, they provide significant advantages for complex applications such as environmental monitoring, underwater inspections, or space exploration. However, accounting for potential communication loss or the unavailability of communication infrastructures in these application domains remains an open problem. Much of the applicable MRS research assumes that the system can sustain communication through proximity regulations and formation control or by devising a framework for separating and adhering to a predetermined plan for extended periods of disconnection. The latter technique enables an MRS to be more efficient, but breakdowns and environmental uncertainties can have a domino effect throughout the system, particularly when the mission goal is intricate or time-sensitive. To deal with this problem, our proposed framework has two main phases: i) a centralized planner to allocate mission tasks by rewarding intermittent rendezvous between robots to mitigate the effects of the unforeseen events during mission execution, and ii) a decentralized replanning scheme leveraging epistemic planning to formalize belief propagation and a Monte Carlo tree search for policy optimization given distributed rational belief updates. The proposed framework outperforms a baseline heuristic and is validated using simulations and experiments with aerial vehicles.
Submitted 1 March, 2024;
originally announced March 2024.
-
The SPEC-RG Reference Architecture for the Compute Continuum
Authors:
Matthijs Jansen,
Auday Al-Dulaimy,
Alessandro V. Papadopoulos,
Animesh Trivedi,
Alexandru Iosup
Abstract:
As the next generation of diverse workloads like autonomous driving and augmented/virtual reality evolves, computation is shifting from cloud-based services to the edge, leading to the emergence of a cloud-edge compute continuum. This continuum promises a wide spectrum of deployment opportunities for workloads that can leverage the strengths of cloud (scalable infrastructure, high reliability) and edge (energy efficient, low latencies). Despite its promises, the continuum has only been studied in silos of various computing models, thus lacking strong end-to-end theoretical and engineering foundations for computing and resource management across the continuum. Consequently, developers resort to ad hoc approaches to reason about performance and resource utilization of workloads in the continuum. In this work, we conduct a first-of-its-kind systematic study of various computing models, identify salient properties, and make a case to unify them under a compute continuum reference architecture. This architecture provides an end-to-end analysis framework for developers to reason about resource management, workload distribution, and performance analysis. We demonstrate the utility of the reference architecture by analyzing two popular continuum workloads, deep learning and industrial IoT. We have developed an accompanying deployment and benchmarking framework and first-order analytical model for quantitative reasoning of continuum workloads. The framework is open-sourced and available at https://github.com/atlarge-research/continuum.
Submitted 2 March, 2023; v1 submitted 8 July, 2022;
originally announced July 2022.
-
STRETCH: Virtual Shared-Nothing Parallelism for Scalable and Elastic Stream Processing
Authors:
Vincenzo Gulisano,
Hannaneh Najdataei,
Yiannis Nikolakopoulos,
Alessandro V. Papadopoulos,
Marina Papatriantafilou,
Philippas Tsigas
Abstract:
Stream processing applications extract value from raw data through Directed Acyclic Graphs of data analysis tasks. Shared-nothing (SN) parallelism is the de-facto standard to scale stream processing applications. Given an application, SN parallelism instantiates several copies of each analysis task, making each instance responsible for a dedicated portion of the overall analysis, and relies on dedicated queues to exchange data among connected instances. On the one hand, SN parallelism can scale the execution of applications both up and out since threads can run task instances within and across processes/nodes. On the other hand, its lack of sharing can cause unnecessary overheads and hinder the scaling up when threads operate on data that could be jointly accessed in shared memory. This trade-off motivated us in studying a way for stream processing applications to leverage shared memory and boost the scale up (before the scale out) while adhering to the widely-adopted and SN-based APIs for stream processing applications.
We introduce STRETCH, a framework that maximizes the scale up and offers instantaneous elastic reconfigurations (without state transfer) for stream processing applications. We propose the concept of Virtual Shared-Nothing (VSN) parallelism and elasticity and provide formal definitions and correctness proofs for the semantics of the analysis tasks supported by STRETCH, showing they extend the ones found in common Stream Processing Engines. We also provide a fully implemented prototype and show that STRETCH's performance exceeds that of state-of-the-art frameworks such as Apache Flink and offers, to the best of our knowledge, unprecedented ultra-fast reconfigurations, taking less than 40 ms even when provisioning tens of new task instances.
Submitted 29 April, 2022; v1 submitted 25 November, 2021;
originally announced November 2021.
-
Towards Mapping Control Theory and Software Engineering Properties using Specification Patterns
Authors:
Ricardo Caldas,
Razan Ghzouli,
Alessandro V. Papadopoulos,
Patrizio Pelliccione,
Danny Weyns,
Thorsten Berger
Abstract:
A traditional approach to realize self-adaptation in software engineering (SE) is by means of feedback loops. The goals of the system can be specified as formal properties that are verified against models of the system. On the other hand, control theory (CT) provides a well-established foundation for designing feedback loop systems and providing guarantees for essential properties, such as stability, settling time, and steady state error. Currently, it is an open question whether and how traditional SE approaches to self-adaptation consider properties from CT. Answering this question is challenging given the principal differences in representing properties in both fields. In this paper, we take a first step to answer this question. We follow a bottom up approach where we specify a control design (in Simulink) for a case inspired by Scuderia Ferrari (F1) and provide evidence for stability and safety. The design is then transferred into code (in C) that is further optimized. Next, we define properties that enable verifying whether the control properties still hold at code level. Then, we consolidate the solution by mapping the properties in both worlds using specification patterns as common language and we verify the correctness of this mapping. The mapping offers a reusable artifact to solve similar problems. Finally, we outline opportunities for future work, particularly to refine and extend the mapping and investigate how it can improve the engineering of self-adaptive systems for both SE and CT engineers.
Submitted 23 May, 2022; v1 submitted 18 August, 2021;
originally announced August 2021.
-
Performance Modeling and Vertical Autoscaling of Stream Joins
Authors:
Hannaneh Najdataei,
Vincenzo Gulisano,
Alessandro V. Papadopoulos,
Ivan Walulya,
Marina Papatriantafilou,
Philippas Tsigas
Abstract:
Streaming analysis is widely used in cloud as well as edge infrastructures. In these contexts, fine-grained application performance can be based on accurate modeling of streaming operators. This is especially beneficial for computationally expensive operators like adaptive stream joins that, being very sensitive to rate-varying data streams, would otherwise require costly frequent monitoring.
We propose a dynamic model for the processing throughput and latency of adaptive stream joins that run with different parallelism degrees. The model is presented with progressive complexity, from a centralized non-deterministic up to a deterministic parallel stream join, describing how throughput and latency dynamics are influenced by various configuration parameters. The model is catalytic for understanding the behavior of stream joins against different system deployments, as we show with our model-based autoscaling methodology to change the parallelism degree of stream joins during the execution. Our thorough evaluation, for a broad spectrum of parameters, confirms the model can reliably predict throughput and latency metrics with a fairly high accuracy, with the median error in estimation ranging from approximately 0.1% to 6.5%, even for an overloaded system. Furthermore, we show that our model allows efficient control of adaptive stream joins by estimating the needed resources solely based on the observed input load. In particular, we show it can be employed to enable efficient autoscaling, even when big changes in the input load happen frequently (in the realm of seconds).
Submitted 29 November, 2021; v1 submitted 11 May, 2020;
originally announced May 2020.
-
Towards Bridging the Gap between Control and Self-Adaptive System Properties
Authors:
Javier Cámara,
Alessandro V. Papadopoulos,
Thomas Vogel,
Danny Weyns,
David Garlan,
Shihong Huang,
Kenji Tei
Abstract:
Two of the main paradigms used to build adaptive software employ different types of properties to capture relevant aspects of the system's run-time behavior. On the one hand, control systems consider properties that concern static aspects like stability, as well as dynamic properties that capture the transient evolution of variables such as settling time. On the other hand, self-adaptive systems consider mostly non-functional properties that capture concerns such as performance, reliability, and cost. In general, it is not easy to reconcile these two types of properties or identify under which conditions they constitute a good fit to provide run-time guarantees. There is a need of identifying the key properties in the areas of control and self-adaptation, as well as of characterizing and mapping them to better understand how they relate and possibly complement each other. In this paper, we take a first step to tackle this problem by: (1) identifying a set of key properties in control theory, (2) illustrating the formalization of some of these properties employing temporal logic languages commonly used to engineer self-adaptive software systems, and (3) illustrating how to map key properties that characterize self-adaptive software systems into control properties, leveraging their formalization in temporal logics. We illustrate the different steps of the mapping on an exemplar case in the cloud computing domain and conclude with identifying open challenges in the area.
Submitted 24 April, 2020;
originally announced April 2020.
-
Performance-Feedback Autoscaling with Budget Constraints for Cloud-based Workloads of Workflows
Authors:
Alexey Ilyushkin,
André Bauer,
Alessandro V. Papadopoulos,
Ewa Deelman,
Alexandru Iosup
Abstract:
The growing popularity of workflows in the cloud domain promoted the development of sophisticated autoscaling policies that allow automatic allocation and deallocation of resources. However, many state-of-the-art autoscaling policies for workflows are mostly plan-based or designed for batches (ensembles) of workflows. This reduces their flexibility when dealing with workloads of workflows, as the workloads are often subject to unpredictable resource demand fluctuations. Moreover, autoscaling in clouds almost always imposes budget constraints that should be satisfied. The budget-aware autoscalers for workflows usually require task runtime estimates to be provided beforehand, which is not always possible when dealing with workloads due to their dynamic nature. To address these issues, we propose a novel Performance-Feedback Autoscaler (PFA) that is budget-aware and does not require task runtime estimates for its operation. Instead, it uses the performance-feedback loop that monitors the average throughput on each resource type. We implement PFA in the popular Apache Airflow workflow management system, and compare the performance of our autoscaler with two other state-of-the-art autoscalers, and with the optimal solution obtained with the Mixed Integer Programming approach. Our results show that PFA outperforms other considered online autoscalers, as it effectively minimizes the average job slowdown by up to 47% while still satisfying the budget constraints. Moreover, PFA shows up to 76% lower average runtime than the competitors.
Submitted 23 July, 2019; v1 submitted 24 May, 2019;
originally announced May 2019.
-
Reverse Flooding: exploiting radio interference for efficient propagation delay compensation in WSN clock synchronization
Authors:
Federico Terraneo,
Alberto Leva,
Silvano Seva,
Martina Maggio,
Alessandro Vittorio Papadopoulos
Abstract:
Clock synchronization is a necessary component in modern distributed systems, especially Wireless Sensor Networks (WSNs). Despite the great effort and the numerous improvements, the existing synchronization schemes do not yet address the cancellation of propagation delays. Up to a few years ago, this was not perceived as a problem, because the time-stamping precision was a more limiting factor for the accuracy achievable with a synchronization scheme. However, the recent introduction of efficient flooding schemes based on constructive interference has greatly improved the achievable accuracy, to the point where propagation delays can effectively become the main source of error. In this paper, we propose a method to estimate and compensate for the network propagation delays. Our proposal does not require maintaining a spanning tree of the network, and exploits constructive interference even to transmit packets whose contents are slightly different. To show the validity of the approach, we implemented the propagation delay estimator on top of the FLOPSYNC-2 synchronization scheme. Experimental results prove the feasibility of measuring propagation delays using off-the-shelf microcontrollers and radio transceivers, and show how the proposed solution makes it possible to achieve sub-microsecond clock synchronization even for networks where propagation delays are significant.
Submitted 17 August, 2018;
originally announced August 2018.