-
On the Semantic Overlap of Operators in Stream Processing Engines
Authors:
Vincenzo Gulisano,
Alessandro Margara,
Marina Papatriantafilou
Abstract:
Stream processing is extensively used in the IoT-to-Cloud spectrum to distill information from continuous streams of data. Streaming applications usually run in dedicated Stream Processing Engines (SPEs) that adopt the DataFlow model, which defines such applications as graphs of operators that, step by step, transform data into the desired results. As operators can be deployed and executed independently, the DataFlow model supports parallelism and distribution, thus making streaming applications scalable.
Today, we witness an abundance of SPEs, each with its set of operators. In this context, understanding how operators' semantics overlap within and across SPEs, and thus which SPEs can support a given application, is not trivial. We tackle this problem by formally showing that common operators of SPEs can be expressed as compositions of a single, minimalistic Aggregate operator, thus showing any framework able to run compositions of such an operator can run applications defined for state-of-the-art SPEs. The Aggregate operator only relies on core concepts of the DataFlow model such as data partitioning by key and time-based windows, and can only output up to one value for each window it analyzes. Together with our formal argumentation, we empirically assess how an SPE that only relies on such an operator compares with an SPE offering operator-specific implementations, as well as study the performance impact of a more expressive Aggregate operator by relaxing the constraint of outputting up to one value per window.
The existence of such a common denominator not only implies the portability of operators within and across SPEs but also defines a concise set of requirements for other data processing frameworks to support streaming applications.
Submitted 1 March, 2023;
originally announced March 2023.
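To make the operator described in this abstract concrete, here is a minimal sketch of a keyed, time-windowed aggregate that emits at most one value per window. It is an illustration only and not the paper's formal definition: the names `aggregate`, `key_fn`, `window_size`, and `fn` are hypothetical, windows are assumed to be tumbling (non-overlapping), timestamps are assumed to be integers, and the whole input is buffered rather than processed incrementally as a real SPE would.

```python
from collections import defaultdict


def aggregate(tuples, key_fn, window_size, fn):
    """Apply fn once per (key, tumbling window); emit at most one value per window.

    tuples: iterable of (timestamp, payload) pairs with integer timestamps.
    fn(key, window_start, payloads) returns a value, or None to emit nothing.
    """
    windows = defaultdict(list)
    for ts, payload in tuples:
        start = (ts // window_size) * window_size  # align to the window boundary
        windows[(key_fn(payload), start)].append(payload)

    results = []
    # Visit windows in order of start time, then key, for a deterministic output.
    for key, start in sorted(windows, key=lambda k: (k[1], k[0])):
        out = fn(key, start, windows[(key, start)])
        if out is not None:  # the function may decline to produce an output
            results.append((start + window_size, out))
    return results


# Hypothetical input: (timestamp, (sensor id, value)).
readings = [(0, ("a", 1)), (3, ("a", 2)), (4, ("b", 5)), (12, ("a", 7))]


def sum_per_key(key, start, payloads):
    return (key, sum(value for _, value in payloads))


def sum_if_large(key, start, payloads):
    total = sum(value for _, value in payloads)
    return (key, total) if total >= 4 else None


sums = aggregate(readings, lambda p: p[0], 10, sum_per_key)
large = aggregate(readings, lambda p: p[0], 10, sum_if_large)
```

Returning `None` from the window function is how this sketch models "up to one value per window": `sum_if_large` behaves like an aggregation followed by a filter, which hints at how other operators might be layered on the same primitive.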
-
Geographical Peer Matching for P2P Energy Sharing
Authors:
Romaric Duvignau,
Vincenzo Gulisano,
Marina Papatriantafilou,
Ralf Klasing
Abstract:
Significant cost reductions attract ever more households to invest in small-scale renewable electricity generation and storage. Such distributed resources are not used in the most effective way when only used individually, as sharing them provides even greater cost savings. Energy Peer-to-Peer (P2P) systems have thus been shown to be beneficial for prosumers and consumers through reductions in energy cost while also being attractive to grid or service providers. However, many practical challenges have to be overcome before all players could gain in having efficient and automated local energy communities; such challenges include the inherent complexity of matching together geographically distributed peers and the significant computations required to calculate the local matching preferences. Hence dedicated algorithms are required to be able to perform a cost-efficient matching of thousands of peers in a computational-efficient fashion. We define and analyze in this work a precise mathematical modelling of the geographical peer matching problem and several heuristics solving it. Our experimental study, based on real-world energy data, demonstrates that our solutions are efficient both in terms of cost savings achieved by the peers and in terms of communication and computing requirements. Our scalable algorithms thus provide one core building block for practical and data-efficient peer-to-peer energy sharing communities within large-scale optimization systems.
Submitted 24 January, 2024; v1 submitted 21 December, 2021;
originally announced December 2021.
-
STRETCH: Virtual Shared-Nothing Parallelism for Scalable and Elastic Stream Processing
Authors:
Vincenzo Gulisano,
Hannaneh Najdataei,
Yiannis Nikolakopoulos,
Alessandro V. Papadopoulos,
Marina Papatriantafilou,
Philippas Tsigas
Abstract:
Stream processing applications extract value from raw data through Directed Acyclic Graphs of data analysis tasks. Shared-nothing (SN) parallelism is the de-facto standard to scale stream processing applications. Given an application, SN parallelism instantiates several copies of each analysis task, making each instance responsible for a dedicated portion of the overall analysis, and relies on dedicated queues to exchange data among connected instances. On the one hand, SN parallelism can scale the execution of applications both up and out since threads can run task instances within and across processes/nodes. On the other hand, its lack of sharing can cause unnecessary overheads and hinder the scaling up when threads operate on data that could be jointly accessed in shared memory. This trade-off motivated us in studying a way for stream processing applications to leverage shared memory and boost the scale up (before the scale out) while adhering to the widely-adopted and SN-based APIs for stream processing applications.
We introduce STRETCH, a framework that maximizes the scale up and offers instantaneous elastic reconfigurations (without state transfer) for stream processing applications. We propose the concept of Virtual Shared-Nothing (VSN) parallelism and elasticity and provide formal definitions and correctness proofs for the semantics of the analysis tasks supported by STRETCH, showing they extend the ones found in common Stream Processing Engines. We also provide a fully implemented prototype and show that STRETCH's performance exceeds that of state-of-the-art frameworks such as Apache Flink and offers, to the best of our knowledge, unprecedented ultra-fast reconfigurations, taking less than 40 ms even when provisioning tens of new task instances.
Submitted 29 April, 2022; v1 submitted 25 November, 2021;
originally announced November 2021.
-
Consistent Lock-free Parallel Stochastic Gradient Descent for Fast and Stable Convergence
Authors:
Karl Bäckström,
Ivan Walulya,
Marina Papatriantafilou,
Philippas Tsigas
Abstract:
Stochastic gradient descent (SGD) is an essential element in Machine Learning (ML) algorithms. Asynchronous parallel shared-memory SGD (AsyncSGD), including synchronization-free algorithms, e.g. HOGWILD!, have received interest in certain contexts, due to reduced overhead compared to synchronous parallelization. Despite that they induce staleness and inconsistency, they have shown speedup for problems satisfying smooth, strongly convex targets, and gradient sparsity. Recent works take important steps towards understanding the potential of parallel SGD for problems not conforming to these strong assumptions, in particular for deep learning (DL). There is however a gap in current literature in understanding when AsyncSGD algorithms are useful in practice, and in particular how mechanisms for synchronization and consistency play a role. We focus on the impact of consistency-preserving non-blocking synchronization in SGD convergence, and in sensitivity to hyper-parameter tuning. We propose Leashed-SGD, an extensible algorithmic framework of consistency-preserving implementations of AsyncSGD, employing lock-free synchronization, effectively balancing throughput and latency. We argue analytically about the dynamics of the algorithms, memory consumption, the threads' progress over time, and the expected contention. We provide a comprehensive empirical evaluation, validating the analytical claims, benchmarking the proposed Leashed-SGD framework, and comparing to baselines for training multilayer perceptrons (MLP) and convolutional neural networks (CNN). We observe the crucial impact of contention, staleness and consistency and show how Leashed-SGD provides significant improvements in stability as well as wall-clock time to convergence (from 20-80% up to 4x improvements) compared to the standard lock-based AsyncSGD algorithm and HOGWILD!, while reducing the overall memory footprint.
Submitted 17 February, 2021;
originally announced February 2021.
-
Performance Modeling and Vertical Autoscaling of Stream Joins
Authors:
Hannaneh Najdataei,
Vincenzo Gulisano,
Alessandro V. Papadopoulos,
Ivan Walulya,
Marina Papatriantafilou,
Philippas Tsigas
Abstract:
Streaming analysis is widely used in cloud as well as edge infrastructures. In these contexts, fine-grained application performance can be based on accurate modeling of streaming operators. This is especially beneficial for computationally expensive operators like adaptive stream joins that, being very sensitive to rate-varying data streams, would otherwise require costly frequent monitoring.
We propose a dynamic model for the processing throughput and latency of adaptive stream joins that run with different parallelism degrees. The model is presented with progressive complexity, from a centralized non-deterministic up to a deterministic parallel stream join, describing how throughput and latency dynamics are influenced by various configuration parameters. The model is catalytic for understanding the behavior of stream joins against different system deployments, as we show with our model-based autoscaling methodology to change the parallelism degree of stream joins during the execution. Our thorough evaluation, across a broad spectrum of parameters, confirms the model can reliably predict throughput and latency metrics with fairly high accuracy, with the median error in estimation ranging from approximately 0.1% to 6.5%, even for an overloaded system. Furthermore, we show that our model allows efficient control of adaptive stream joins by estimating the needed resources solely based on the observed input load. In particular, we show it can be employed to enable efficient autoscaling, even when big changes in the input load happen frequently (in the realm of seconds).
Submitted 29 November, 2021; v1 submitted 11 May, 2020;
originally announced May 2020.
-
MindTheStep-AsyncPSGD: Adaptive Asynchronous Parallel Stochastic Gradient Descent
Authors:
Karl Bäckström,
Marina Papatriantafilou,
Philippas Tsigas
Abstract:
Stochastic Gradient Descent (SGD) is very useful in optimization problems with high-dimensional non-convex target functions, and hence constitutes an important component of several Machine Learning and Data Analytics methods. Recently there have been significant works on understanding the parallelism inherent to SGD, and its convergence properties. Asynchronous, parallel SGD (AsyncPSGD) has received particular attention, due to observed performance benefits. On the other hand, asynchrony implies inherent challenges in understanding the execution of the algorithm and its convergence, stemming from the fact that the contribution of a thread might be based on an old (stale) view of the state. In this work we aim to deepen the understanding of AsyncPSGD in order to increase the statistical efficiency in the presence of stale gradients. We propose new models for capturing the nature of the staleness distribution in a practical setting. Using the proposed models, we derive a staleness-adaptive SGD framework, MindTheStep-AsyncPSGD, for adapting the step size in an online-fashion, which provably reduces the negative impact of asynchrony. Moreover, we provide general convergence time bounds for a wide class of staleness-adaptive step size strategies for convex target functions. We also provide a detailed empirical study, showing how our approach implies faster convergence for deep learning applications.
Submitted 8 November, 2019;
originally announced November 2019.
-
Piecewise Linear Approximation in Data Streaming: Algorithmic Implementations and Experimental Analysis
Authors:
Romaric Duvignau,
Vincenzo Gulisano,
Marina Papatriantafilou,
Vladimir Savic
Abstract:
Piecewise Linear Approximation (PLA) is a well-established tool to reduce the size of the representation of time series by approximating the series by a sequence of line segments while keeping the error introduced by the approximation within some predetermined threshold. With the recent rise of edge computing, PLA algorithms find a completely new set of applications with the emphasis on reducing the volume of streamed data. In this study, we identify two scenarios set in a data-stream processing context: data reduction in sensor transmissions and datacenter storage. In connection to those scenarios, we identify several streaming metrics and propose streaming protocols as algorithmic implementations of several state-of-the-art PLA techniques. In an experimental evaluation, we measure the quality of the reviewed methods and protocols and evaluate their performance against those streaming statistics. All known methods have deficiencies when it comes to handling streaming-like data, e.g. inflation of the input stream, high latency or poor average error. Our experimental results highlight the challenges raised when transferring those classical methods into the stream processing world and present alternative techniques to overcome them and balance the related trade-offs.
Submitted 9 October, 2018; v1 submitted 27 August, 2018;
originally announced August 2018.
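As an illustration of what approximating a series by line segments within an error threshold can look like, the sketch below uses a simple greedy scheme: each segment starts at an input point and is extended while some slope keeps every covered point within `eps` of the line. This is an assumed, generic variant chosen for brevity, not necessarily one of the PLA techniques evaluated in the paper; the names `pla_greedy` and `eps` are hypothetical, and timestamps are assumed to be strictly increasing.

```python
def pla_greedy(points, eps):
    """Greedy piecewise linear approximation with maximum error eps (eps >= 0).

    points: list of (t, y) pairs with strictly increasing t.
    Returns a list of segments (t_start, y_start, slope, t_end); each segment
    approximates y as y_start + slope * (t - t_start) for t_start <= t <= t_end.
    """
    segments = []
    i, n = 0, len(points)
    while i < n:
        t0, y0 = points[i]
        lo, hi = float("-inf"), float("inf")  # range of slopes still feasible
        j = i + 1
        while j < n:
            t, y = points[j]
            dt = t - t0
            # Slopes that keep this point within eps of the line through (t0, y0).
            new_lo = max(lo, (y - eps - y0) / dt)
            new_hi = min(hi, (y + eps - y0) / dt)
            if new_lo > new_hi:  # no single slope fits all points: close segment
                break
            lo, hi = new_lo, new_hi
            j += 1
        # A segment holding only its start point gets an arbitrary slope of 0.
        slope = 0.0 if j == i + 1 else (lo + hi) / 2
        segments.append((t0, y0, slope, points[j - 1][0]))
        i = j
    return segments
```

In this sketch a segment is only emitted once a point that breaks it has arrived, so longer segments compress better but delay the output, which is one way to picture the latency trade-off the abstract mentions.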
-
Lisco: A Continuous Approach in LiDAR Point-cloud Clustering
Authors:
Hannaneh Najdataei,
Yiannis Nikolakopoulos,
Vincenzo Gulisano,
Marina Papatriantafilou
Abstract:
Light detection and ranging (LiDAR) technology makes it possible to sense surrounding objects with fine-grained resolution over large areas. The resulting data (aka point clouds), generated continuously at very high rates, can provide information to support automated functionality in cyberphysical systems. Clustering of point clouds is a key problem to extract this type of information. Methods for solving the problem in a continuous fashion can facilitate improved processing in e.g. fog architectures, allowing continuous, streaming processing of data close to the sources. We propose Lisco, a single-pass continuous Euclidean-distance-based clustering of LiDAR point clouds, that maximizes the granularity of the data processing pipeline. Besides its algorithmic analysis, we provide a thorough experimental evaluation and highlight its up to 3x improvements and its scalability benefits compared to the baseline, using both real-world datasets as well as synthetic ones to fully explore the worst cases.
Submitted 6 November, 2017;
originally announced November 2017.
-
Aiding Autonomous Vehicles with Fault-tolerant V2V Communication
Authors:
Vladimir Savic,
Elad M. Schiller,
Marina Papatriantafilou
Abstract:
Vehicle-to-vehicle (V2V) communication is a key component of future autonomous driving systems. V2V can provide an improved awareness of the surrounding environment, and knowledge about the future actions of nearby vehicles. However, V2V communication is subject to different kinds of failures and delays, so a distributed fault-tolerant approach is required for safe and efficient transportation. This work considers fully autonomous vehicles that operate using local sensory information, aided by fault-tolerant V2V communication. The sensors provide all basic functionality, but are overridden by V2V whenever possible to increase the efficiency. As an example scenario, we consider intersection crossing (IC) with autonomous vehicles that cooperate via V2V communication, and propose a fully distributed, fault-tolerant algorithm for this problem. Our numerical results, based on a real data set, show that the crossing delay is only slightly increased in the presence of a burst of V2V failures, and that V2V can be successfully used in most scenarios.
Submitted 29 September, 2017;
originally announced October 2017.
-
Distributed Algorithm for Collision Avoidance at Road Intersections in the Presence of Communication Failures
Authors:
Vladimir Savic,
Elad M. Schiller,
Marina Papatriantafilou
Abstract:
Vehicle-to-vehicle (V2V) communication is a crucial component of the future autonomous driving systems since it enables improved awareness of the surrounding environment, even without extensive processing of sensory information. However, V2V communication is prone to failures and delays, so a distributed fault-tolerant approach is required for safe and efficient transportation. In this paper, we focus on the intersection crossing (IC) problem with autonomous vehicles that cooperate via V2V communications, and propose a novel distributed IC algorithm that can handle an unknown number of communication failures. Our analysis shows that both safety and liveness requirements are satisfied in all realistic situations. We also found, based on a real data set, that the crossing delay is only slightly increased even in the presence of highly correlated failures.
Submitted 10 January, 2017;
originally announced January 2017.
-
Efficient data streaming multiway aggregation through concurrent algorithmic designs and new abstract data types
Authors:
Vincenzo Gulisano,
Yiannis Nikolakopoulos,
Daniel Cederman,
Marina Papatriantafilou,
Philippas Tsigas
Abstract:
Data streaming relies on continuous queries to process unbounded streams of data in a real-time fashion. It is commonly demanding in computation capacity, given that the relevant applications involve very large volumes of data. Data structures act as articulation points and maintain the state of data streaming operators, potentially supporting high parallelism and balancing the work between them. Prompted by this fact, in this work we study and analyze parallelization needs of these articulation points, focusing on the problem of streaming multiway aggregation, where large data volumes are received from multiple input streams. The analysis of the parallelization needs, as well as of the use and limitations of existing aggregate designs and their data structures, leads us to identify needs for proper shared objects that can achieve low-latency and high throughput multiway aggregation. We present the requirements of such objects as abstract data types and we provide efficient lock-free linearizable algorithmic implementations of them, along with new multiway aggregate algorithmic designs that leverage them, supporting both deterministic order-sensitive and order-insensitive aggregate functions. Furthermore, we point out future directions that open through these contributions. The paper includes an extensive experimental study, based on a variety of aggregation continuous queries on two large datasets extracted from SoundCloud, a music social network, and from a Smart Grid network. In all the experiments, the proposed data structures and the enhanced aggregate operators improved the processing performance significantly, up to one order of magnitude, in terms of both throughput and latency, over the commonly-used techniques based on queues.
Submitted 15 June, 2016;
originally announced June 2016.
-
Shared-object System Equilibria: Delay and Throughput Analysis
Authors:
Iosif Salem,
Elad M. Schiller,
Marina Papatriantafilou,
Philippas Tsigas
Abstract:
We consider shared-object systems that require their threads to fulfill the system jobs by first acquiring sequentially the objects needed for the jobs and then holding on to them until the job completion. Such systems are in the core of a variety of shared-resource allocation and synchronization systems. This work opens a new perspective to study the expected job delay and throughput analytically, given the possible set of jobs that may join the system dynamically.
We identify the system dependencies that cause contention among the threads as they try to acquire the job objects. We use these observations to define the shared-object system equilibria. We note that the system is in equilibrium whenever the rate in which jobs arrive at the system matches the job completion rate. These equilibria consider not only the job delay but also the job throughput, as well as the time in which each thread blocks other threads in order to complete its job. We then further study in detail the thread work cycles and, by using a graph representation of the problem, we are able to propose procedures for finding and estimating equilibria, i.e., discovering the job delay and throughput, as well as the blocking time.
To the best of our knowledge, this is a new perspective, that can provide better analytical tools for the problem, in order to estimate performance measures similar to ones that can be acquired through experimentation on working systems and simulations, e.g., as job delay and throughput in (distributed) shared-object systems.
Submitted 2 November, 2015; v1 submitted 7 August, 2015;
originally announced August 2015.
-
Lock-free Concurrent Data Structures
Authors:
Daniel Cederman,
Anders Gidenstam,
Phuong Ha,
Håkan Sundell,
Marina Papatriantafilou,
Philippas Tsigas
Abstract:
Concurrent data structures are the data sharing side of parallel programming. Data structures give the means to the program to store data, but also provide operations to the program to access and manipulate these data. These operations are implemented through algorithms that have to be efficient. In the sequential setting, data structures are crucially important for the performance of the respective computation. In the parallel programming setting, their importance becomes more crucial because of the increased use of data and resource sharing for utilizing parallelism.
The first and main goal of this chapter is to provide a sufficient background and intuition to help the interested reader to navigate in the complex research area of lock-free data structures. The second goal is to offer the programmer familiarity to the subject that will allow her to use truly concurrent methods.
Submitted 12 February, 2013;
originally announced February 2013.