On the Semantic Overlap of Operators in Stream Processing Engines

Authors: Vincenzo Gulisano, Alessandro Margara, Marina Papatriantafilou

Abstract: Stream processing is extensively used in the IoT-to-Cloud spectrum to distill information from continuous streams of data. Streaming applications usually run in dedicated Stream Processing Engines (SPEs) that adopt the DataFlow model, which defines such applications as graphs of operators that, step by step, transform data into the desired results. As operators can be deployed and executed independently, the DataFlow model supports parallelism and distribution, thus making streaming applications scalable. Today, we witness an abundance of SPEs, each with its set of operators. In this context, understanding how operators' semantics overlap within and across SPEs, and thus which SPEs can support a given application, is not trivial. We tackle this problem by formally showing that common operators of SPEs can be expressed as compositions of a single, minimalistic Aggregate operator, thus showing any framework able to run compositions of such an operator can run applications defined for state-of-the-art SPEs. The Aggregate operator only relies on core concepts of the DataFlow model such as data partitioning by key and time-based windows, and can only output up to one value for each window it analyzes. Together with our formal argumentation, we empirically assess how an SPE that only relies on such an operator compares with an SPE offering operator-specific implementations, as well as study the performance impact of a more expressive Aggregate operator by relaxing the constraint of outputting up to one value per window. The existence of such a common denominator not only implies the portability of operators within and across SPEs but also defines a concise set of requirements for other data processing frameworks to support streaming applications.

Submitted 1 March, 2023; originally announced March 2023.

arXiv:2112.11286 [pdf, other]

Geographical Peer Matching for P2P Energy Sharing

Authors: Romaric Duvignau, Vincenzo Gulisano, Marina Papatriantafilou, Ralf Klasing

Abstract: Significant cost reductions attract ever more households to invest in small-scale renewable electricity generation and storage. Such distributed resources are not used in the most effective way when only used individually, as sharing them provides even greater cost savings. Energy Peer-to-Peer (P2P) systems have thus been shown to be beneficial for prosumers and consumers through reductions in energy cost while also being attractive to grid or service providers. However, many practical challenges have to be overcome before all players could gain in having efficient and automated local energy communities; such challenges include the inherent complexity of matching together geographically distributed peers and the significant computations required to calculate the local matching preferences. Hence dedicated algorithms are required to be able to perform a cost-efficient matching of thousands of peers in a computational-efficient fashion. We define and analyze in this work a precise mathematical modelling of the geographical peer matching problem and several heuristics solving it. Our experimental study, based on real-world energy data, demonstrates that our solutions are efficient both in terms of cost savings achieved by the peers and in terms of communication and computing requirements. Our scalable algorithms thus provide one core building block for practical and data-efficient peer-to-peer energy sharing communities within large-scale optimization systems.

Submitted 24 January, 2024; v1 submitted 21 December, 2021; originally announced December 2021.

arXiv:2111.13058 [pdf, other]

STRETCH: Virtual Shared-Nothing Parallelism for Scalable and Elastic Stream Processing

Authors: Vincenzo Gulisano, Hannaneh Najdataei, Yiannis Nikolakopoulos, Alessandro V. Papadopoulos, Marina Papatriantafilou, Philippas Tsigas

Abstract: Stream processing applications extract value from raw data through Directed Acyclic Graphs of data analysis tasks. Shared-nothing (SN) parallelism is the de-facto standard to scale stream processing applications. Given an application, SN parallelism instantiates several copies of each analysis task, making each instance responsible for a dedicated portion of the overall analysis, and relies on dedicated queues to exchange data among connected instances. On the one hand, SN parallelism can scale the execution of applications both up and out since threads can run task instances within and across processes/nodes. On the other hand, its lack of sharing can cause unnecessary overheads and hinder the scaling up when threads operate on data that could be jointly accessed in shared memory. This trade-off motivated us in studying a way for stream processing applications to leverage shared memory and boost the scale up (before the scale out) while adhering to the widely-adopted and SN-based APIs for stream processing applications. We introduce STRETCH, a framework that maximizes the scale up and offers instantaneous elastic reconfigurations (without state transfer) for stream processing applications. We propose the concept of Virtual Shared-Nothing (VSN) parallelism and elasticity and provide formal definitions and correctness proofs for the semantics of the analysis tasks supported by STRETCH, showing they extend the ones found in common Stream Processing Engines. We also provide a fully implemented prototype and show that STRETCH's performance exceeds that of state-of-the-art frameworks such as Apache Flink and offers, to the best of our knowledge, unprecedented ultra-fast reconfigurations, taking less than 40 ms even when provisioning tens of new task instances.

Submitted 29 April, 2022; v1 submitted 25 November, 2021; originally announced November 2021.

arXiv:2005.04935

Performance Modeling and Vertical Autoscaling of Stream Joins

Authors: Hannaneh Najdataei, Vincenzo Gulisano, Alessandro V. Papadopoulos, Ivan Walulya, Marina Papatriantafilou, Philippas Tsigas

Abstract: Streaming analysis is widely used in cloud as well as edge infrastructures. In these contexts, fine-grained application performance can be based on accurate modeling of streaming operators. This is especially beneficial for computationally expensive operators like adaptive stream joins that, being very sensitive to rate-varying data streams, would otherwise require costly frequent monitoring. We propose a dynamic model for the processing throughput and latency of adaptive stream joins that run with different parallelism degrees. The model is presented with progressive complexity, from a centralized non-deterministic up to a deterministic parallel stream join, describing how throughput and latency dynamics are influenced by various configuration parameters. The model is catalytic for understanding the behavior of stream joins against different system deployments, as we show with our model-based autoscaling methodology to change the parallelism degree of stream joins during the execution. Our thorough evaluation, for a broad spectrum of parameters, confirms the model can reliably predict throughput and latency metrics with a fairly high accuracy, with the median error in estimation ranging from approximately 0.1% to 6.5%, even for an overloaded system. Furthermore, we show that our model allows to efficiently control adaptive stream joins by estimating the needed resources solely based on the observed input load. In particular, we show it can be employed to enable efficient autoscaling, even when big changes in the input load happen frequently (in the realm of seconds).

Submitted 29 November, 2021; v1 submitted 11 May, 2020; originally announced May 2020.

Comments: part of the experimental campaign has been included in a different contribution

arXiv:1808.08877 [pdf, ps, other]

Piecewise Linear Approximation in Data Streaming: Algorithmic Implementations and Experimental Analysis

Authors: Romaric Duvignau, Vincenzo Gulisano, Marina Papatriantafilou, Vladimir Savic

Abstract: Piecewise Linear Approximation (PLA) is a well-established tool to reduce the size of the representation of time series by approximating the series by a sequence of line segments while keeping the error introduced by the approximation within some predetermined threshold. With the recent rise of edge computing, PLA algorithms find a complete new set of applications with the emphasis on reducing the volume of streamed data. In this study, we identify two scenarios set in a data-stream processing context: data reduction in sensor transmissions and datacenter storage. In connection to those scenarios, we identify several streaming metrics and propose streaming protocols as algorithmic implementations of several state of the art PLA techniques. In an experimental evaluation, we measure the quality of the reviewed methods and protocols and evaluate their performance against those streaming statistics. All known methods have deficiencies when it comes to handling streaming-like data, e.g. inflation of the input stream, high latency or poor average error. Our experimental results highlight the challenges raised when transferring those classical methods into the stream processing world and present alternative techniques to overcome them and balance the related trade-offs.

Submitted 9 October, 2018; v1 submitted 27 August, 2018; originally announced August 2018.

Comments: 12 pages+1 for references, 16 figures, 3 tables

arXiv:1711.01853 [pdf, other]

Lisco: A Continuous Approach in LiDAR Point-cloud Clustering

Authors: Hannaneh Najdataei, Yiannis Nikolakopoulos, Vincenzo Gulisano, Marina Papatriantafilou

Abstract: The light detection and ranging (LiDAR) technology allows to sense surrounding objects with fine-grained resolution in large areas. Their data (aka point clouds), generated continuously at very high rates, can provide information to support automated functionality in cyberphysical systems. Clustering of point clouds is a key problem to extract this type of information. Methods for solving the problem in a continuous fashion can facilitate improved processing in e.g. fog architectures, allowing continuous, streaming processing of data close to the sources. We propose Lisco, a single-pass continuous Euclidean-distance-based clustering of LiDAR point clouds, that maximizes the granularity of the data processing pipeline. Besides its algorithmic analysis, we provide a thorough experimental evaluation and highlight its up to 3x improvements and its scalability benefits compared to the baseline, using both real-world datasets as well as synthetic ones to fully explore the worst-cases.

Submitted 6 November, 2017; originally announced November 2017.

arXiv:1606.04746 [pdf, other]

Efficient data streaming multiway aggregation through concurrent algorithmic designs and new abstract data types

Authors: Vincenzo Gulisano, Yiannis Nikolakopoulos, Daniel Cederman, Marina Papatriantafilou, Philippas Tsigas

Abstract: Data streaming relies on continuous queries to process unbounded streams of data in a real-time fashion. It is commonly demanding in computation capacity, given that the relevant applications involve very large volumes of data. Data structures act as articulation points and maintain the state of data streaming operators, potentially supporting high parallelism and balancing the work between them. Prompted by this fact, in this work we study and analyze parallelization needs of these articulation points, focusing on the problem of streaming multiway aggregation, where large data volumes are received from multiple input streams. The analysis of the parallelization needs, as well as of the use and limitations of existing aggregate designs and their data structures, leads us to identify needs for proper shared objects that can achieve low-latency and high throughput multiway aggregation. We present the requirements of such objects as abstract data types and we provide efficient lock-free linearizable algorithmic implementations of them, along with new multiway aggregate algorithmic designs that leverage them, supporting both deterministic order-sensitive and order-insensitive aggregate functions. Furthermore, we point out future directions that open through these contributions. The paper includes an extensive experimental study, based on a variety of aggregation continuous queries on two large datasets extracted from SoundCloud, a music social network, and from a Smart Grid network. In all the experiments, the proposed data structures and the enhanced aggregate operators improved the processing performance significantly, up to one order of magnitude, in terms of both throughput and latency, over the commonly-used techniques based on queues.

Submitted 15 June, 2016; originally announced June 2016.

Showing 1–7 of 7 results for author: Gulisano, V