-
CLIMBER++: Pivot-Based Approximate Similarity Search over Big Data Series
Authors:
Liang Zhang,
Mohamed Y. Eltabakh,
Elke A. Rundensteiner,
Khalid Alnuaim
Abstract:
The generation and collection of big data series are becoming an integral part of many emerging applications in the sciences, IoT, finance, and web applications, among several others. The terabyte scale of data series has motivated recent efforts to design fully distributed techniques for supporting operations such as approximate kNN similarity search, which is a building-block operation in most analytics services on data series. Unfortunately, these techniques are heavily geared towards achieving scalability at the cost of sacrificing the results' accuracy. State-of-the-art systems report accuracy below 10% and 40%, respectively, which is not practical for many real-world applications. In this paper, we investigate the root problems in these existing techniques that limit their ability to achieve a better trade-off between scalability and accuracy. Then, we propose a framework, called CLIMBER, that encompasses a novel feature extraction mechanism, indexing scheme, and query processing algorithms for supporting approximate similarity search in big data series. For CLIMBER, we propose a new loss-resistant dual representation composed of rank-sensitive and ranking-insensitive signatures capturing data series objects. Based on this representation, we devise a distributed two-level index structure supported by an efficient data partitioning scheme. Our similarity metrics tailored for this dual representation enable meaningful comparison and distance evaluation between the rank-sensitive and ranking-insensitive signatures. Finally, we propose two efficient query processing algorithms, CLIMBER-kNN and CLIMBER-kNN-Adaptive, for answering approximate kNN similarity queries. Our experimental study on real-world and benchmark datasets demonstrates that CLIMBER, unlike existing techniques, features results' accuracy above 80% while retaining the desired scalability to terabytes of data.
Submitted 15 April, 2024;
originally announced April 2024.
-
To Share, or not to Share Online Event Trend Aggregation Over Bursty Event Streams
Authors:
Olga Poppe,
Chuan Lei,
Lei Ma,
Allison Rozet,
Elke A. Rundensteiner
Abstract:
Complex event processing (CEP) systems continuously evaluate large workloads of pattern queries under tight time constraints. Event trend aggregation queries with Kleene patterns are commonly used to retrieve summarized insights about the recent trends in event streams. State-of-the-art methods are limited either by repetitive computations or by unnecessary trend construction. Existing shared approaches are guided by statically selected and hence rigid sharing plans that are often sub-optimal under stream fluctuations. In this work, we propose a novel framework, Hamlet, that is the first to overcome these limitations. Hamlet introduces two key innovations. First, Hamlet adaptively decides whether to share or not to share computations depending on the current stream properties at run time to harvest the maximum sharing benefit. Second, Hamlet is equipped with a highly efficient shared trend aggregation strategy that avoids trend construction. Our experimental study on both real and synthetic data sets demonstrates that Hamlet consistently reduces query latency by up to five orders of magnitude compared to the state-of-the-art approaches.
Submitted 3 March, 2021; v1 submitted 1 January, 2021;
originally announced January 2021.
-
Sharon: Shared Online Event Sequence Aggregation
Authors:
Olga Poppe,
Allison Rozet,
Chuan Lei,
Elke A. Rundensteiner,
David Maier
Abstract:
Streaming systems evaluate massive workloads of event sequence aggregation queries. State-of-the-art approaches suffer from long delays caused by not sharing intermediate results of similar queries and by constructing event sequences prior to their aggregation. To overcome these limitations, our Shared Online Event Sequence Aggregation (Sharon) approach shares intermediate aggregates among multiple queries while avoiding the expensive construction of event sequences. Our Sharon optimizer faces two challenges. One, a sharing decision is not always beneficial. Two, a sharing decision may exclude other sharing opportunities. To guide our Sharon optimizer, we compactly encode sharing candidates, their benefits, and conflicts among candidates into the Sharon graph. Based on the graph, we map our problem of finding an optimal sharing plan to the Maximum Weight Independent Set (MWIS) problem. We then use the guaranteed weight of a greedy algorithm for the MWIS problem to prune the search of our sharing plan finder without sacrificing its optimality. The Sharon optimizer is shown to produce sharing plans that achieve up to an 18-fold speed-up compared to state-of-the-art approaches.
Submitted 6 October, 2020;
originally announced October 2020.
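The Sharon abstract above mentions a greedy algorithm for the Maximum Weight Independent Set (MWIS) problem without naming it. As a rough illustration only, the sketch below uses one well-known greedy heuristic (repeatedly pick the vertex with the largest weight divided by remaining degree plus one, then discard its neighbors). The choice of heuristic, the function name `greedy_mwis`, and the toy graph are assumptions made for this example, not details taken from the paper.

```python
def greedy_mwis(weights, edges):
    """Greedy heuristic for Maximum Weight Independent Set.

    weights: dict mapping vertex -> non-negative weight (hypothetical input format)
    edges:   iterable of (u, v) pairs marking conflicting vertices
    Returns a set of pairwise non-adjacent vertices.
    """
    # Build an adjacency map for the conflict graph.
    adj = {v: set() for v in weights}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    remaining = set(weights)
    chosen = set()
    while remaining:
        # Pick the vertex with the best weight / (remaining degree + 1) ratio.
        # Sorting first makes tie-breaking deterministic.
        best = max(
            sorted(remaining),
            key=lambda v: weights[v] / (len(adj[v] & remaining) + 1),
        )
        chosen.add(best)
        # Remove the chosen vertex and everything that conflicts with it.
        remaining -= adj[best] | {best}
    return chosen


# Toy conflict graph: candidate "b" conflicts with both "a" and "c".
plan = greedy_mwis({"a": 3, "b": 4, "c": 3}, [("a", "b"), ("b", "c")])
```

In a sharing-plan setting, vertices would stand for sharing candidates, weights for their benefits, and edges for conflicts; the total weight of the greedy result gives a lower bound that a search procedure could use for pruning.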
-
GRETA: Graph-based Real-time Event Trend Aggregation
Authors:
Olga Poppe,
Chuan Lei,
Elke A. Rundensteiner,
David Maier
Abstract:
Streaming applications from algorithmic trading to traffic management deploy Kleene patterns to detect and aggregate arbitrarily long event sequences, called event trends. State-of-the-art systems process such queries in two steps. Namely, they first construct all trends and then aggregate them. Due to the exponential costs of trend construction, this two-step approach suffers from both long delays and high memory costs. To overcome these limitations, we propose the Graph-based Real-time Event Trend Aggregation (Greta) approach that dynamically computes event trend aggregation without first constructing these trends. We define the Greta graph to compactly encode all trends. Our Greta runtime incrementally maintains the graph, while dynamically propagating aggregates along its edges. Based on the graph, the final aggregate is incrementally updated and instantaneously returned at the end of each query window. Our Greta runtime represents a win-win solution, reducing both the time complexity from exponential to quadratic and the space complexity from exponential to linear in the number of events. Our experiments demonstrate that Greta achieves up to four orders of magnitude speed-up and up to 50-fold memory reduction compared to the state-of-the-art two-step approaches.
Submitted 6 October, 2020;
originally announced October 2020.
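The idea in the Greta abstract above, aggregating trends by propagating values along graph edges instead of enumerating the trends, can be illustrated with a much simplified sketch. It counts the trends matched by a single Kleene pattern over one event type: each event stores the number of trends that end at it, computed from its compatible predecessors. The function name `count_trends`, the `compatible` predicate, and the use of plain numbers as events are assumptions for this illustration; windows, multiple event types, and other aggregates are left out, and this is not the authors' implementation.

```python
from typing import Callable, Sequence


def count_trends(
    events: Sequence[float],
    compatible: Callable[[float, float], bool] = lambda prev, cur: True,
) -> int:
    """Count all non-empty trends without constructing any of them.

    events:     event values in arrival order (hypothetical representation)
    compatible: True if `cur` may directly follow `prev` within a trend
    """
    counts = []  # counts[i] = number of trends ending at events[i]
    total = 0
    for i, cur in enumerate(events):
        c = 1  # the trend consisting of this event alone
        for j in range(i):
            # Edge from an earlier compatible event: extend all its trends.
            if compatible(events[j], cur):
                c += counts[j]
        counts.append(c)
        total += c  # final aggregate is updated incrementally
    return total


# Trends where each event's value exceeds that of the previous event.
rising = count_trends([1, 3, 2, 4], lambda prev, cur: prev < cur)
# No predicate: every non-empty subsequence of four events is a trend.
unconstrained = count_trends([0, 0, 0, 0])
```

The nested loop visits each pair of events once, so the work is quadratic in the number of events, and only one counter per event is stored, even though the number of trends itself can grow exponentially.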
-
Event Trend Aggregation Under Rich Event Matching Semantics
Authors:
Olga Poppe,
Chuan Lei,
Elke A. Rundensteiner,
David Maier
Abstract:
Streaming applications from health care analytics to algorithmic trading deploy Kleene queries to detect and aggregate event trends. Rich event matching semantics determine how to compose events into trends. The expressive power of state-of-the-art systems remains limited in that they do not support the rich variety of these semantics. Worse yet, they suffer from long delays and high memory costs because they opt to maintain aggregates at a fine granularity. To overcome these limitations, our Coarse-Grained Event Trend Aggregation (Cogra) approach supports this rich diversity of event matching semantics within one system. Better yet, Cogra incrementally maintains aggregates at the coarsest granularity possible for each of these semantics. In this way, Cogra minimizes the number of aggregates, reducing both time and space complexity. Our experiments demonstrate that Cogra achieves up to four orders of magnitude speed-up and up to eight orders of magnitude memory reduction compared to state-of-the-art approaches.
Submitted 6 October, 2020;
originally announced October 2020.
-
Summarization and Matching of Density-Based Clusters in Streaming Environments
Authors:
Di Yang,
Elke A. Rundensteiner,
Matthew O. Ward
Abstract:
Density-based cluster mining is known to serve a broad range of applications, from stock trade analysis to moving object monitoring. Although methods for efficient extraction of density-based clusters have been studied in the literature, the problem of summarizing and matching such clusters with arbitrary shapes and complex cluster structures remains unsolved. Therefore, the goal of our work is to extend the state of the art of density-based cluster mining in streams from cluster extraction only to also support analysis and management of the extracted clusters. Our work solves three major technical challenges. First, we propose a novel multi-resolution cluster summarization method, called Skeletal Grid Summarization (SGS), which captures the key features of density-based clusters, covering both their external shape and internal cluster structures. Second, in order to summarize the extracted clusters in real time, we present an integrated computation strategy, C-SGS, which piggybacks the generation of cluster summarizations within the online clustering process. Lastly, we design a mechanism to efficiently execute cluster matching queries, which identify clusters similar to a given cluster of interest to the analyst among clusters extracted earlier in the stream history. Our experimental study using real streaming data shows the clear superiority of our proposed methods over other potential alternatives in both efficiency and effectiveness for cluster summarization and cluster matching queries.
Submitted 30 October, 2011;
originally announced October 2011.