-
ILX: Intelligent "Location+X" Data Systems (Vision Paper)
Authors:
Walid G. Aref,
Ahmed M. Aly,
Anas Daghistani,
Yeasir Rayhan,
Jianguo Wang,
Libin Zhou
Abstract:
Due to the ubiquity of mobile phones and location-detection devices, location data is being generated in very large volumes. Queries and operations that are performed on location data warrant the use of database systems. Despite that, location data is being supported in data systems as an afterthought. Typically, relational or NoSQL data systems that are mostly designed with non-location data in mind get extended with spatial or spatiotemporal indexes, some query operators, and higher-level syntactic sugar in order to support location data. The ubiquity of location data and location data services calls for systems that are solely designed and optimized for the efficient support of location data. This paper envisions designing intelligent location+X data systems, ILX for short, where location is treated as a first-class citizen type. ILX is tailored with location data as the main data type (location-first). Because location data is typically augmented with other data types X, e.g., graphs, text data, click streams, annotations, etc., ILX needs to be extensible to support other data types X along with location. This paper envisions the main features that ILX should support, and highlights research challenges in realizing and supporting ILX.
Submitted 1 August, 2022; v1 submitted 19 June, 2022;
originally announced June 2022.
-
STULL: Unbiased Online Sampling for Visual Exploration of Large Spatiotemporal Data
Authors:
Guizhen Wang,
Jingjing Guo,
Mingjie Tang,
José Florencio de Queiroz Neto,
Calvin Yau,
Anas Daghistani,
Morteza Karimzadeh,
Walid G. Aref,
David S. Ebert
Abstract:
Online sampling-supported visual analytics is increasingly important, as it allows users to explore large datasets with acceptable approximate answers at interactive rates. However, existing online spatiotemporal sampling techniques are often biased, as most researchers have primarily focused on reducing computational latency. Biased sampling approaches select data with unequal probabilities and produce results that do not match the exact data distribution, leading end users to incorrect interpretations. In this paper, we propose a novel approach to perform unbiased online sampling of large spatiotemporal data. The proposed approach ensures the same probability of selection for every point that satisfies the specifications of a user's multidimensional query. To achieve unbiased sampling for accurate representative interactive visualizations, we design a novel data index and an associated sample retrieval plan. Our proposed sampling approach is suitable for a wide variety of visual analytics tasks, e.g., tasks that run aggregate queries of spatiotemporal data. Extensive experiments confirm the superiority of our approach over a state-of-the-art spatial online sampling technique, demonstrating that within the same computational time, data samples generated by our approach are at least 50% more accurate in representing the actual spatial distribution of the data and enable approximate visualizations to present closer visual appearances to the exact ones.
Submitted 29 August, 2020;
originally announced August 2020.
-
A Security and Performance Driven Architecture for Cloud Data Centers
Authors:
Muhamad Felemban,
Anas Daghistani,
Yahya Javeed,
Jason Kobes,
Arif Ghafoor
Abstract:
With growing cyber-security threats, ensuring the security of data in Cloud data centers is a challenging task. A prominent type of attack on Cloud data centers is the data tampering attack, which can jeopardize the confidentiality and the integrity of data. In this article, we present a security and performance driven architecture for these centers that incorporates an intrusion management system for multi-tenant distributed transactional databases. The proposed architecture uses a novel data partitioning and placement scheme based on damage containment and communication cost of distributed transactions. In addition, we present a benchmarking framework for evaluating the performance of the proposed architecture. The results illustrate a trade-off between security and performance goals for Cloud data centers.
Submitted 27 March, 2020;
originally announced March 2020.
-
SWARM: Adaptive Load Balancing in Distributed Streaming Systems for Big Spatial Data
Authors:
Anas Daghistani,
Walid G. Aref,
Arif Ghafoor,
Ahmed R. Mahmood
Abstract:
The proliferation of GPS-enabled devices has led to the development of numerous location-based services. These services need to process massive amounts of spatial data in real-time. The current scale of spatial data cannot be handled using centralized systems. This has led to the development of distributed spatial streaming systems. Existing systems use static spatial partitioning to distribute the workload. In contrast, the real-time streamed spatial data follows non-uniform spatial distributions that are continuously changing over time. Distributed spatial streaming systems need to react to the changes in the distribution of spatial data and queries. This paper introduces SWARM, a light-weight adaptivity protocol that continuously monitors the data and query workloads across the distributed processes of the spatial data streaming system, and redistributes and rebalances the workloads as soon as performance bottlenecks are detected. SWARM is able to handle multiple query-execution and data-persistence models. A distributed streaming system can directly use SWARM to adaptively rebalance the system's workload among its machines with minimal changes to the original code of the underlying spatial application. Extensive experimental evaluation using real and synthetic datasets illustrates that, on average, SWARM achieves a 200% improvement over a static grid partitioning that is determined based on observing a limited history of the data and query workloads. Moreover, SWARM reduces execution latency by 4x on average compared with the other technique.
Submitted 26 February, 2020;
originally announced February 2020.
-
Adaptive Processing of Spatial-Keyword Data Over a Distributed Streaming Cluster
Authors:
Ahmed R. Mahmood,
Anas Daghistani,
Ahmed M. Aly,
Walid G. Aref,
Mingjie Tang,
Saleh Basalamah,
Sunil Prabhakar
Abstract:
The widespread use of GPS-enabled smartphones along with the popularity of micro-blogging and social networking applications, e.g., Twitter and Facebook, has resulted in the generation of huge streams of geo-tagged textual data. Many applications require real-time processing of these streams. For example, location-based e-coupon and ad-targeting systems enable advertisers to register millions of ads to millions of users. The number of users is typically very high and they are continuously moving, and the ads change frequently as well. Hence sending the right ad to the matching users is very challenging. Existing streaming systems are either centralized or are not spatial-keyword aware, and cannot efficiently support the processing of rapidly arriving spatial-keyword data streams. This paper presents Tornado, a distributed spatial-keyword stream processing system. Tornado features routing units to fairly distribute the workload, and furthermore, co-locate the data objects and the corresponding queries at the same processing units. The routing units use the Augmented-Grid, a novel structure that is equipped with an efficient search algorithm for distributing the data objects and queries. Tornado uses evaluators to process the data objects against the queries. The routing units minimize the redundant communication by not sending data updates for processing when these updates do not match any query. By applying dynamically evaluated cost formulae that continuously represent the processing overhead at each evaluator, Tornado is adaptive to changes in the workload. Extensive experimental evaluation using spatio-textual range queries over real Twitter data indicates that Tornado outperforms the non-spatio-textually aware approaches by up to two orders of magnitude in terms of the overall system throughput.
Submitted 8 September, 2017;
originally announced September 2017.