-
BiasBuster: a Neural Approach for Accurate Estimation of Population Statistics using Biased Location Data
Authors:
Sepanta Zeighami,
Cyrus Shahabi
Abstract:
While extremely useful (e.g., for COVID-19 forecasting and policy-making, urban mobility analysis and marketing, and obtaining business insights), location data collected from mobile devices often contain data from a biased population subset, with some communities over- or underrepresented in the collected datasets. As a result, aggregate statistics calculated from such datasets (as is done by various companies including Safegraph, Google, and Facebook), while ignoring the bias, lead to an inaccurate representation of population statistics. Such statistics will not only be generally inaccurate, but the error will disproportionately impact different population subgroups (e.g., because they ignore the underrepresented communities). This has dire consequences, as these datasets are used for sensitive decision-making such as COVID-19 policymaking. This paper tackles the problem of providing accurate population statistics using such biased datasets. We show that statistical debiasing, although in some cases useful, often fails to improve accuracy. We then propose BiasBuster, a neural network approach that utilizes the correlations between population statistics and location characteristics to provide accurate estimates of population statistics. Extensive experiments on real-world data show that BiasBuster improves accuracy by up to 2 times in general and up to 3 times for underrepresented populations.
Submitted 17 February, 2024;
originally announced February 2024.
-
On Distribution Dependent Sub-Logarithmic Query Time of Learned Indexing
Authors:
Sepanta Zeighami,
Cyrus Shahabi
Abstract:
A fundamental problem in data management is to find the elements in an array that match a query. Recently, learned indexes are being extensively used to solve this problem, where they learn a model to predict the location of the items in the array. They are empirically shown to outperform non-learned methods (e.g., B-trees or binary search that answer queries in $O(\log n)$ time) by orders of magnitude. However, the success of learned indexes has not been theoretically justified. The only existing attempt shows the same query time of $O(\log n)$, but with a constant factor improvement in space complexity over non-learned methods, under some assumptions on data distribution. In this paper, we significantly strengthen this result, showing that under mild assumptions on data distribution, and the same space complexity as non-learned methods, learned indexes can answer queries in $O(\log\log n)$ expected query time. We also show that allowing for slightly larger but still near-linear space overhead, a learned index can achieve $O(1)$ expected query time. Our results theoretically prove learned indexes are orders of magnitude faster than non-learned methods, theoretically grounding their empirical success.
Submitted 18 June, 2023;
originally announced June 2023.
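For readers unfamiliar with the idea, the general learned-index pattern described in the abstract above (a model predicts an item's position in a sorted array, and a bounded local search corrects the prediction) can be sketched as follows. This is only an illustrative sketch, not the construction analysed in the paper: the single least-squares line, the class name LearnedIndex and the helper fit_linear are assumptions made for the example, and the sketch does not by itself achieve the $O(\log\log n)$ bound.

```python
import bisect


def fit_linear(keys):
    """Least-squares line mapping a key to its position in the sorted array.

    Hypothetical helper for this sketch; real learned indexes typically
    use richer or hierarchical models.
    """
    n = len(keys)
    mean_key = sum(keys) / n
    mean_pos = (n - 1) / 2
    var = sum((k - mean_key) ** 2 for k in keys)
    if var == 0:
        # All keys identical: predict the middle position.
        return 0.0, mean_pos
    cov = sum((k - mean_key) * (i - mean_pos) for i, k in enumerate(keys))
    slope = cov / var
    return slope, mean_pos - slope * mean_key


class LearnedIndex:
    """Toy learned index: one linear model plus a bounded binary search."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("keys must be non-empty")
        self.keys = sorted(keys)
        self.slope, self.intercept = fit_linear(self.keys)
        # Largest gap between predicted and true position over stored keys;
        # it bounds the search window for any key present in the array.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        # Binary search restricted to the window around the prediction.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


if __name__ == "__main__":
    index = LearnedIndex([3, 8, 15, 16, 23, 42, 99])
    print(index.lookup(23), index.lookup(7))
```

The cost of a lookup in this sketch is one model evaluation plus a binary search over a window of at most 2 * max_err + 1 positions, so the better the model fits the data distribution, the smaller the window.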
-
NeuroSketch: Fast and Approximate Evaluation of Range Aggregate Queries with Neural Networks
Authors:
Sepanta Zeighami,
Cyrus Shahabi,
Vatsal Sharan
Abstract:
Range aggregate queries (RAQs) are an integral part of many real-world applications, where, often, fast and approximate answers for the queries are desired. Recent work has studied answering RAQs using machine learning (ML) models, where a model of the data is learned to answer the queries. However, there is no theoretical understanding of why and when the ML based approaches perform well. Furthermore, since the ML approaches model the data, they fail to capitalize on any query specific information to improve performance in practice. In this paper, we focus on modeling "queries" rather than data and train neural networks to learn the query answers. This change of focus allows us to theoretically study our ML approach to provide a distribution and query dependent error bound for neural networks when answering RAQs. We confirm our theoretical results by developing NeuroSketch, a neural network framework to answer RAQs in practice. An extensive experimental study on real-world, TPC-benchmark and synthetic datasets shows that NeuroSketch answers RAQs multiple orders of magnitude faster than state-of-the-art and with better accuracy.
Submitted 7 April, 2023; v1 submitted 19 November, 2022;
originally announced November 2022.
-
A Neural Approach to Spatio-Temporal Data Release with User-Level Differential Privacy
Authors:
Ritesh Ahuja,
Sepanta Zeighami,
Gabriel Ghinita,
Cyrus Shahabi
Abstract:
Several companies (e.g., Meta, Google) have initiated "data-for-good" projects where aggregate location data are first sanitized and released publicly, which is useful to many applications in transportation, public health (e.g., COVID-19 spread) and urban planning. Differential privacy (DP) is the protection model of choice to ensure the privacy of the individuals who generated the raw location data. However, current solutions fail to preserve data utility when each individual contributes multiple location reports (i.e., under user-level privacy). To offset this limitation, public releases by Meta and Google use high privacy budgets (e.g., $ε$=10-100), resulting in poor privacy. We propose a novel approach to release spatio-temporal data privately and accurately. We employ the pattern recognition power of neural networks, specifically variational auto-encoders (VAE), to reduce the noise introduced by DP mechanisms such that accuracy is increased, while the privacy requirement is still satisfied. Our extensive experimental evaluation on real datasets shows the clear superiority of our approach compared to benchmarks.
Submitted 20 August, 2022;
originally announced August 2022.
-
A Neural Database for Differentially Private Spatial Range Queries
Authors:
Sepanta Zeighami,
Ritesh Ahuja,
Gabriel Ghinita,
Cyrus Shahabi
Abstract:
Mobile apps and location-based services generate large amounts of location data that can benefit research on traffic optimization, context-aware notifications and public health (e.g., spread of contagious diseases). To preserve individual privacy, one must first sanitize location data, which is commonly done using the powerful differential privacy (DP) concept. However, existing solutions fall short of properly capturing density patterns and correlations that are intrinsic to spatial data, and as a result yield poor accuracy. We propose a machine-learning based approach for answering statistical queries on location data with DP guarantees. We focus on countering the main source of error that plagues existing approaches (namely, uniformity error), and we design a neural database system that models spatial datasets such that important density and correlation features present in the data are preserved, even when DP-compliant noise is added. We employ a set of neural networks that learn from diverse regions of the dataset and at varying granularities, leading to superior accuracy. We also devise a framework for effective system parameter tuning on top of public data, which helps practitioners set important system parameters without having to expend scarce privacy budget. Extensive experimental results on real datasets with heterogeneous characteristics show that our proposed approach significantly outperforms the state of the art.
Submitted 3 August, 2021;
originally announced August 2021.
-
NeuroDB: A Neural Network Framework for Answering Range Aggregate Queries and Beyond
Authors:
Sepanta Zeighami,
Cyrus Shahabi
Abstract:
Range aggregate queries (RAQs) are an integral part of many real-world applications, where, often, fast and approximate answers for the queries are desired. Recent work has studied answering RAQs using machine learning models, where a model of the data is learned to answer the queries. However, such modelling choices fail to utilize any query specific information. To capture such information, we observe that RAQs can be represented by query functions, which are functions that take a query instance (i.e., a specific RAQ) as an input and output its corresponding answer. Using this representation, we formulate the problem of learning to approximate the query function, and propose NeuroDB, a query specialized neural network framework, that answers RAQs efficiently. NeuroDB is query-type agnostic (i.e., it does not make any assumption about the underlying query type) and our observation that queries can be represented by functions is not specific to RAQs. Thus, we investigate whether NeuroDB can be used for other query types, by applying it to distance to nearest neighbour queries. We experimentally show that NeuroDB outperforms the state-of-the-art for this query type, often by orders of magnitude. Moreover, the same neural network architecture as for RAQs is used, bringing to light the possibility of using a generic framework to answer any query type efficiently.
Submitted 10 July, 2021;
originally announced July 2021.
-
Towards Accurate Spatiotemporal COVID-19 Risk Scores using High Resolution Real-World Mobility Data
Authors:
Sirisha Rambhatla,
Sepanta Zeighami,
Kameron Shahabi,
Cyrus Shahabi,
Yan Liu
Abstract:
As countries look towards re-opening of economic activities amidst the ongoing COVID-19 pandemic, ensuring public health has been challenging. While contact tracing only aims to track past activities of infected users, one path to safe reopening is to develop reliable spatiotemporal risk scores to indicate the propensity of the disease. Existing works which aim to develop risk scores either rely on compartmental model-based reproduction numbers (which assume uniform population mixing) or develop coarse-grain spatial scores based on reproduction number (R0) and macro-level density-based mobility statistics. Instead, in this paper, we develop a Hawkes process-based technique to assign relatively fine-grain spatial and temporal risk scores by leveraging high-resolution mobility data based on cell-phone originated location signals. While COVID-19 risk scores also depend on a number of factors specific to an individual, including demography and existing medical conditions, the primary mode of disease transmission is via physical proximity and contact. Therefore, we focus on developing risk scores based on location density and mobility behaviour. We demonstrate the efficacy of the developed risk scores via simulation based on real-world mobility data. Our results show that fine-grain spatiotemporal risk scores based on high-resolution mobility data can provide useful insights and facilitate safe re-opening.
Submitted 14 December, 2020;
originally announced December 2020.
-
Estimating Spread of Contact-Based Contagions in a Population Through Sub-Sampling
Authors:
Sepanta Zeighami,
Cyrus Shahabi,
John Krumm
Abstract:
Physical contacts result in the spread of various phenomena such as viruses, gossip, ideas, packages and marketing pamphlets across a population. The spread depends on how people move and co-locate with each other, or their mobility patterns. How far such phenomena spread has significance for both policy making and personal decision making, e.g., studying the spread of COVID-19 under different intervention strategies such as wearing a mask. In practice, mobility patterns of an entire population are never available, and we usually have access to location data of a subset of individuals. In this paper, we formalize and study the problem of estimating the spread of a phenomenon in a population, given that we only have access to sub-samples of location visits of some individuals in the population. We show that simple solutions such as estimating the spread in the sub-sample and scaling it to the population, or more sophisticated solutions that rely on modeling location visits of individuals do not perform well in practice, the former because it ignores contacts between unobserved individuals and sampled ones and the latter because it yields inaccurate modeling of co-locations. Instead, we directly model the co-locations between the individuals. We introduce PollSpreader and PollSusceptible, two novel approaches that model the co-locations between individuals using a contact network, and infer the properties of the contact network using the subsample to estimate the spread of the phenomenon in the entire population. We show that our estimates provide an upper bound and a lower bound on the spread of the disease in expectation. Finally, using a large high-resolution real-world mobility dataset, we experimentally show that our estimates are accurate, while other methods that do not correctly account for co-locations between individuals result in wrong observations (e.g., premature herd-immunity).
Submitted 13 December, 2020;
originally announced December 2020.
-
Bridging the Gap Between Theory and Practice on Insertion-Intensive Database
Authors:
Sepanta Zeighami,
Raymond Chi-Wing Wong
Abstract:
With the prevalence of online platforms, today, data is being generated and accessed by users at a very high rate. Besides, applications such as stock trading or high frequency trading require guaranteed low delays for performing an operation on a database. It is consequential to design databases that guarantee data insertion and query at a consistently high rate without introducing any long delay during insertion. In this paper, we propose Nested B-trees (NB-trees), an index that can achieve a consistently high insertion rate on large volumes of data, while providing asymptotically optimal query performance that is very efficient in practice. Nested B-trees support insertions at rates higher than LSM-trees, the state-of-the-art index for insertion-intensive workloads, while avoiding their long insertion delays and improving on their query performance. They approach the query performance of B-trees when complemented with Bloom filters. In our experiments, NB-trees had worst-case delays up to 1000 times smaller than LevelDB, RocksDB and bLSM, commonly used LSM-tree data-stores, could perform queries more than 4 times faster than LevelDB and 1.5 times faster than bLSM and RocksDB, while also outperforming them in terms of average insertion rate.
Submitted 2 March, 2020;
originally announced March 2020.
-
Dynamic Skyline Queries on Encrypted Data Using Result Materialization
Authors:
Sepanta Zeighami,
Gabriel Ghinita,
Cyrus Shahabi
Abstract:
Skyline computation is an increasingly popular query, with broad applicability in domains such as healthcare, travel and finance. Given the recent trend to outsource databases and query evaluation, and due to the proprietary and sometimes highly sensitive nature of the data (e.g., in healthcare), it is essential to evaluate skylines on encrypted datasets. Several research efforts acknowledged the importance of secure skyline computation, but existing solutions suffer from at least one of the following shortcomings: (i) they only provide ad-hoc security; (ii) they are prohibitively expensive; or (iii) they rely on unrealistic assumptions, such as the presence of multiple non-colluding parties in the protocol.
Inspired by solutions for secure nearest-neighbors (NN) computation, we conjecture that the most secure and efficient way to compute skylines is through result materialization. However, this approach is significantly more challenging for skylines than for NN queries. We exhaustively study and provide algorithms for pre-computation of skyline results, and we perform an in-depth theoretical analysis of this process. We show that pre-computing results while minimizing storage overhead is NP-hard, and we provide dynamic programming and greedy heuristics that solve the problem more efficiently, while maintaining storage at reasonable levels. Our algorithms are novel and applicable to plain-text skyline computation, but we focus on the encrypted setting where materialization reduces the cost of skyline computation from hours to seconds. Extensive experiments show that we clearly outperform existing work in terms of performance, and our security analysis proves that we obtain a smaller (and quantifiable) data leakage than competitors.
Submitted 28 February, 2020;
originally announced March 2020.
-
Finding Average Regret Ratio Minimizing Set in Database
Authors:
Sepanta Zeighami,
Raymond Chi-Wing Wong
Abstract:
Selecting a certain number of data points (or records) from a database which "best" satisfy users' expectations is a very prevalent problem with many applications. One application is a hotel booking website showing a certain number of hotels on a single page. However, this problem is very challenging since the selected points should "collectively" satisfy the expectation of all users. Showing a certain number of data points to a single user could decrease the satisfaction of a user because the user may not be able to see his/her favorite point which could be found in the original database. In this paper, we would like to find a set of k points such that on average, the satisfaction (ratio) of a user is maximized. This problem takes into account the probability distribution of the users and considers the satisfaction (ratio) of all users, which is more reasonable in practice, compared with the existing studies that only consider the worst-case satisfaction (ratio) of the users, which may not reflect the whole population and is not useful in some applications. Motivated by this, in this paper, we propose algorithms for this problem. Finally, we conducted experiments to show the effectiveness and the efficiency of the algorithms.
Submitted 18 October, 2018;
originally announced October 2018.