-
BiasBuster: a Neural Approach for Accurate Estimation of Population Statistics using Biased Location Data
Authors:
Sepanta Zeighami,
Cyrus Shahabi
Abstract:
While extremely useful (e.g., for COVID-19 forecasting and policy-making, urban mobility analysis and marketing, and obtaining business insights), location data collected from mobile devices often contain data from a biased population subset, with some communities over- or underrepresented in the collected datasets. As a result, aggregate statistics calculated from such datasets (as is done by various companies including Safegraph, Google, and Facebook), while ignoring the bias, lead to an inaccurate representation of population statistics. Such statistics will not only be generally inaccurate, but the error will disproportionately impact different population subgroups (e.g., because they ignore the underrepresented communities). This has dire consequences, as these datasets are used for sensitive decision-making such as COVID-19 policymaking. This paper tackles the problem of providing accurate population statistics using such biased datasets. We show that statistical debiasing, although in some cases useful, often fails to improve accuracy. We then propose BiasBuster, a neural network approach that utilizes the correlations between population statistics and location characteristics to provide accurate estimates of population statistics. Extensive experiments on real-world data show that BiasBuster improves accuracy by up to 2 times in general and up to 3 times for underrepresented populations.
Submitted 17 February, 2024;
originally announced February 2024.
-
On Distribution Dependent Sub-Logarithmic Query Time of Learned Indexing
Authors:
Sepanta Zeighami,
Cyrus Shahabi
Abstract:
A fundamental problem in data management is to find the elements in an array that match a query. Recently, learned indexes have been used extensively to solve this problem, where they learn a model to predict the location of the items in the array. They are empirically shown to outperform non-learned methods (e.g., B-trees or binary search that answer queries in $O(\log n)$ time) by orders of magnitude. However, the success of learned indexes has not been theoretically justified. The only existing attempt shows the same query time of $O(\log n)$, but with a constant factor improvement in space complexity over non-learned methods, under some assumptions on data distribution. In this paper, we significantly strengthen this result, showing that under mild assumptions on data distribution, and the same space complexity as non-learned methods, learned indexes can answer queries in $O(\log\log n)$ expected query time. We also show that allowing for slightly larger but still near-linear space overhead, a learned index can achieve $O(1)$ expected query time. Our results theoretically prove learned indexes are orders of magnitude faster than non-learned methods, theoretically grounding their empirical success.
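The idea of a model that predicts an item's location in a sorted array can be illustrated with a minimal sketch. It is not the construction analyzed in the paper: the class name `LinearLearnedIndex`, the single linear model, and the error-bounded binary search are all assumptions chosen for illustration, and the sketch assumes a non-empty list of numeric keys.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: one linear model plus a bounded local search.

    Hypothetical illustration only; assumes a non-empty list of numeric keys.
    """

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # Linear model mapping key -> estimated position in the sorted array.
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # Worst prediction error over the stored keys; lookups search only
        # within this distance of the predicted position.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the index of key in the sorted array, or -1 if absent."""
        guess = self._predict(key)
        left = max(guess - self.max_err, 0)
        right = min(guess + self.max_err + 1, len(self.keys))
        # Binary search restricted to the error window around the guess.
        i = bisect.bisect_left(self.keys, key, left, right)
        return i if i < len(self.keys) and self.keys[i] == key else -1
```

The search cost depends on the model's maximum error rather than on the array length, which is why the quality of the fit to the data distribution matters.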
Submitted 18 June, 2023;
originally announced June 2023.
-
NeuroSketch: Fast and Approximate Evaluation of Range Aggregate Queries with Neural Networks
Authors:
Sepanta Zeighami,
Cyrus Shahabi,
Vatsal Sharan
Abstract:
Range aggregate queries (RAQs) are an integral part of many real-world applications, where, often, fast and approximate answers for the queries are desired. Recent work has studied answering RAQs using machine learning (ML) models, where a model of the data is learned to answer the queries. However, there is no theoretical understanding of why and when the ML based approaches perform well. Furthermore, since the ML approaches model the data, they fail to capitalize on any query specific information to improve performance in practice. In this paper, we focus on modeling "queries" rather than data and train neural networks to learn the query answers. This change of focus allows us to theoretically study our ML approach to provide a distribution and query dependent error bound for neural networks when answering RAQs. We confirm our theoretical results by developing NeuroSketch, a neural network framework to answer RAQs in practice. An extensive experimental study on real-world, TPC-benchmark and synthetic datasets shows that NeuroSketch answers RAQs multiple orders of magnitude faster than state-of-the-art and with better accuracy.
Submitted 7 April, 2023; v1 submitted 19 November, 2022;
originally announced November 2022.
-
A Neural Approach to Spatio-Temporal Data Release with User-Level Differential Privacy
Authors:
Ritesh Ahuja,
Sepanta Zeighami,
Gabriel Ghinita,
Cyrus Shahabi
Abstract:
Several companies (e.g., Meta, Google) have initiated "data-for-good" projects where aggregate location data are first sanitized and released publicly, which is useful to many applications in transportation, public health (e.g., COVID-19 spread) and urban planning. Differential privacy (DP) is the protection model of choice to ensure the privacy of the individuals who generated the raw location data. However, current solutions fail to preserve data utility when each individual contributes multiple location reports (i.e., under user-level privacy). To offset this limitation, public releases by Meta and Google use high privacy budgets (e.g., $ε$=10-100), resulting in poor privacy. We propose a novel approach to release spatio-temporal data privately and accurately. We employ the pattern recognition power of neural networks, specifically variational auto-encoders (VAE), to reduce the noise introduced by DP mechanisms such that accuracy is increased, while the privacy requirement is still satisfied. Our extensive experimental evaluation on real datasets shows the clear superiority of our approach compared to benchmarks.
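To see why user-level privacy costs accuracy, consider a minimal sketch of the standard Laplace mechanism applied to a location histogram. This is a generic baseline, not the VAE-based approach proposed in the paper. The function names and the parameter `max_reports_per_user` are hypothetical, and the sketch assumes the counts were computed after capping each user's contribution at that many reports.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(counts, epsilon, max_reports_per_user, rng=None):
    """Add Laplace noise calibrated to user-level sensitivity.

    Assumes each user contributes at most `max_reports_per_user` reports
    to `counts`, so adding or removing one user changes the histogram by
    at most that amount in L1 norm.
    """
    rng = rng or random.Random()
    scale = max_reports_per_user / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]
```

The noise scale grows linearly with the number of reports per user, so at a fixed privacy budget the released counts become noisier as individuals contribute more location reports.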
Submitted 20 August, 2022;
originally announced August 2022.
-
Dynamics of Explosive Events by Interface Region Imaging Spectrograph
Authors:
E. Tavabi,
S. Zeighami,
M. Heydari
Abstract:
In this research, we investigate Explosive Events (EEs) in the off-limb solar atmosphere, with simultaneous observations from the Si IV, Mg II k, and slit-jaw images (SJI) of the Interface Region Imaging Spectrograph (IRIS), on 17 August 2014, and 19 February. IRIS data can be investigated to observe the motion of matter, fluctuations, energy absorption, and heat transition of the solar atmosphere. Mechanisms responsible for solar large-scale structures, such as flares and coronal mass ejections, might originate from these small scale energetic events. Therefore, the study of these events can be helpful for understanding mechanisms in mass and energy transport from the chromosphere toward the transition region and corona. We obtain intensity profiles from spectra in two altitudes, i.e., at the solar limb and 5 arcsec distance from the solar limb, and then analyze the EE fluctuations at these two altitudes along the slit. We find that some spectral line profiles show enhancements in blue and red wings indicating upward and downward flows, and some profiles have opposite EEs in both wings. The amplitude of the Doppler velocity in the two data sets of different altitudes was approximated to be about 50 km/s. We calculated the phase velocity of the oscillations using a technique based on cross-correlation. The phase velocity is obtained as about 220 km/s. According to the periodic red and blue enhancements in EEs, we suggest that the fluctuations in the EEs with one side enhancement indicate a swaying motion of spicules about their axes, and those EEs observed in both wings indicate a rotational motion of spicules. The swaying and rotational motions are indicative of kink and torsional waves, respectively.
Submitted 5 July, 2022; v1 submitted 3 July, 2022;
originally announced July 2022.
-
A Neural Database for Differentially Private Spatial Range Queries
Authors:
Sepanta Zeighami,
Ritesh Ahuja,
Gabriel Ghinita,
Cyrus Shahabi
Abstract:
Mobile apps and location-based services generate large amounts of location data that can benefit research on traffic optimization, context-aware notifications and public health (e.g., spread of contagious diseases). To preserve individual privacy, one must first sanitize location data, which is commonly done using the powerful differential privacy (DP) concept. However, existing solutions fall short of properly capturing density patterns and correlations that are intrinsic to spatial data, and as a result yield poor accuracy. We propose a machine-learning based approach for answering statistical queries on location data with DP guarantees. We focus on countering the main source of error that plagues existing approaches (namely, uniformity error), and we design a neural database system that models spatial datasets such that important density and correlation features present in the data are preserved, even when DP-compliant noise is added. We employ a set of neural networks that learn from diverse regions of the dataset and at varying granularities, leading to superior accuracy. We also devise a framework for effective system parameter tuning on top of public data, which helps practitioners set important system parameters without having to expend scarce privacy budget. Extensive experimental results on real datasets with heterogeneous characteristics show that our proposed approach significantly outperforms the state of the art.
Submitted 3 August, 2021;
originally announced August 2021.
-
NeuroDB: A Neural Network Framework for Answering Range Aggregate Queries and Beyond
Authors:
Sepanta Zeighami,
Cyrus Shahabi
Abstract:
Range aggregate queries (RAQs) are an integral part of many real-world applications, where, often, fast and approximate answers for the queries are desired. Recent work has studied answering RAQs using machine learning models, where a model of the data is learned to answer the queries. However, such modelling choices fail to utilize any query specific information. To capture such information, we observe that RAQs can be represented by query functions, which are functions that take a query instance (i.e., a specific RAQ) as an input and output its corresponding answer. Using this representation, we formulate the problem of learning to approximate the query function, and propose NeuroDB, a query specialized neural network framework, that answers RAQs efficiently. NeuroDB is query-type agnostic (i.e., it does not make any assumption about the underlying query type) and our observation that queries can be represented by functions is not specific to RAQs. Thus, we investigate whether NeuroDB can be used for other query types, by applying it to distance to nearest neighbour queries. We experimentally show that NeuroDB outperforms the state-of-the-art for this query type, often by orders of magnitude. Moreover, the same neural network architecture as for RAQs is used, bringing to light the possibility of using a generic framework to answer any query type efficiently.
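The query-function view can be made concrete with a small sketch: a function that maps a query instance to its answer, plus a generator of (query, answer) pairs that a regression model could be trained on. The one-dimensional COUNT query, the uniform query sampling, and the names `range_count` and `make_training_pairs` are assumptions for illustration; the neural network itself is omitted.

```python
import random


def range_count(data, lo, hi):
    """Query function for a 1-D COUNT query: maps (lo, hi) to its answer."""
    return sum(1 for x in data if lo <= x <= hi)


def make_training_pairs(data, num_queries, rng):
    """Sample query instances and label each with its exact answer.

    The resulting (query, answer) pairs are the kind of supervision a
    model approximating the query function would be fit to.
    """
    lo_bound, hi_bound = min(data), max(data)
    pairs = []
    for _ in range(num_queries):
        a, b = sorted(
            (rng.uniform(lo_bound, hi_bound), rng.uniform(lo_bound, hi_bound))
        )
        pairs.append(((a, b), range_count(data, a, b)))
    return pairs
```

Nothing in the pair generator depends on the query being a range aggregate: swapping `range_count` for another function of the query gives training data for a different query type.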
Submitted 10 July, 2021;
originally announced July 2021.
-
Towards Accurate Spatiotemporal COVID-19 Risk Scores using High Resolution Real-World Mobility Data
Authors:
Sirisha Rambhatla,
Sepanta Zeighami,
Kameron Shahabi,
Cyrus Shahabi,
Yan Liu
Abstract:
As countries look towards re-opening of economic activities amidst the ongoing COVID-19 pandemic, ensuring public health has been challenging. While contact tracing only aims to track past activities of infected users, one path to safe reopening is to develop reliable spatiotemporal risk scores to indicate the propensity of the disease. Existing works which aim to develop risk scores either rely on compartmental model-based reproduction numbers (which assume uniform population mixing) or develop coarse-grain spatial scores based on reproduction number (R0) and macro-level density-based mobility statistics. Instead, in this paper, we develop a Hawkes process-based technique to assign relatively fine-grain spatial and temporal risk scores by leveraging high-resolution mobility data based on cell-phone originated location signals. While COVID-19 risk scores also depend on a number of factors specific to an individual, including demography and existing medical conditions, the primary mode of disease transmission is via physical proximity and contact. Therefore, we focus on developing risk scores based on location density and mobility behaviour. We demonstrate the efficacy of the developed risk scores via simulation based on real-world mobility data. Our results show that fine-grain spatiotemporal risk scores based on high-resolution mobility data can provide useful insights and facilitate safe re-opening.
Submitted 14 December, 2020;
originally announced December 2020.
-
Estimating Spread of Contact-Based Contagions in a Population Through Sub-Sampling
Authors:
Sepanta Zeighami,
Cyrus Shahabi,
John Krumm
Abstract:
Physical contacts result in the spread of various phenomena such as viruses, gossip, ideas, packages and marketing pamphlets across a population. The spread depends on how people move and co-locate with each other, or their mobility patterns. How far such phenomena spread has significance for both policy making and personal decision making, e.g., studying the spread of COVID-19 under different intervention strategies such as wearing a mask. In practice, mobility patterns of an entire population are never available, and we usually have access to location data of a subset of individuals. In this paper, we formalize and study the problem of estimating the spread of a phenomenon in a population, given that we only have access to sub-samples of location visits of some individuals in the population. We show that simple solutions such as estimating the spread in the sub-sample and scaling it to the population, or more sophisticated solutions that rely on modeling location visits of individuals, do not perform well in practice, the former because it ignores contacts between unobserved individuals and sampled ones and the latter because it yields inaccurate modeling of co-locations. Instead, we directly model the co-locations between the individuals. We introduce PollSpreader and PollSusceptible, two novel approaches that model the co-locations between individuals using a contact network, and infer the properties of the contact network using the subsample to estimate the spread of the phenomenon in the entire population. We show that our estimates provide an upper bound and a lower bound on the spread of the disease in expectation. Finally, using a large high-resolution real-world mobility dataset, we experimentally show that our estimates are accurate, while other methods that do not correctly account for co-locations between individuals result in wrong observations (e.g., premature herd immunity).
Submitted 13 December, 2020;
originally announced December 2020.
-
Bridging the Gap Between Theory and Practice on Insertion-Intensive Database
Authors:
Sepanta Zeighami,
Raymond Chi-Wing Wong
Abstract:
With the prevalence of online platforms, today, data is being generated and accessed by users at a very high rate. Besides, applications such as stock trading or high frequency trading require guaranteed low delays for performing an operation on a database. It is consequential to design databases that guarantee data insertion and query at a consistently high rate without introducing any long delay during insertion. In this paper, we propose Nested B-trees (NB-trees), an index that can achieve a consistently high insertion rate on large volumes of data, while providing asymptotically optimal query performance that is very efficient in practice. Nested B-trees support insertions at rates higher than LSM-trees, the state-of-the-art index for insertion-intensive workloads, while avoiding their long insertion delays and improving on their query performance. They approach the query performance of B-trees when complemented with Bloom filters. In our experiments, NB-trees had worst-case delays up to 1000 times smaller than LevelDB, RocksDB and bLSM, commonly used LSM-tree data-stores, could perform queries more than 4 times faster than LevelDB and 1.5 times faster than bLSM and RocksDB, while also outperforming them in terms of average insertion rate.
Submitted 2 March, 2020;
originally announced March 2020.
-
Dynamic Skyline Queries on Encrypted Data Using Result Materialization
Authors:
Sepanta Zeighami,
Gabriel Ghinita,
Cyrus Shahabi
Abstract:
Skyline computation is an increasingly popular query, with broad applicability in domains such as healthcare, travel and finance. Given the recent trend to outsource databases and query evaluation, and due to the proprietary and sometimes highly sensitive nature of the data (e.g., in healthcare), it is essential to evaluate skylines on encrypted datasets. Several research efforts acknowledged the importance of secure skyline computation, but existing solutions suffer from at least one of the following shortcomings: (i) they only provide ad-hoc security; (ii) they are prohibitively expensive; or (iii) they rely on unrealistic assumptions, such as the presence of multiple non-colluding parties in the protocol.
Inspired by solutions for secure nearest-neighbors (NN) computation, we conjecture that the most secure and efficient way to compute skylines is through result materialization. However, this approach is significantly more challenging for skylines than for NN queries. We exhaustively study and provide algorithms for pre-computation of skyline results, and we perform an in-depth theoretical analysis of this process. We show that pre-computing results while minimizing storage overhead is NP-hard, and we provide dynamic programming and greedy heuristics that solve the problem more efficiently, while maintaining storage at reasonable levels. Our algorithms are novel and applicable to plain-text skyline computation, but we focus on the encrypted setting where materialization reduces the cost of skyline computation from hours to seconds. Extensive experiments show that we clearly outperform existing work in terms of performance, and our security analysis proves that we obtain a smaller (and quantifiable) data leakage than competitors.
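As a point of reference for what a materialized result has to reproduce, here is a minimal plain-text sketch of a dynamic skyline query. It assumes the commonly used definition in which points are compared by their per-dimension absolute distance to the query point; it shows neither the encrypted protocol nor the paper's pre-computation algorithms, and the function name `dynamic_skyline` is hypothetical.

```python
def dynamic_skyline(points, query):
    """Return the points not dominated with respect to `query`.

    Each point is mapped to its per-dimension absolute distance from the
    query. A point is dominated if another point is at least as close in
    every dimension and strictly closer in at least one.
    """

    def dist(p):
        return tuple(abs(a - b) for a, b in zip(p, query))

    def dominates(u, v):
        return all(a <= b for a, b in zip(u, v)) and any(
            a < b for a, b in zip(u, v)
        )

    mapped = [(dist(p), p) for p in points]
    # Quadratic scan: keep a point if no mapped point dominates it.
    return [
        p for d, p in mapped if not any(dominates(e, d) for e, _ in mapped)
    ]
```

The answer changes with the query point, which is part of what makes pre-computing results harder here than for a static skyline.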
Submitted 28 February, 2020;
originally announced March 2020.
-
Finding Average Regret Ratio Minimizing Set in Database
Authors:
Sepanta Zeighami,
Raymond Chi-Wing Wong
Abstract:
Selecting a certain number of data points (or records) from a database which "best" satisfy users' expectations is a very prevalent problem with many applications. One application is a hotel booking website showing a certain number of hotels on a single page. However, this problem is very challenging since the selected points should "collectively" satisfy the expectation of all users. Showing a certain number of data points to a single user could decrease the satisfaction of a user because the user may not be able to see his/her favorite point which could be found in the original database. In this paper, we would like to find a set of k points such that on average, the satisfaction (ratio) of a user is maximized. This problem takes into account the probability distribution of the users and considers the satisfaction (ratio) of all users, which is more reasonable in practice, compared with the existing studies that only consider the worst-case satisfaction (ratio) of the users, which may not reflect the whole population and is not useful in some applications. Motivated by this, in this paper, we propose algorithms for this problem. Finally, we conducted experiments to show the effectiveness and the efficiency of the algorithms.
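The objective can be sketched under some loudly stated assumptions: users are represented by sampled weight vectors with linear, strictly positive utilities, every sampled user is equally likely, and the regret ratio of a user is one minus that user's satisfaction ratio. The brute-force search over all k-subsets is for illustration only and is not one of the algorithms proposed in the paper; all function names are hypothetical.

```python
from itertools import combinations


def utility(point, weights):
    # Assumed linear utility: weighted sum of the point's attributes.
    return sum(w * x for w, x in zip(weights, point))


def average_regret_ratio(subset, database, users):
    """Mean over users of 1 - (best utility in subset / best in database).

    Assumes every user's best utility over the database is positive.
    """
    total = 0.0
    for w in users:
        best_all = max(utility(p, w) for p in database)
        best_sub = max(utility(p, w) for p in subset)
        total += 1.0 - best_sub / best_all
    return total / len(users)


def best_k_subset(database, users, k):
    """Exhaustively pick the k points with the smallest average regret ratio."""
    return min(
        combinations(database, k),
        key=lambda s: average_regret_ratio(s, database, users),
    )
```

With three hotels `(1.0, 0.0)`, `(0.0, 1.0)`, `(0.6, 0.6)` and users weighting the two attributes as `(1, 0)`, `(0, 1)` and `(0.5, 0.5)`, showing only the balanced hotel gives the lowest average regret, even though two of the three users would prefer a different one.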
Submitted 18 October, 2018;
originally announced October 2018.
-
Evidence for Energy Supply by Active Region Spicules to the Solar Atmosphere
Authors:
S. Zeighami,
A. R. Ahangarzadeh Maralani,
E. Tavabi,
A. Ajabshirizadeh
Abstract:
We investigate the role of active region spicules in the mass balance of the solar wind and energy supply for heating the solar atmosphere. We use high cadence observations from the Solar Optical Telescope (SOT) onboard the Hinode satellite in the Ca II H line filter obtained on 26 January 2007. The observational technique provides the high spatio-temporal resolution required to detect fine structures such as spicules. We apply Fourier power spectrum and wavelet analysis to SOT/Hinode time series of active region data to explore the existence of coherent intensity oscillations. The presence of coherent waves could be evidence for energy transport to heat the solar atmosphere. Using time series, we measure the phase difference between two intensity profiles obtained at two different heights, which gives information about the phase difference between oscillations at those heights as a function of frequency. The results of a fast Fourier transform (FFT) show peaks in the power spectrum at frequencies in the range from 2 to 8 mHz at four different heights (above the limb), while the wavelet analysis indicates dominant frequencies similar to those of the Fourier power spectrum results. A coherency study indicates the presence of coherent oscillations at about 5.5 mHz (3 min). We measure mean phase speeds in the range 250 to 425 km/s increasing with height.
Submitted 12 February, 2016; v1 submitted 9 February, 2016;
originally announced February 2016.
-
Spicules Intensity Oscillations in SOT/HINODE Observations
Authors:
E. Tavabi,
A. Ajabshirizadeh,
A. R. Ahangarzadeh Maralani,
S. Zeighami
Abstract:
Aims. We study the coherency of solar spicules intensity oscillations with increasing height above the solar limb in quiet Sun, active Sun and active region using observations from HINODE/SOT. The existence of coherency up to the transition region strengthens the theory of the coronal heating and solar wind through energy transport and photospheric oscillations. Methods. Using time sequences from the HINODE/SOT in Ca II H line, we investigate oscillations found in intensity profiles at different heights above the solar limb. We use the Fourier and wavelet analysis to measure dominant frequency peaks of intensity at the heights, and phase difference between oscillations at two certain heights, to find evidence for the coherency of the oscillations. Finally, we can calculate the energy and the mass transported by spicules providing energy equilibrium, according to density values of spicules at different heights. To extend this work, we can also consider coherent oscillations at different latitudes and suggest studying oscillations that may be obtained from observations of other satellites.
Submitted 1 April, 2015;
originally announced April 2015.
-
Alfvenic waves in polar spicules
Authors:
E. Tavabi,
S. Koutchmy,
A. Ajabshirizadeh,
A. R. Ahangarzadeh Maralani,
S. Zeighami
Abstract:
Context. For investigating spicules from the photosphere to coronal heights, the new Hinode/SOT long series of high resolution observations from space taken in CaII H line emission offers an improved way to look at their remarkable dynamical behavior using images free of seeing effects. They should be put in the context of the huge amount of already accumulated material from ground-based instruments, including high-resolution spectra of off-limb spicules. Results. The surge-like behavior of solar polar region spicules supports the untwisting multi-component interpretation of spicules exhibiting helical dynamics. Several tall spicules are found with (i) upward and downward flows similar at lower and middle-levels, the rate of upward motion being slightly higher at high levels; (ii) the left and right-hand velocities are also increasing with height; (iii) a large number of multi-component spicules show shearing motion of both left-handed and right-handed senses occurring simultaneously, which might be understood as twisting (or untwisting) threads. The number of turns depends on the overall diameter of the structure made of components and changes from at least one turn for the smallest structure to at most two or three turns for surge-like broad structures; the curvature along the spicule corresponds to a low turn number similar to a transverse kink mode oscillation along the threads.
Submitted 30 September, 2014; v1 submitted 26 September, 2014;
originally announced September 2014.