Towards Mobility Data Science (Vision Paper)

Authors: Mohamed Mokbel, Mahmoud Sakr, Li Xiong, Andreas Züfle, Jussara Almeida, Taylor Anderson, Walid Aref, Gennady Andrienko, Natalia Andrienko, Yang Cao, Sanjay Chawla, Reynold Cheng, Panos Chrysanthis, Xiqi Fei, Gabriel Ghinita, Anita Graser, Dimitrios Gunopulos, Christian Jensen, Joon-Seok Kim, Kyoung-Sook Kim, Peer Kröger, John Krumm, Johannes Lauer, Amr Magdy, Mario Nascimento, et al. (23 additional authors not shown)

Abstract: Mobility data captures the locations of moving objects such as humans, animals, and cars. With the availability of GPS-equipped mobile devices and other inexpensive location-tracking technologies, mobility data is collected ubiquitously. In recent years, the use of mobility data has demonstrated significant impact in various domains including traffic management, urban planning, and health sciences. In this paper, we present the emerging domain of mobility data science. Towards a unified approach to mobility data science, we envision a pipeline having the following components: mobility data collection, cleaning, analysis, management, and privacy. For each of these components, we explain how mobility data science differs from general data science, we survey the current state of the art and describe open challenges for the research community in the coming years.

Submitted 7 March, 2024; v1 submitted 21 June, 2023; originally announced July 2023.

Comments: Updated to reflect the major revision for ACM Transactions on Spatial Algorithms and Systems (TSAS). This version reflects the final version accepted by ACM TSAS

arXiv:2204.00714 [pdf, other]

Reliable Geofence Activation with Sparse and Sporadic Location Measurements: Extended Version

Authors: Kien Nguyen, John Krumm

Abstract: Geofences are a fundamental tool of location-based services. A geofence is usually activated by detecting a location measurement inside the geofence region. However, location measurements such as GPS often appear sporadically on smartphones, partly due to weak signal, or privacy preservation, because users may restrict location sensing, or energy conservation, because sensing locations can consume a significant amount of energy. These unpredictable, and sometimes long, gaps between measurements mean that entry into a geofence can go completely undetected. In this paper we argue that short term location prediction can help alleviate this problem by computing the probability of entering a geofence in the future. Complicating this prediction approach is the fact that another location measurement could appear at any time, making the prediction redundant and wasteful. Therefore, we develop a framework that accounts for uncertain location predictions and the possibility of new measurements to trigger geofence activations. Our framework optimizes over the benefits and costs of correct and incorrect geofence activations, leading to an algorithm that reacts intelligently to the uncertainties of future movements and measurements.

Submitted 1 April, 2022; originally announced April 2022.

Comments: 10 pages, MDM 2022

arXiv:2110.03712 [pdf, other]

Gaussian Process for Trajectories

Authors: Kien Nguyen, John Krumm, Cyrus Shahabi

Abstract: The Gaussian process is a powerful and flexible technique for interpolating spatiotemporal data, especially with its ability to capture complex trends and uncertainty from the input signal. This chapter describes Gaussian processes as an interpolation technique for geospatial trajectories. A Gaussian process models measurements of a trajectory as coming from a multidimensional Gaussian, and it produces for each timestamp a Gaussian distribution as a prediction. We discuss elements that need to be considered when applying a Gaussian process to trajectories, common choices for those elements, and provide a concrete example of implementing a Gaussian process.

Submitted 7 October, 2021; originally announced October 2021.

Comments: SpatialGems workshop 2021, 7 pages

arXiv:2108.12450 [pdf, other]

Quantifying Intrinsic Value of Information of Trajectories

Authors: Kien Nguyen, John Krumm, Cyrus Shahabi

Abstract: A trajectory, defined as a sequence of location measurements, contains valuable information about movements of an individual. Its value of information (VOI) may change depending on the specific application. However, in a variety of applications, knowing the intrinsic VOI of a trajectory is important to guide other subsequent tasks or decisions. This work aims to find a principled framework to quantify the intrinsic VOI of trajectories from the owner's perspective. This is a challenging problem because an appropriate framework needs to take into account various characteristics of the trajectory, prior knowledge, and different types of trajectory degradation. We propose a framework based on information gain (IG) as a principled approach to solve this problem. Our IG framework transforms a trajectory with discrete-time measurements to a canonical representation, i.e., continuous in time with continuous mean and variance estimates, and then quantifies the reduction of uncertainty about the locations of the owner over a period of time as the VOI of the trajectory. Qualitative and extensive quantitative evaluations show that the IG framework is capable of effectively capturing important characteristics contributing to the VOI of trajectories.

Submitted 7 September, 2021; v1 submitted 27 August, 2021; originally announced August 2021.

Comments: 10 pages, SIGSPATIAL'21

arXiv:2107.13749 [pdf, other]

HTF: Homogeneous Tree Framework for Differentially-Private Release of Location Data

Authors: Sina Shaham, Gabriel Ghinita, Ritesh Ahuja, John Krumm, Cyrus Shahabi

Abstract: Mobile apps that use location data are pervasive, spanning domains such as transportation, urban planning and healthcare. Important use cases for location data rely on statistical queries, e.g., identifying hotspots where users work and travel. Such queries can be answered efficiently by building histograms. However, precise histograms can expose sensitive details about individual users. Differential privacy (DP) is a mature and widely-adopted protection model, but most approaches for DP-compliant histograms work in a data-independent fashion, leading to poor accuracy. The few proposed data-dependent techniques attempt to adjust histogram partitions based on dataset characteristics, but they do not perform well due to the addition of noise required to achieve DP. We identify density homogeneity as a main factor driving the accuracy of DP-compliant histograms, and we build a data structure that splits the space such that data density is homogeneous within each resulting partition. We show through extensive experiments on large-scale real-world data that the proposed approach achieves superior accuracy compared to existing approaches.

Submitted 29 July, 2021; originally announced July 2021.

arXiv:2012.06987 [pdf, other]

doi 10.14778/3461535.3461544

Estimating Spread of Contact-Based Contagions in a Population Through Sub-Sampling

Authors: Sepanta Zeighami, Cyrus Shahabi, John Krumm

Abstract: Physical contacts result in the spread of various phenomena such as viruses, gossip, ideas, packages and marketing pamphlets across a population. The spread depends on how people move and co-locate with each other, or their mobility patterns. How far such phenomena spread has significance for both policy making and personal decision making, e.g., studying the spread of COVID-19 under different intervention strategies such as wearing a mask. In practice, mobility patterns of an entire population are never available, and we usually have access to location data of a subset of individuals. In this paper, we formalize and study the problem of estimating the spread of a phenomenon in a population, given that we only have access to sub-samples of location visits of some individuals in the population. We show that simple solutions such as estimating the spread in the sub-sample and scaling it to the population, or more sophisticated solutions that rely on modeling location visits of individuals, do not perform well in practice, the former because it ignores contacts between unobserved individuals and sampled ones and the latter because it yields inaccurate modeling of co-locations. Instead, we directly model the co-locations between the individuals. We introduce PollSpreader and PollSusceptible, two novel approaches that model the co-locations between individuals using a contact network, and infer the properties of the contact network using the subsample to estimate the spread of the phenomenon in the entire population. We show that our estimates provide an upper bound and a lower bound on the spread of the disease in expectation. Finally, using a large high-resolution real-world mobility dataset, we experimentally show that our estimates are accurate, while other methods that do not correctly account for co-locations between individuals result in wrong observations (e.g., premature herd immunity).

Submitted 13 December, 2020; originally announced December 2020.

Journal ref: Proc. VLDB Endow. 14, 9, 1557-1569 (2021)

arXiv:2008.11817 [pdf, other]

doi 10.1145/3397536.3422213

Spatial Privacy Pricing: The Interplay between Privacy, Utility and Price in Geo-Marketplaces

Authors: Kien Nguyen, John Krumm, Cyrus Shahabi

Abstract: A geo-marketplace allows users to be paid for their location data. Users concerned about privacy may want to charge more for data that pinpoints their location accurately, but may charge less for data that is more vague. A buyer would prefer to minimize data costs, but may have to spend more to get the necessary level of accuracy. We call this interplay between privacy, utility, and price spatial privacy pricing. We formalize the issues mathematically with an example problem of a buyer deciding whether or not to open a restaurant by purchasing location data to determine if the potential number of customers is sufficient to open. The problem is expressed as a sequential decision making problem, where the buyer first makes a series of decisions about which data to buy and concludes with a decision about opening the restaurant or not. We present two algorithms to solve this problem, including experiments that show they perform better than baselines.

Submitted 3 September, 2020; v1 submitted 25 August, 2020; originally announced August 2020.

Comments: 10 pages, SIGSPATIAL'20

arXiv:2008.00304 [pdf, other]

SemEval-2020 Task 7: Assessing Humor in Edited News Headlines

Authors: Nabil Hossain, John Krumm, Michael Gamon, Henry Kautz

Abstract: This paper describes the SemEval-2020 shared task "Assessing Humor in Edited News Headlines." The task's dataset contains news headlines in which short edits were applied to make them funny, and the funniness of these edited headlines was rated using crowdsourcing. This task includes two subtasks, the first of which is to estimate the funniness of headlines on a humor scale in the interval 0-3. The second subtask is to predict, for a pair of edited versions of the same original headline, which is the funnier version. To date, this task is the most popular shared computational humor task, attracting 48 teams for the first subtask and 31 teams for the second.

Submitted 1 August, 2020; originally announced August 2020.

arXiv:2002.02031 [pdf, other]

Stimulating Creativity with FunLines: A Case Study of Humor Generation in Headlines

Authors: Nabil Hossain, John Krumm, Tanvir Sajed, Henry Kautz

Abstract: Building datasets of creative text, such as humor, is quite challenging. We introduce FunLines, a competitive game where players edit news headlines to make them funny, and where they rate the funniness of headlines edited by others. FunLines makes the humor generation process fun, interactive, collaborative, rewarding and educational, keeping players engaged and providing humor data at a very low cost compared to traditional crowdsourcing approaches. FunLines offers useful performance feedback, assisting players in getting better over time at generating and assessing humor, as our analysis shows. This helps to further increase the quality of the generated dataset. We show the effectiveness of this data by training humor classification models that outperform a previous benchmark, and we release this dataset to the public.

Submitted 5 February, 2020; originally announced February 2020.

arXiv:1906.00274 [pdf, other]

"President Vows to Cut <Taxes> Hair": Dataset and Analysis of Creative Text Editing for Humorous Headlines

Authors: Nabil Hossain, John Krumm, Michael Gamon

Abstract: We introduce, release, and analyze a new dataset, called Humicroedit, for research in computational humor. Our publicly available data consists of regular English news headlines paired with versions of the same headlines that contain simple replacement edits designed to make them funny. We carefully curated crowdsourced editors to create funny headlines and judges to score a total of 15,095 edited headlines, with five judges per headline. The simple edits, usually just a single word replacement, mean we can apply straightforward analysis techniques to determine what makes our edited headlines humorous. We show how the data support classic theories of humor, such as incongruity, superiority, and setup/punchline. Finally, we develop baseline classifiers that can predict whether or not an edited headline is funny, which is a first step toward automatically generating humorous headlines as an approach to creating topical humor.

Submitted 1 June, 2019; originally announced June 2019.

Comments: Accepted in NAACL 2019

Showing 1–10 of 10 results for author: Krumm, J