Search | arXiv e-print repository

Source Localization for Cross Network Information Diffusion

Authors: Chen Ling, Tanmoy Chowdhury, Jie Ji, Sirui Li, Andreas Züfle, Liang Zhao

Abstract: Source localization aims to locate information diffusion sources only given the diffusion observation, which has attracted extensive attention in the past few years. Existing methods are mostly tailored for single networks and may not be generalized to handle more complex networks like cross-networks. Cross-network is defined as two interconnected networks, where one network's functionality depend… ▽ More Source localization aims to locate information diffusion sources only given the diffusion observation, which has attracted extensive attention in the past few years. Existing methods are mostly tailored for single networks and may not be generalized to handle more complex networks like cross-networks. Cross-network is defined as two interconnected networks, where one network's functionality depends on the other. Source localization on cross-networks entails locating diffusion sources on the source network by only giving the diffused observation in the target network. The task is challenging due to challenges including: 1) diffusion sources distribution modeling; 2) jointly considering both static and dynamic node features; and 3) heterogeneous diffusion patterns learning. In this work, we propose a novel method, namely CNSL, to handle the three primary challenges. Specifically, we propose to learn the distribution of diffusion sources through Bayesian inference and leverage disentangled encoders to separately learn static and dynamic node features. The learning objective is coupled with the cross-network information propagation estimation model to make the inference of diffusion sources considering the overall diffusion process. Additionally, we also provide two novel cross-network datasets collected by ourselves. Extensive experiments are conducted on both datasets to demonstrate the effectiveness of \textit{CNSL} in handling the source localization on cross-networks. △ Less

Submitted 22 April, 2024; originally announced April 2024.

Comments: Code and data are available at: https://github.com/tanmoysr/CNSL/

arXiv:2404.07308 [pdf, other]

Spatial Transfer Learning for Estimating PM2.5 in Data-poor Regions

Authors: Shrey Gupta, Yongbee Park, Jianzhao Bi, Suyash Gupta, Andreas Züfle, Avani Wildani, Yang Liu

Abstract: Air pollution, especially particulate matter 2.5 (PM2.5), is a pressing concern for public health and is difficult to estimate in develo** countries (data-poor regions) due to a lack of ground sensors. Transfer learning models can be leveraged to solve this problem, as they use alternate data sources to gain knowledge (i.e., data from data-rich regions). However, current transfer learning method… ▽ More Air pollution, especially particulate matter 2.5 (PM2.5), is a pressing concern for public health and is difficult to estimate in develo** countries (data-poor regions) due to a lack of ground sensors. Transfer learning models can be leveraged to solve this problem, as they use alternate data sources to gain knowledge (i.e., data from data-rich regions). However, current transfer learning methodologies do not account for dependencies between the source and the target domains. We recognize this transfer problem as spatial transfer learning and propose a new feature named Latent Dependency Factor (LDF) that captures spatial and semantic dependencies of both domains and is subsequently added to the feature spaces of the domains. We generate LDF using a novel two-stage autoencoder model that learns from clusters of similar source and target domain data. Our experiments show that transfer learning models using LDF have a 19.34% improvement over the baselines. We additionally support our experiments with qualitative findings. △ Less

Submitted 22 June, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

Comments: Accepted for publication at ECML-PKDD 2024

arXiv:2310.04942 [pdf, ps, other]

Large Language Models for Spatial Trajectory Patterns Mining

Authors: Zheng Zhang, Hossein Amiri, Zhenke Liu, Andreas Züfle, Liang Zhao

Abstract: Identifying anomalous human spatial trajectory patterns can indicate dynamic changes in mobility behavior with applications in domains like infectious disease monitoring and elderly care. Recent advancements in large language models (LLMs) have demonstrated their ability to reason in a manner akin to humans. This presents significant potential for analyzing temporal patterns in human mobility. In… ▽ More Identifying anomalous human spatial trajectory patterns can indicate dynamic changes in mobility behavior with applications in domains like infectious disease monitoring and elderly care. Recent advancements in large language models (LLMs) have demonstrated their ability to reason in a manner akin to humans. This presents significant potential for analyzing temporal patterns in human mobility. In this paper, we conduct empirical studies to assess the capabilities of leading LLMs like GPT-4 and Claude-2 in detecting anomalous behaviors from mobility data, by comparing to specialized methods. Our key findings demonstrate that LLMs can attain reasonable anomaly detection performance even without any specific cues. In addition, providing contextual clues about potential irregularities could further enhances their prediction efficacy. Moreover, LLMs can provide reasonable explanations for their judgments, thereby improving transparency. Our work provides insights on the strengths and limitations of LLMs for human spatial trajectory analysis. △ Less

Submitted 7 October, 2023; originally announced October 2023.

arXiv:2307.05717 [pdf, other]

Towards Mobility Data Science (Vision Paper)

Authors: Mohamed Mokbel, Mahmoud Sakr, Li Xiong, Andreas Züfle, Jussara Almeida, Taylor Anderson, Walid Aref, Gennady Andrienko, Natalia Andrienko, Yang Cao, Sanjay Chawla, Reynold Cheng, Panos Chrysanthis, Xiqi Fei, Gabriel Ghinita, Anita Graser, Dimitrios Gunopulos, Christian Jensen, Joon-Seok Kim, Kyoung-Sook Kim, Peer Kröger, John Krumm, Johannes Lauer, Amr Magdy, Mario Nascimento , et al. (23 additional authors not shown)

Abstract: Mobility data captures the locations of moving objects such as humans, animals, and cars. With the availability of GPS-equipped mobile devices and other inexpensive location-tracking technologies, mobility data is collected ubiquitously. In recent years, the use of mobility data has demonstrated significant impact in various domains including traffic management, urban planning, and health sciences… ▽ More Mobility data captures the locations of moving objects such as humans, animals, and cars. With the availability of GPS-equipped mobile devices and other inexpensive location-tracking technologies, mobility data is collected ubiquitously. In recent years, the use of mobility data has demonstrated significant impact in various domains including traffic management, urban planning, and health sciences. In this paper, we present the emerging domain of mobility data science. Towards a unified approach to mobility data science, we envision a pipeline having the following components: mobility data collection, cleaning, analysis, management, and privacy. For each of these components, we explain how mobility data science differs from general data science, we survey the current state of the art and describe open challenges for the research community in the coming years. △ Less

Submitted 7 March, 2024; v1 submitted 21 June, 2023; originally announced July 2023.

Comments: Updated to reflect the major revision for ACM Transactions on Spatial Algorithms and Systems (TSAS). This version reflects the final version accepted by ACM TSAS

arXiv:2208.12810 [pdf, other]

doi 10.1109/TGRS.2023.3291309

Riesz-Quincunx-UNet Variational Auto-Encoder for Satellite Image Denoising

Authors: Duy H. Thai, Xiqi Fei, Minh Tri Le, Andreas Züfle, Konrad Wessels

Abstract: Multiresolution deep learning approaches, such as the U-Net architecture, have achieved high performance in classifying and segmenting images. However, these approaches do not provide a latent image representation and cannot be used to decompose, denoise, and reconstruct image data. The U-Net and other convolutional neural network (CNNs) architectures commonly use pooling to enlarge the receptive… ▽ More Multiresolution deep learning approaches, such as the U-Net architecture, have achieved high performance in classifying and segmenting images. However, these approaches do not provide a latent image representation and cannot be used to decompose, denoise, and reconstruct image data. The U-Net and other convolutional neural network (CNNs) architectures commonly use pooling to enlarge the receptive field, which usually results in irreversible information loss. This study proposes to include a Riesz-Quincunx (RQ) wavelet transform, which combines 1) higher-order Riesz wavelet transform and 2) orthogonal Quincunx wavelets (which have both been used to reduce blur in medical images) inside the U-net architecture, to reduce noise in satellite images and their time-series. In the transformed feature space, we propose a variational approach to understand how random perturbations of the features affect the image to further reduce noise. Combining both approaches, we introduce a hybrid RQUNet-VAE scheme for image and time series decomposition used to reduce noise in satellite imagery. We present qualitative and quantitative experimental results that demonstrate that our proposed RQUNet-VAE was more effective at reducing noise in satellite imagery compared to other state-of-the-art methods. We also apply our scheme to several applications for multi-band satellite images, including: image denoising, image and time-series decomposition by diffusion and image segmentation. △ Less

Submitted 25 August, 2022; originally announced August 2022.

Comments: Submitted to IEEE Transactions on Geoscience and Remote Sensing (TGRS)

arXiv:2112.06344 [pdf, other]

Probabilistic Counting in Uncertain Spatial Databases using Generating Functions

Authors: Andreas Züfle

Abstract: Location data is inherently uncertain for many reasons including 1) imprecise location measurements, 2) obsolete observations that are often interpolated, and 3) deliberate obfuscation to preserve location privacy. What makes handling uncertainty data challenging is the exponentially large number of possible worlds, which lies in O(2^N), for a database having N uncertain objects as it has been sho… ▽ More Location data is inherently uncertain for many reasons including 1) imprecise location measurements, 2) obsolete observations that are often interpolated, and 3) deliberate obfuscation to preserve location privacy. What makes handling uncertainty data challenging is the exponentially large number of possible worlds, which lies in O(2^N), for a database having N uncertain objects as it has been shown that general query processing in uncertain spatial data is NP-hard. Many applications using spatial data require counting the number of spatial objects within a region. An example is the k-Nearest Neighbor (kNN) query: Asking if an object A is a kNN of another object Q is equivalent to asking whether no more than k-1 objects are located inside the circle centered at Q having a radius equal to the distance between Q and A. For this problem of counting uncertain objects within a region, an efficient solution based on Generating Functions has been proposed and successfully used in many applications, including range-count queries, kNN queries, distance ranking queries, and reverse kNN queries. This spatial gem describes the generating function technique for probabilistic counting and provides examples and implementation details. △ Less

Submitted 12 December, 2021; originally announced December 2021.

Comments: 6 Pages, 1 Figure, to be submitted to the 4th ACM SIGSPATIAL International Workshop on Spatial Gems (SpatialGems 2022) (https://www.spatialgems.net/)

ACM Class: H.4

arXiv:2109.09022 [pdf, other]

doi 10.1371/journal.pone.0259031

Change of human mobility during COVID-19: A United States case study

Authors: Justin Elarde, Joon-Seok Kim, Hamdi Kavak, Andreas Züfle, Taylor Anderson

Abstract: With the onset of COVID-19 and the resulting shelter in place guidelines combined with remote working practices, human mobility in 2020 has been dramatically impacted. Existing studies typically examine whether mobility in specific localities increases or decreases at specific points in time and relate these changes to certain pandemic and policy events. In this paper, we study mobility change in… ▽ More With the onset of COVID-19 and the resulting shelter in place guidelines combined with remote working practices, human mobility in 2020 has been dramatically impacted. Existing studies typically examine whether mobility in specific localities increases or decreases at specific points in time and relate these changes to certain pandemic and policy events. In this paper, we study mobility change in the US through a five-step process using mobility footprint data. (Step 1) Propose the delta Time Spent in Public Places (Delta-TSPP) as a measure to quantify daily changes in mobility for each US county from 2019-2020. (Step 2) Conduct Principal Component Analysis (PCA) to reduce the Delta-TSPP time series of each county to lower-dimensional latent components of change in mobility. (Step 3) Conduct clustering analysis to find counties that exhibit similar latent components. (Step 4) Investigate local and global spatial autocorrelation for each component. (Step 5) Conduct correlation analysis to investigate how various population characteristics and behavior correlate with mobility patterns. Results show that by describing each county as a linear combination of the three latent components, we can explain 59% of the variation in mobility trends across all US counties. Specifically, change in mobility in 2020 for US counties can be explained as a combination of three latent components: 1) long-term reduction in mobility, 2) no change in mobility, and 3) short-term reduction in mobility. We observe significant correlations between the three latent components of mobility change and various population characteristics, including political leaning, population, COVID-19 cases and deaths, and unemployment. We find that our analysis provides a comprehensive understanding of mobility change in response to the COVID-19 pandemic. △ Less

Submitted 18 September, 2021; originally announced September 2021.

Comments: Current under review at PLOS One

arXiv:2009.01121 [pdf, other]

Uncertain Spatial Data Management:An Overview

Authors: Andreas Zuefle

Abstract: Both the current trends in technology such as smartphones, general mobile devices, stationary sensors, and satellites as well as a new user mentality of using this technology to voluntarily share enriched location information produces a flood of geo-spatial and geo-spatiotemporal data. This data flood provides tremendous potential for discovering new and useful knowledge. But in addition to the fa… ▽ More Both the current trends in technology such as smartphones, general mobile devices, stationary sensors, and satellites as well as a new user mentality of using this technology to voluntarily share enriched location information produces a flood of geo-spatial and geo-spatiotemporal data. This data flood provides tremendous potential for discovering new and useful knowledge. But in addition to the fact that measurements are imprecise, spatial data is often interpolated between discrete observations. To reduce communication and bandwidth utilization, data is often subjected to a reduction, thereby eliminating some of the known/recorded values. These issues introduce the notion of uncertainty in spatial data management, an aspect raising the imminent need for scalable and flexible solutions. The main scope of this chapter is to survey existing techniques for managing, querying, and mining uncertain spatial data. First, this chapter surveys common data representations for uncertain data, explains the commonly used possible worlds semantics to interpret an uncertain database, and surveys existing system to process uncertain data. Then, this chapter defines the notion of probabilistic result semantics to distinguish the task of computing individual object probabilities versus computing entire result probabilities. This is important, as, for many queries, the problem of computing object-level probabilities can be solved efficiently, whereas result-level probabilities are hard to compute. Finally, this chapter introduces a novel paradigm to efficiently answer any kind of query on uncertain data: the Paradigm of Equivalent Worlds, which groups the exponential set of possible database worlds into a polynomial number of sets of equivalent worlds that can be processed efficiently. Examples and use-cases of querying uncertain spatial data are provided using the example of uncertain range queries. △ Less

Submitted 2 September, 2020; originally announced September 2020.

arXiv:2004.12828 [pdf, other]

Station-to-User Transfer Learning: Towards Explainable User Clustering Through Latent Trip Signatures Using Tidal-Regularized Non-Negative Matrix Factorization

Authors: Liming Zhang, Andreas Züfle, Dieter Pfoser

Abstract: Urban areas provide us with a treasure trove of available data capturing almost every aspect of a population's life. This work focuses on mobility data and how it will help improve our understanding of urban mobility patterns. Readily available and sizable farecard data captures trips in a public transportation network. However, such data typically lacks temporal modalities and as such the task of… ▽ More Urban areas provide us with a treasure trove of available data capturing almost every aspect of a population's life. This work focuses on mobility data and how it will help improve our understanding of urban mobility patterns. Readily available and sizable farecard data captures trips in a public transportation network. However, such data typically lacks temporal modalities and as such the task of inferring trip semantic, station function, and user profile is quite challenging. As existing approaches either focus on station-level or user-level signals, they are prone to overfitting and generate less credible and insightful results. To properly learn such characteristics from trip data, we propose a Collective Learning Framework through Latent Representation, which augments user-level learning with collective patterns learned from station-level signals. This framework uses a novel, so-called Tidal-Regularized Non-negative Matrix Factorization method, which incorporates domain knowledge in the form of temporal passenger flow patterns in generic Non-negative Matrix Factorization. To evaluate our model performance, a user stability test based on the classical Rand Index is introduced as a metric to benchmark different unsupervised learning models. We provide a qualitative analysis of the station functions and user profiles for the Washington D.C. metro and show how our method supports spatiotemporal intra-city mobility exploration. △ Less

Submitted 27 April, 2020; originally announced April 2020.

arXiv:2001.05581 [pdf, other]

Complete and Sufficient Spatial Domination of Multidimensional Rectangles

Authors: Tobias Emrich, Hans-Peter Kriegel, Andreas Züfle, Peer Kröger, Matthias Renz

Abstract: Rectangles are used to approximate objects, or sets of objects, in a plethora of applications, systems and index structures. Many tasks, such as nearest neighbor search and similarity ranking, require to decide if objects in one rectangle A may, must, or must not be closer to objects in a second rectangle B, than objects in a third rectangle R. To decide this relation of "Spatial Domination" it ca… ▽ More Rectangles are used to approximate objects, or sets of objects, in a plethora of applications, systems and index structures. Many tasks, such as nearest neighbor search and similarity ranking, require to decide if objects in one rectangle A may, must, or must not be closer to objects in a second rectangle B, than objects in a third rectangle R. To decide this relation of "Spatial Domination" it can be shown that using minimum and maximum distances it is often impossible to detect spatial domination. This spatial gem provides a necessary and sufficient decision criterion for spatial domination that can be computed efficiently even in higher dimensional space. In addition, this spatial gem provides an example, pseudocode and an implementation in Python. △ Less

Submitted 15 January, 2020; originally announced January 2020.

arXiv:1701.05395 [pdf, other]

doi 10.1109/TKDE.2017.2780123

Efficient Information Flow Maximization in Probabilistic Graphs

Authors: Christian Frey, Andreas Züfle, Tobias Emrich, Matthias Renz

Abstract: Reliable propagation of information through large networks, e.g., communication networks, social networks or sensor networks is very important in many applications concerning marketing, social networks, and wireless sensor networks. However, social ties of friendship may be obsolete, and communication links may fail, inducing the notion of uncertainty in such networks. In this paper, we address th… ▽ More Reliable propagation of information through large networks, e.g., communication networks, social networks or sensor networks is very important in many applications concerning marketing, social networks, and wireless sensor networks. However, social ties of friendship may be obsolete, and communication links may fail, inducing the notion of uncertainty in such networks. In this paper, we address the problem of optimizing information propagation in uncertain networks given a constrained budget of edges. We show that this problem requires to solve two NP-hard subproblems: the computation of expected information flow, and the optimal choice of edges. To compute the expected information flow to a source vertex, we propose the F-tree as a specialized data structure, that identifies independent components of the graph for which the information flow can either be computed analytically and efficiently, or for which traditional Monte-Carlo sampling can be applied independently of the remaining network. For the problem of finding the optimal edges, we propose a series of heuristics that exploit properties of this data structure. Our evaluation shows that these heuristics lead to high quality solutions, thus yielding high information flow, while maintaining low running time. △ Less

Submitted 5 May, 2018; v1 submitted 19 January, 2017; originally announced January 2017.

Journal ref: IEEE Transactions on Knowledge & Data Engineering, vol. 30, no. 5, pp. 880-894, 2018

arXiv:1409.2585 [pdf, other]

doi 10.1145/2666310.2666485

Towards Knowledge-Enriched Path Computation

Authors: Georgios Skoumas, Klaus Arthur Schmid, Gregor Jossé, Andreas Züfle, Mario A. Nascimento, Matthias Renz, Dieter Pfoser

Abstract: Directions and paths, as commonly provided by navigation systems, are usually derived considering absolute metrics, e.g., finding the shortest path within an underlying road network. With the aid of crowdsourced geospatial data we aim at obtaining paths that do not only minimize distance but also lead through more popular areas using knowledge generated by users. We extract spatial relations such… ▽ More Directions and paths, as commonly provided by navigation systems, are usually derived considering absolute metrics, e.g., finding the shortest path within an underlying road network. With the aid of crowdsourced geospatial data we aim at obtaining paths that do not only minimize distance but also lead through more popular areas using knowledge generated by users. We extract spatial relations such as "nearby" or "next to" from travel blogs, that define closeness between pairs of points of interest (PoIs) and quantify each of these relations using a probabilistic model. Subsequently, we create a relationship graph where each node corresponds to a PoI and each edge describes the spatial connection between the respective PoIs. Using Bayesian inference we obtain a probabilistic measure of spatial closeness according to the crowd. Applying this measure to the corresponding road network, we obtain an altered cost function which does not exclusively rely on distance, and enriches an actual road networks taking crowdsourced spatial relations into account. Finally, we propose two routing algorithms on the enriched road networks. To evaluate our approach, we use Flickr photo data as a ground truth for popularity. Our experimental results -- based on real world datasets -- show that the paths computed w.r.t.\ our alternative cost function yield competitive solutions in terms of path length while also providing more "popular" paths, making routing easier and more informative for the user. △ Less

Submitted 9 September, 2014; originally announced September 2014.

Comments: Accepted as a short paper at ACM SIGSPATIAL GIS 2014

ACM Class: H.2.8

arXiv:1305.3407 [pdf, other]

Probabilistic Nearest Neighbor Queries on Uncertain Moving Object Trajectories

Authors: Johannes Niedermayer, Andreas Züfle, Tobias Emrich, Matthias Renz, Nikos Mamoulis, Lei Chen, Hans-Peter Kriegel

Abstract: Nearest neighbor (NN) queries in trajectory databases have received significant attention in the past, due to their application in spatio-temporal data analysis. Recent work has considered the realistic case where the trajectories are uncertain; however, only simple uncertainty models have been proposed, which do not allow for accurate probabilistic search. In this paper, we fill this gap by addre… ▽ More Nearest neighbor (NN) queries in trajectory databases have received significant attention in the past, due to their application in spatio-temporal data analysis. Recent work has considered the realistic case where the trajectories are uncertain; however, only simple uncertainty models have been proposed, which do not allow for accurate probabilistic search. In this paper, we fill this gap by addressing probabilistic nearest neighbor queries in databases with uncertain trajectories modeled by stochastic processes, specifically the Markov chain model. We study three nearest neighbor query semantics that take as input a query state or trajectory $q$ and a time interval. For some queries, we show that no polynomial time solution can be found. For problems that can be solved in PTIME, we present exact query evaluation algorithms, while for the general case, we propose a sophisticated sampling approach, which uses Bayesian inference to guarantee that sampled trajectories conform to the observation data stored in the database. This sampling approach can be used in Monte-Carlo based approximation solutions. We include an extensive experimental study to support our theoretical results. △ Less

Submitted 20 January, 2014; v1 submitted 15 May, 2013; originally announced May 2013.

Comments: 12 pages

Journal ref: PVLDB 7(3): 205-216 (2013)

arXiv:1103.0172 [pdf, other]

Inverse Queries For Multidimensional Spaces

Authors: Thomas Bernecker, Tobias Emrich, Hans-Peter Kriegel, Nikos Mamoulis, Matthias Renz, Shiming Zhang, Andreas Züfle

Abstract: Traditional spatial queries return, for a given query object $q$, all database objects that satisfy a given predicate, such as epsilon range and $k$-nearest neighbors. This paper defines and studies {\em inverse} spatial queries, which, given a subset of database objects $Q$ and a query predicate, return all objects which, if used as query objects with the predicate, contain $Q$ in their result. W… ▽ More Traditional spatial queries return, for a given query object $q$, all database objects that satisfy a given predicate, such as epsilon range and $k$-nearest neighbors. This paper defines and studies {\em inverse} spatial queries, which, given a subset of database objects $Q$ and a query predicate, return all objects which, if used as query objects with the predicate, contain $Q$ in their result. We first show a straightforward solution for answering inverse spatial queries for any query predicate. Then, we propose a filter-and-refinement framework that can be used to improve efficiency. We show how to apply this framework on a variety of inverse queries, using appropriate space pruning strategies. In particular, we propose solutions for inverse epsilon range queries, inverse $k$-nearest neighbor queries, and inverse skyline queries. Our experiments show that our framework is significantly more efficient than naive approaches. △ Less

Submitted 5 May, 2011; v1 submitted 1 March, 2011; originally announced March 2011.

arXiv:1101.2613 [pdf, other]

A Novel Probabilistic Pruning Approach to Speed Up Similarity Queries in Uncertain Databases

Authors: Thomas Bernecker, Tobias Emrich, Hans-Peter Kriegel, Nikos Mamoulis, Matthias Renz, Andreas Zuefle

Abstract: In this paper, we propose a novel, effective and efficient probabilistic pruning criterion for probabilistic similarity queries on uncertain data. Our approach supports a general uncertainty model using continuous probabilistic density functions to describe the (possibly correlated) uncertain attributes of objects. In a nutshell, the problem to be solved is to compute the PDF of the random variabl… ▽ More In this paper, we propose a novel, effective and efficient probabilistic pruning criterion for probabilistic similarity queries on uncertain data. Our approach supports a general uncertainty model using continuous probabilistic density functions to describe the (possibly correlated) uncertain attributes of objects. In a nutshell, the problem to be solved is to compute the PDF of the random variable denoted by the probabilistic domination count: Given an uncertain database object B, an uncertain reference object R and a set D of uncertain database objects in a multi-dimensional space, the probabilistic domination count denotes the number of uncertain objects in D that are closer to R than B. This domination count can be used to answer a wide range of probabilistic similarity queries. Specifically, we propose a novel geometric pruning filter and introduce an iterative filter-refinement strategy for conservatively and progressively estimating the probabilistic domination count in an efficient way while kee** correctness according to the possible world semantics. In an experimental evaluation, we show that our proposed technique allows to acquire tight probability bounds for the probabilistic domination count quickly, even for large uncertain databases. △ Less

Submitted 5 May, 2011; v1 submitted 13 January, 2011; originally announced January 2011.

arXiv:1008.2300 [pdf, other]

Probabilistic Frequent Pattern Growth for Itemset Mining in Uncertain Databases (Technical Report)

Authors: Thomas Bernecker, Hans-Peter Kriegel, Matthias Renz, Florian Verhein, Andreas Züfle

Abstract: Frequent itemset mining in uncertain transaction databases semantically and computationally differs from traditional techniques applied on standard (certain) transaction databases. Uncertain transaction databases consist of sets of existentially uncertain items. The uncertainty of items in transactions makes traditional techniques inapplicable. In this paper, we tackle the problem of finding proba… ▽ More Frequent itemset mining in uncertain transaction databases semantically and computationally differs from traditional techniques applied on standard (certain) transaction databases. Uncertain transaction databases consist of sets of existentially uncertain items. The uncertainty of items in transactions makes traditional techniques inapplicable. In this paper, we tackle the problem of finding probabilistic frequent itemsets based on possible world semantics. In this context, an itemset X is called frequent if the probability that X occurs in at least minSup transactions is above a given threshold. We make the following contributions: We propose the first probabilistic FP-Growth algorithm (ProFP-Growth) and associated probabilistic FP-Tree (ProFP-Tree), which we use to mine all probabilistic frequent itemsets in uncertain transaction databases without candidate generation. In addition, we propose an efficient technique to compute the support probability distribution of an itemset in linear time using the concept of generating functions. An extensive experimental section evaluates the our proposed techniques and shows that our ProFP-Growth approach is significantly faster than the current state-of-the-art algorithm. △ Less

Submitted 13 August, 2010; originally announced August 2010.

Comments: Technical Report, 21 pages

arXiv:0907.2868 [pdf, other]

Scalable Probabilistic Similarity Ranking in Uncertain Databases (Technical Report)

Authors: Thomas Bernecker, Hans-Peter Kriegel, Nikos Mamoulis, Matthias Renz, Andreas Zuefle

Abstract: This paper introduces a scalable approach for probabilistic top-k similarity ranking on uncertain vector data. Each uncertain object is represented by a set of vector instances that are assumed to be mutually-exclusive. The objective is to rank the uncertain data according to their distance to a reference object. We propose a framework that incrementally computes for each object instance and ran… ▽ More This paper introduces a scalable approach for probabilistic top-k similarity ranking on uncertain vector data. Each uncertain object is represented by a set of vector instances that are assumed to be mutually-exclusive. The objective is to rank the uncertain data according to their distance to a reference object. We propose a framework that incrementally computes for each object instance and ranking position, the probability of the object falling at that ranking position. The resulting rank probability distribution can serve as input for several state-of-the-art probabilistic ranking models. Existing approaches compute this probability distribution by applying a dynamic programming approach of quadratic complexity. In this paper we theoretically as well as experimentally show that our framework reduces this to a linear-time complexity while having the same memory requirements, facilitated by incremental accessing of the uncertain vector instances in increasing order of their distance to the reference object. Furthermore, we show how the output of our method can be used to apply probabilistic top-k ranking for the objects, according to different state-of-the-art definitions. We conduct an experimental evaluation on synthetic and real data, which demonstrates the efficiency of our approach. △ Less

Submitted 16 July, 2009; originally announced July 2009.

Showing 1–17 of 17 results for author: Züfle, A