A Zero Auxiliary Knowledge Membership Inference Attack on Aggregate Location Data
Abstract.
Location data is frequently collected from populations and shared in aggregate form to guide policy and decision making. However, the prevalence of aggregated data also raises the privacy concern of membership inference attacks (MIAs). MIAs infer whether an individual’s data contributed to the aggregate release. Although effective MIAs have been developed for aggregate location data, these require access to an extensive auxiliary dataset of individual traces over the same locations, which are collected from a similar population. This assumption is often impractical given common privacy practices surrounding location data. To measure the risk of an MIA performed by a realistic adversary, we develop the first Zero Auxiliary Knowledge (ZK) MIA on aggregate location data, which eliminates the need for an auxiliary dataset of real individual traces. Instead, we develop a novel synthetic approach, such that suitable synthetic traces are generated from the released aggregate. We also develop methods to correct for bias and noise, to show that our synthetic-based attack is still applicable when privacy mechanisms are applied prior to release. Using two large-scale location datasets, we demonstrate that our ZK MIA matches the state-of-the-art Knock-Knock (KK) MIA across a wide range of settings, including popular implementations of differential privacy (DP) and suppression of small counts. Furthermore, we show that ZK MIA remains highly effective even when the adversary only knows a small fraction (10%) of their target’s location history. This demonstrates that effective MIAs can be performed by realistic adversaries, highlighting the need for strong DP protection.
1. Introduction
Human mobility and location data are widely used across many important domains, such as epidemiology (Hara and Yamaguchi,, 2021; Grantz et al.,, 2020), humanitarian response (Yabe et al.,, 2022), and finance (Holmes et al.,, 2013), as they offer insights into movement and density patterns. However, many people are concerned about the extensive collection of personal location data (Van Zoonen,, 2016; Zhou,, 2017; Hope,, 2021), particularly since this data may provide information regarding a person’s social, economic, and political life (Georgiadou et al.,, 2019).
Individual-level location datasets have been shown to be highly vulnerable to re-identification attacks, due to the unicity and temporal consistency of people’s mobility patterns (Zang and Bolot,, 2011; de Montjoye et al.,, 2013; Tournier and de Montjoye,, 2022). To address these privacy concerns, data practitioners commonly use aggregate statistics, instead of individual-level records (Aktay et al.,, 2020; Popa et al.,, 2011; Xu et al.,, 2015). For example, the Public Health Agency of Canada studied citizens’ movement during the COVID-19 pandemic, using aggregate location data from millions of mobile devices, provided by TELUS (Office of the Privacy Commissioner of Canada,, 2023; Oli,, 2021). British researchers conducted similar COVID-19 mobility analysis (Jeffrey et al.,, 2020; Trasberg and Cheshire,, 2023), using aggregate location data obtained from O2 and Facebook. Because aggregate location data is often considered to be sufficiently de-identified (Office of the Privacy Commissioner of Canada,, 2023), it is commonly sold by data brokers to interested parties (Savage,, 2021; Boorstein and Kelly,, 2023). Notably, the U.S. government has been criticized for using commercial aggregate location data for law enforcement purposes and military intelligence (Savage,, 2021). Aggregate location data is also used in other sectors, such as urban design, to optimize public transit networks (Morgan and Lovelace,, 2021; Kakakhel,, 2022; O2,, 2019), and finance, to understand consumer behaviour (SafeGraph,, 2024; Precisely,, 2024).
![Refer to caption](x1.png)
Motivation. As outlined in the E.U. Article 29 Working Party’s guidance on anonymization techniques, aggregation reduces the risk of re-identification but does not eliminate all privacy risks (wp2,, 2014). In particular, aggregates may still be vulnerable to membership inference attacks (MIAs), whose goal is to infer if an individual’s data was included in the data release, e.g. aggregate data. MIAs have become the de facto standard in privacy auditing due to their practical threat model and theoretical properties. From a practical perspective, a successful MIA is a direct privacy violation whenever participation in the data release is sensitive (Li et al.,, 2013). Furthermore, MIAs can be used as building blocks for other attacks, by first inferring a user’s participation and then inferring their sensitive attributes. From a theoretical perspective, the success rate of an MIA is upper bounded following the application of differential privacy (DP) (Dwork et al.,, 2006; Yeom et al.,, 2018; Humphries et al.,, 2023). Hence, MIAs can be used as an auditing tool for DP implementations (Jagielski et al.,, 2020; Nasr et al.,, 2021). Today, MIAs are widely used to assess the privacy risk of a broad range of data releases, including aggregate genetic data (Homer et al.,, 2008; Sankararaman et al.,, 2009), aggregate survey data (Bauer and Bindschaedler,, 2020), aggregate location data (Pyrgelis et al.,, 2017; Oehmichen et al.,, 2019), machine learning models (Shokri et al.,, 2017; Jayaraman and Evans,, 2019; Nasr et al.,, 2021) and synthetic data releases (Stadler et al.,, 2022; Houssiau et al.,, 2022; Meeus et al.,, 2023; Guépin et al.,, 2023).
MIAs pose an especially strong privacy threat on aggregate location data, since location data is often processed alongside sensitive attributes, such as socioeconomic status (Trasberg and Cheshire, 2023) and vaccination status (Hope, 2021). In a notable example, a high-ranking priest resigned after being outed as homosexual by a radical group that matched his smartphone data with location data from Grindr, a popular dating app among the LGBTQ+ community (Boorstein and Kelly, 2023). It is therefore important to understand the practical risk that MIAs pose on aggregate location data, particularly by a realistic adversary, who only possesses information about their target.
The first and most prominent MIA on aggregate location data was proposed by Pyrgelis et al. (2017). Their “Knock-Knock” (KK) MIA works by training a binary classifier on a set of aggregates, wherein the adversary includes the target trace half of the time, and labels the aggregates accordingly. However, in addition to knowing the target trace, KK MIA requires the adversary to have access to a large auxiliary dataset of individual-level traces over the same locations and from a similar population as in the aggregate release. For location data, this is a very strong assumption. This reliance on a strong adversary has led companies and practitioners to dismiss the risk posed by MIAs on location data. To the best of our knowledge, all previous works studying MIAs on aggregate location data require a similar auxiliary dataset (Pyrgelis et al., 2020; Zhang et al., 2020; Oehmichen et al., 2019).
Contributions. To assess the realistic privacy risk of releasing aggregate location data, we introduce the Zero Auxiliary Knowledge (ZK) MIA. ZK MIA is the first MIA on aggregate location data that does not require the adversary to have access to an auxiliary dataset. To remove this strong assumption, we develop a novel synthetic data-based approach, in which the adversary generates a reference dataset of synthetic traces, using only statistical parameters estimated from the aggregate. Training aggregates are then created using the synthetic reference. To account for privacy mechanisms applied to the release, we develop techniques to correct the parameter estimation for bias and noise, which enables ZK MIA to effectively attack privacy-aware aggregates as well. We also demonstrate that a paired sampling technique further improves MIA performance by isolating the contribution of the target trace within the high-dimensional aggregate. In the setting of $\varepsilon$-DP aggregate location data, we show that paired sampling enables MIAs to approach the worst-case $\varepsilon$-DP bound, offering a significant increase in performance over previous implementations.
We evaluate our Zero Auxiliary Knowledge (ZK) MIA against the state-of-the-art Knock-Knock (KK) MIA from Pyrgelis et al., (2017) using two location datasets: i) a large-scale call detail record (CDR) dataset, and ii) the Milan Twitter dataset (SpazioDati and di Milano,, 2015) from the Telecom Italia Big Data Challenge (Barlacchi et al.,, 2015). We apply the MIAs on raw and privacy-aware aggregates computed over users. Our results show that our ZK MIA closely matches the performance of KK MIA, without depending on extensive prior knowledge. On raw aggregates, both MIAs achieve AUC on both datasets, suggesting that aggregation in itself is an ineffective safeguard. Both MIAs also surpass AUC on both datasets under common privacy settings, including event-level DP noise addition.
We further relax assumptions and show that the adversary does not need the full target trace for ZK MIA to succeed. Indeed, ZK MIA still achieved AUC on the CDR dataset, with event-level DP in place, when the adversary only knew a random 10% of the target trace.
After extensive evaluations across different privacy mechanisms, namely the suppression of small counts (Chen et al., 2009) and $\varepsilon$-DP noise addition (Dwork et al., 2006), we argue that the commonly used $\varepsilon$-DP implementations on aggregate location data (Desfontaines, 2021) do not protect against realistic privacy threats, such as our ZK MIA. We conclude that the only effective mitigation is the application of strong user-level DP or user-day level DP guarantees, which is not yet common practice (Telus, 2024; Martínez-Durive et al., 2023; O2, 2019; SafeGraph, 2024; Precisely, 2024).
2. Definitions and Threat Model
We formally define location traces and aggregates in Section 2.1 and overview aggregate-level privacy measures in Section 2.2. In Sections 2.3 and 2.4, we outline the membership inference problem on aggregate location data and introduce the concept of a membership classifier. We present the threat model for our Zero Auxiliary Knowledge MIA and compare it against previous threat models for MIAs on location aggregates in Section 2.5. Table 1 of the Appendix contains a glossary of common terms.
2.1. Location Traces and Aggregates
Let $S$ represent the set of all regions of interest (ROIs) where location data is collected. Similarly, $T$ denotes the set of time intervals, also known as epochs, during which data collection occurs. In this paper, we assume that the geographic positions (i.e. approximate longitude and latitude) of the ROIs are known. For example, $S$ may represent a set of square regions that partition a city into a grid, and $T$ may represent contiguous hours over one month.
We focus on the scenario where location data of a set of users $\Omega$ is collected over the ROIs $S$ and the epochs $T$. We define the location trace of a user $u \in \Omega$ as the set of geo-tagged and time-stamped visits that $u$ made within $S$ during $T$. We formally represent a user’s location trace as the binary matrix $L^u \in \{0,1\}^{|S| \times |T|}$, where

$$L^u[s,t] = \begin{cases} 1 & \text{if } u \text{ visited ROI } s \text{ during epoch } t, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

![Refer to caption](x2.png)
Let $U \subseteq \Omega$ be a group of users whose location data is aggregated. We define an aggregate $A^U$ to be the aggregate count statistics for $U$ over $S \times T$. Formally, this is defined by the sum

$$A^U = \sum_{u \in U} L^u. \tag{2}$$

The entry $A^U[s,t]$ therefore corresponds to the number of users in $U$ who visited ROI $s$ during epoch $t$.
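The definitions of a binary trace and of an aggregate as an entrywise sum of traces can be sketched in a few lines. This is a minimal illustration only: the function names `trace_matrix` and `aggregate`, the integer indexing of ROIs and epochs, and the toy users are assumptions made for the example, not part of the paper.

```python
def trace_matrix(visits, n_rois, n_epochs):
    """Binary trace: entry [s][t] is 1 if the user visited ROI s during epoch t."""
    trace = [[0] * n_epochs for _ in range(n_rois)]
    for s, t in visits:
        trace[s][t] = 1  # repeated visits to the same (s, t) still give 1
    return trace


def aggregate(traces):
    """Entrywise sum of binary traces: number of users per (ROI, epoch) pair."""
    n_rois, n_epochs = len(traces[0]), len(traces[0][0])
    counts = [[0] * n_epochs for _ in range(n_rois)]
    for trace in traces:
        for s in range(n_rois):
            for t in range(n_epochs):
                counts[s][t] += trace[s][t]
    return counts


# Toy example (hypothetical users): 2 ROIs, 3 epochs.
alice = trace_matrix([(0, 0), (1, 2)], n_rois=2, n_epochs=3)
bob = trace_matrix([(0, 0), (0, 1), (0, 0)], n_rois=2, n_epochs=3)
counts = aggregate([alice, bob])
```

Because each trace is binary, an aggregate entry can never exceed the number of users in the group.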
2.2. Privacy Measures on Location Aggregates
The data collector may be wary of the privacy risks of releasing the raw aggregate, and may therefore apply privacy measures before releasing it.
2.2.1. Differential Privacy
Differential privacy (DP) (Dwork et al., 2006) is considered the gold standard for releasing information while protecting the privacy of individuals with formal guarantees. In essence, DP requires that the output of a computation over a dataset should not depend too much on the inclusion of any one record.
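Definition 0 ($\varepsilon$-DP (Dwork et al., 2006)).
A randomized algorithm $\mathcal{M}$ satisfies $\varepsilon$-DP if for all neighbouring datasets $D, D'$ (i.e., differing in exactly one record), and all possible sets of outputs $O \subseteq \mathrm{Range}(\mathcal{M})$:

$$\Pr[\mathcal{M}(D) \in O] \le e^{\varepsilon} \Pr[\mathcal{M}(D') \in O]. \tag{3}$$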
Thus, $\varepsilon$-DP limits the amount of information that can be inferred about individual records in the dataset, according to the privacy budget $\varepsilon$ (Dwork et al., 2006). However, the privacy protection depends on what one considers as a “record”, or privacy unit, when defining the neighbouring datasets. The most common definitions, in increasing level of privacy protection, are event-level, user-day level, and user-level DP (see Desfontaines (2021) for an overview). The privacy unit for event-level DP is an individual data entry by any given user. For aggregate location data, this would be a single visit by a user to a ROI during an epoch. The privacy unit for user-day level DP would be all visits registered by any given user over a day. Finally, for user-level DP, the unit would be all visits in any given user’s trace.
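Randomised $\varepsilon$-DP mechanisms can be designed by adding noise sampled from the Laplace distribution (Dwork et al., 2006). Adding independent noise drawn from $\mathrm{Lap}(\Delta/\varepsilon)$ to each count of the aggregate would satisfy (3) and be an $\varepsilon$-DP aggregation mechanism, where $\Delta$ is the global sensitivity, determined by the privacy unit.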
In this paper, we assume the common practice of post-processing to ensure legitimate aggregate counts (Ge and Fukuda, 2016; Pyrgelis et al., 2017, 2020; Zhu et al., 2022). Negative counts are set to $0$, counts exceeding the group size are set to the group size, and counts are rounded down to the nearest integer. These transformations preserve $\varepsilon$-DP due to the post-processing theorem (Dwork et al., 2019). We note that the adversary can always apply these transformations themselves if the data collector does not do so already.
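As a rough sketch of the noise addition and post-processing described above, the following adds Laplace noise to every count and then clamps and rounds down. The function names and parameters (`dp_aggregate`, `sensitivity`, `group_size`) are illustrative assumptions; the sensitivity must be chosen by the reader according to the privacy unit, and the Laplace sample is drawn as the difference of two exponential variates, a standard construction.

```python
import math
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_aggregate(counts, epsilon, sensitivity, group_size, rng):
    """Add Laplace noise to each count, then post-process to legitimate counts."""
    scale = sensitivity / epsilon  # noise scale of the Laplace mechanism
    released = []
    for row in counts:
        out = []
        for count in row:
            value = count + laplace_noise(scale, rng)
            value = min(max(value, 0), group_size)  # clamp to [0, group size]
            out.append(math.floor(value))           # round down to an integer
        released.append(out)
    return released


# Toy usage with assumed parameters (epsilon = 1, sensitivity = 1, 2 users).
rng = random.Random(0)
noisy = dp_aggregate([[2, 1, 0], [0, 0, 1]], epsilon=1.0, sensitivity=1.0,
                     group_size=2, rng=rng)
```

The clamping and rounding operate only on the noisy output, which is why they fall under post-processing.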
2.2.2. Suppression of Small Counts (SSC)
SSC is a privacy mechanism that aims to protect user privacy by hiding rare values. It has been frequently used across different types of datasets (Cretu et al., 2022; Gadotti et al., 2019; Pyrgelis et al., 2020), including mobility datasets (Kohli et al., 2023; Aktay et al., 2020). Instead of releasing the raw aggregate $A^U$, the data collector may choose a threshold $k$, and release the suppressed aggregate $A^U_k$, defined entrywise by

$$A^U_k[s,t] = \begin{cases} A^U[s,t] & \text{if } A^U[s,t] > k, \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$

$A^U_k$ therefore contains the true count of users who visited a ROI $s$ during epoch $t$ as long as the count exceeds $k$. Less-visited pairs $(s,t)$ that record $k$ or fewer visits are reported as $0$ instead. Suppression can also be applied following $\varepsilon$-DP noise addition, such that (4) is applied on a noisy aggregate. This produces an $\varepsilon$-DP aggregate whose final counts have been suppressed with threshold $k$. This transformation would preserve $\varepsilon$-DP due to post-processing (Dwork et al., 2019), and may add a layer of complexity that mitigates attacks in practice.
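Suppression of small counts is simple enough to state directly in code. The sketch below assumes the aggregate is stored as a list of rows and uses an illustrative function name, `suppress_small_counts`; counts that do not exceed the threshold are replaced by zero.

```python
def suppress_small_counts(counts, threshold):
    """Keep a count only if it strictly exceeds the threshold; otherwise report 0."""
    return [[c if c > threshold else 0 for c in row] for row in counts]


# Toy usage with an assumed threshold of 1.
suppressed = suppress_small_counts([[0, 1, 2], [3, 1, 5]], threshold=1)
```

The same function can be applied to a noisy aggregate to obtain a suppressed DP release.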
2.3. Problem Formulation
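We assume that the data collector releases aggregate count statistics over the ROIs and the epochs for the users in the aggregation group. There are various cases depending on the privacy measures applied prior to release (Section 2.2): the raw aggregate, the suppressed aggregate, the $\varepsilon$-DP aggregate, and the $\varepsilon$-DP aggregate with suppression.
The goal of an adversary performing an MIA on the release is to determine whether their target contributed to it, inferring IN if the target belongs to the aggregation group and OUT otherwise.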
2.4. Membership Classifier
Given an aggregate release, an adversary infers membership of the target within the aggregation group by using a binary membership classifier. Classifiers are commonly instantiated as machine learning models (Pyrgelis et al., 2017, 2020; Zhang et al., 2020), but statistical models, like the log-likelihood function, have been applied as well (Homer et al., 2008; Bauer and Bindschaedler, 2020). To train the classifier, the adversary typically creates a balanced set of labeled training aggregates, each of the same group size as the release (Pyrgelis et al., 2017; Zhang et al., 2020; Pyrgelis et al., 2020; Oehmichen et al., 2019). Half of the aggregates include the target trace and are labeled IN, and the other half are labeled OUT. Training the classifier will create a decision boundary in the underlying space of aggregate releases (Bishop and Nasrabadi, 2006). In the case of aggregate location data, the decision boundary is characterized by a hypersurface that partitions the space of ROI-by-epoch matrices into two sets.
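The construction of a balanced training set can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of any cited attack: the names `sum_traces` and `make_training_set` are hypothetical, IN aggregates are assumed to combine the target with one fewer reference trace so that every aggregate has the same group size, and labels are encoded as 1 (IN) and 0 (OUT). The classifier itself is left out.

```python
import random


def sum_traces(traces):
    """Entrywise sum of binary trace matrices stored as lists of rows."""
    n_rois, n_epochs = len(traces[0]), len(traces[0][0])
    return [[sum(tr[s][t] for tr in traces) for t in range(n_epochs)]
            for s in range(n_rois)]


def make_training_set(reference, target, group_size, n_samples, rng):
    """Alternate IN (label 1) and OUT (label 0) aggregates of equal group size."""
    samples = []
    for i in range(n_samples):
        if i % 2 == 0:
            # IN: the target plus group_size - 1 reference traces (assumption).
            group = rng.sample(reference, group_size - 1) + [target]
            label = 1
        else:
            # OUT: group_size reference traces, target excluded.
            group = rng.sample(reference, group_size)
            label = 0
        samples.append((sum_traces(group), label))
    return samples


# Toy usage: 10 random reference traces over 2 ROIs and 3 epochs.
rng = random.Random(0)
reference = [[[rng.randint(0, 1) for _ in range(3)] for _ in range(2)]
             for _ in range(10)]
target = [[1, 0, 0], [0, 0, 1]]
training = make_training_set(reference, target, group_size=4, n_samples=6, rng=rng)
```

With an even `n_samples`, exactly half of the aggregates are labeled IN; the resulting matrices can be flattened into feature vectors for any binary classifier.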
2.5. Threat Model
In this section, we present our Zero Auxiliary Knowledge (ZK) MIA threat model. It is commonly assumed in MIAs across various domains that the adversary has access to an auxiliary dataset and complete knowledge of the target record (Pyrgelis et al.,, 2017; Nasr et al.,, 2019; Salem et al.,, 2018; Truex et al.,, 2019; Yeom et al.,, 2018; Shokri et al.,, 2017). Our ZK threat model relaxes both assumptions by eliminating the need for an auxiliary dataset and allowing for only partial knowledge of the target trace.
For context, we also describe threat models of previous MIAs on aggregate location data. All threat models consider an adversary whose goal is to determine whether a specific target user is included in the released aggregate. The aggregate is computed across a group of users over the ROIs and the epochs. We assume that the locations of the ROIs are known.
Knock-Knock (Pyrgelis et al., 2017): The adversary has an auxiliary dataset of user traces, over the same locations and a similar population as the released aggregate. The auxiliary dataset has at least as many traces as the aggregation group, including the full target trace.
LocMIA (Zhang et al., 2020): The adversary knows the target’s social network and has an auxiliary dataset of user traces, over the same locations and a similar population as the released aggregate. The auxiliary dataset has at least as many traces as the aggregation group, including the traces of the target’s friends, but not the target trace.
Zero Auxiliary Knowledge (ours): The adversary knows a subset of the target’s visits. Equivalently, the adversary knows a partial target trace, all of whose visits also appear in the full target trace.
KK MIA and LocMIA are reliant on the adversary’s access to an extensive auxiliary dataset. In particular, the adversary samples individual traces from the auxiliary dataset to create training aggregates. These traces are also assumed to range over the same locations, and belong to a similar population as the traces aggregated in the release, in order to properly train the membership classifier (Section 2.4). However, individual traces are known to be sensitive (de Montjoye et al., 2013), and are unlikely to be made available, particularly when the data is aggregated as a privacy measure. Furthermore, because the auxiliary dataset must contain at least as many traces as the aggregation group, this assumption is impractical for even moderately sized aggregates. Although LocMIA removes prior knowledge about the target trace, it must assume knowledge of the traces of the target’s friends to create a suitable proxy. More importantly, LocMIA still requires a large auxiliary dataset of individual traces.
In contrast, our Zero Auxiliary Knowledge threat model only requires that the adversary has knowledge about the target’s location history. We emphasize that the adversary does not need to know the full trace. Our threat model encompasses the case where only a few of the target’s visits are known to the adversary. For example, the adversary may infer some of the target’s visits from social media activity or direct observation.
3. Related Work
MIAs on Aggregate Location Data. MIAs on aggregate location data have been shown to be successful on multiple location datasets (Pyrgelis et al., 2017, 2020; Zhang et al., 2020), using a binary classifier to perform the inference task. The performance of the MIAs on small aggregates () is especially well-studied, as the influence of the target is easier to distinguish (Pyrgelis et al., 2017). For example, the KK MIA by Pyrgelis et al. (2017), achieved when attacking size aggregates across two different mobility datasets. LocMIA (Zhang et al., 2020) is another MIA on aggregate location data, which removes prior knowledge about the target trace. Instead, LocMIA assumes access to social network information, and the traces of the target’s friends, in order to construct a proxy for the target’s real trace. However, both KK MIA and LocMIA crucially require the adversary to have access to a large auxiliary dataset to train the binary classifier. In contrast, our ZK MIA does not require any auxiliary dataset, and only requires partial information about the target trace (e.g. a random proportion of their visits). ZK MIA therefore addresses the research gap of the MIA risk posed by a less knowledgeable attacker. The distinctions in prior knowledge are discussed in depth in Section 2.5. Our ZK MIA also features a novel approach, being the first MIA on aggregate location data to use synthetic trace generation.
Generation of Synthetic Location Traces. There are many techniques for generating synthetic location traces that capture realistic human mobility patterns (Kulkarni and Garbinato,, 2017; Kulkarni et al.,, 2018; Ouyang et al.,, 2018; Karagiannis et al.,, 2007; Jahromi et al.,, 2016; Lee et al.,, 2009). However, since our ZK MIA requires generating suitable synthetic traces without using additional information, this heavily limits the scope of applicable techniques. RNNs, GANs, and copulas have been used to generate synthetic traces that collectively approximate a real mobility dataset (Kulkarni and Garbinato,, 2017; Kulkarni et al.,, 2018; Ouyang et al.,, 2018). However, these techniques require real traces to train the model, which the ZK adversary does not have. Many of the state-of-the-art mobility models are also unsuitable because they simulate small-scale continuous trajectories (e.g. walks on campus) (Karagiannis et al.,, 2007; Lee et al.,, 2009). In contrast, location aggregates typically comprise discrete traces over a metropolitan region. We therefore identified a probabilistic unicity model by Farzanehfar et al., (2021), which requires only four statistical parameters to guide the synthetic generation. We demonstrate that we can non-trivially adapt this model for the ZK MIA. In particular, we develop methods to precisely estimate these parameters from the aggregates in order to produce realistic synthetic location aggregates.
MIAs with Reduced Auxiliary Data. Previous attempts have been made (Shokri et al.,, 2017; Salem et al.,, 2018; Truex et al.,, 2019; Yeom et al.,, 2018; Creţu et al.,, 2021; Guépin et al.,, 2023) to relax the standard assumption of an adversary’s access to an auxiliary dataset that has high statistical similarity with the attacked dataset, e.g. sampled from the same distribution (Shokri et al.,, 2017; Nasr et al.,, 2019). In the setting of machine learning (ML) models, where an MIA infers whether a record was a part of the ML model’s training set, Shokri et al., (2017) proposed an MIA without auxiliary data. Instead, they use synthetic data, which they generate using the ML model’s confidence scores. Similarly, Salem et al., (2018) trained an MIA using unrelated data, e.g., training on text data to attack an image model. These approaches require access to the ML model, and they train on features that are specific to ML models, such as the top- confidence scores, which may be shared across ML models pertaining to different types of data.
![Refer to caption](x3.png)
In contrast, the features used to train an MIA on aggregate location data explicitly depend on the specific regions, times, and population over which the aggregates are computed. In the context of synthetic generators, Guépin et al., (2023) performed an MIA against synthetic data using the released synthetic dataset as the auxiliary dataset. However, in the case of aggregate location data, the release cannot be directly used as a reference dataset, since it does not contain individual records.
4. Methodology
We dedicate Sections 4.1-4.3 to explaining the synthetic-based methodology of our Zero Auxiliary Knowledge MIA. Section 4.4 explains the paired sampling mechanism on training aggregates, which boosts the performance of ZK MIA and KK MIA, as shown in Section 6.4.
4.1. Zero Auxiliary Knowledge MIA Framework
We implement the Zero Auxiliary Knowledge MIA as a binary classifier. However, whereas the adversary in KK MIA and LocMIA uses an auxiliary dataset as a reference for creating training aggregates, the ZK adversary instead uses a reference of synthetic traces created from the released aggregate. Furthermore, if the full target trace is not known, the adversary may instead use a partial trace when creating the training aggregates. Figure 3 illustrates ZK MIA’s overall attack architecture.
4.2. Generating Synthetic Traces from Aggregate Location Data
In order to generate synthetic traces for our Zero Auxiliary Knowledge MIA, we adapt a probabilistic mobility model (Farzanehfar et al., 2021). Farzanehfar et al. (2021) developed this model to reproduce unicity patterns in large populations. The model requires four statistical parameters, described below and illustrated in Figure 16 of the Appendix. Recall that the probability mass function (p.m.f.) of a discrete random variable maps each possible value to its corresponding probability.

1. The marginal space distribution is a p.m.f. over the ROIs, such that the probability of each ROI is proportional to the number of visits to that ROI by users in the population across all epochs.
2. The marginal time distribution is a p.m.f. over the epochs, such that the probability of each epoch is proportional to the number of visits during that epoch by users in the population across all ROIs.
3. The marginal activity distribution models the total number of visits recorded within the ROIs during the observation period by a user drawn from the population.
4. The Delaunay triangulation is a triangulation with vertices corresponding to the set of positions (longitude and latitude) of the ROIs. It has the property that no vertex lies inside the circumcircle of any of its triangles (Delaunay et al., 1934).
We note that the Delaunay triangulation is determined by the locations of the ROIs. Since the locations of the ROIs are assumed to be known (Section 2.5), the triangulation can be immediately obtained from the release. We explain how the other statistical inputs, the three marginal distributions, can be approximated from the released aggregate in Section 4.3.
We now describe our procedure, adapted from Farzanehfar et al. (2021), for generating synthetic traces using the four inputs. For each synthetic trace, we first sample the number of visits from the activity marginal. This determines the number of nonzero entries in the trace matrix. Second, we sample an origin ROI from the space marginal, and a connected sub-graph of the Delaunay triangulation that contains the origin. The vertices of this sub-graph correspond to the set of ROIs that may be visited in the trace. This is done to emulate the natural tendency to move to and from the same proximate locations (e.g. home and work). Finally, we sample the spatiotemporal visits, setting the corresponding entry of the trace matrix to $1$ for each. For each visit, the ROI is sampled from the space marginal restricted to the sub-graph, and the epoch is sampled from the time marginal. All the sampling steps are independent.
We make two modifications to the original algorithm (Farzanehfar et al., 2021). First, to avoid over-saturating unpopular regions, we sample the origin ROI according to the space marginal rather than uniformly. Second, to allow users to visit multiple ROIs within the same epoch, we sample the epochs with replacement rather than without replacement. Our procedure is summarized in Algorithm 1 of the Appendix.
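The generation steps above can be sketched as follows. This is not Algorithm 1 itself but a simplified illustration with several assumptions: the Delaunay triangulation is supplied as a precomputed neighbour dictionary, the sub-graph is grown by random frontier expansion up to a hypothetical `subgraph_size` parameter (the text does not say how its size is chosen), and the number of visits is passed in directly rather than sampled from an activity marginal.

```python
import random


def sample_connected_rois(origin, neighbours, size, rng):
    """Grow a connected set of ROIs around the origin by random frontier expansion."""
    chosen = {origin}
    frontier = set(neighbours[origin]) - chosen
    while frontier and len(chosen) < size:
        nxt = rng.choice(sorted(frontier))  # sorted for reproducibility
        chosen.add(nxt)
        frontier.update(neighbours[nxt])
        frontier -= chosen
    return sorted(chosen)


def synthetic_trace(space_marginal, time_marginal, n_visits, neighbours,
                    subgraph_size, rng):
    """Sample one binary trace from the marginals and the ROI neighbour graph."""
    n_rois, n_epochs = len(space_marginal), len(time_marginal)
    # Origin ROI drawn from the space marginal rather than uniformly.
    origin = rng.choices(range(n_rois), weights=space_marginal)[0]
    rois = sample_connected_rois(origin, neighbours, subgraph_size, rng)
    roi_weights = [space_marginal[s] for s in rois]  # marginal restricted to sub-graph
    trace = [[0] * n_epochs for _ in range(n_rois)]
    for _ in range(n_visits):
        s = rng.choices(rois, weights=roi_weights)[0]
        # Epochs are drawn with replacement, so (s, t) pairs may repeat.
        t = rng.choices(range(n_epochs), weights=time_marginal)[0]
        trace[s][t] = 1
    return trace


# Toy usage: 4 ROIs whose adjacency is assumed to come from a triangulation.
rng = random.Random(0)
neighbours = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
trace = synthetic_trace([0.4, 0.3, 0.2, 0.1], [0.5, 0.25, 0.25],
                        n_visits=5, neighbours=neighbours,
                        subgraph_size=2, rng=rng)
```

In this simplified version, repeated draws of the same ROI and epoch collapse into one entry, so the trace may contain fewer nonzero entries than `n_visits`.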
4.3. Obtaining Accurate Marginals
In order to generate suitable synthetic traces for ZK MIA, the adversary must estimate the three marginal distributions, computed over the full population, using only the aggregate release. In this section, we motivate and justify the techniques that we developed to obtain strong estimates. This task is especially challenging when privacy measures distort the aggregate data. We develop separate techniques to correct for bias in the case of SSC, and to correct for noise in the case of DP. The effects are shown in Figures 4 and 5 respectively. Algorithm 2 in the Appendix summarizes how we approximate all three marginals from the release.
4.3.1. Estimating Space and Time Marginals
![Refer to caption](x4.png)
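Suppose that the data collector releases the aggregate $\tilde{A}$, which may or may not have privacy measures applied. The adversary can directly compute the empirical space and time marginals, which we denote $\hat{P}_S$ and $\hat{P}_T$, from the released aggregate matrix $\tilde{A}$:

$$\hat{P}_S(s) = \frac{\sum_{t \in T} \tilde{A}[s,t]}{\sum_{s' \in S} \sum_{t' \in T} \tilde{A}[s',t']}, \tag{5}$$

$$\hat{P}_T(t) = \frac{\sum_{s \in S} \tilde{A}[s,t]}{\sum_{s' \in S} \sum_{t' \in T} \tilde{A}[s',t']}. \tag{6}$$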
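The empirical space and time marginals are the row sums and column sums of the released aggregate, each divided by the total count. A minimal sketch, assuming the aggregate is a list of rows indexed by ROI and contains at least one nonzero count (the function name `empirical_marginals` is illustrative):

```python
def empirical_marginals(counts):
    """Row sums (space) and column sums (time) of the aggregate, normalized to sum to 1."""
    total = sum(sum(row) for row in counts)  # assumed to be nonzero
    n_epochs = len(counts[0])
    space = [sum(row) / total for row in counts]
    time = [sum(row[t] for row in counts) / total for t in range(n_epochs)]
    return space, time


# Toy usage on a 2-ROI, 3-epoch aggregate.
space, time = empirical_marginals([[2, 1, 0], [0, 0, 1]])
```

Both outputs are probability mass functions, so each sums to one.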
Raw Aggregate: In the case where the released aggregate provides the raw counts, the empirical marginals tend to be highly accurate. An example is shown in Figure 13 in the Appendix. The accuracy of these estimates is intuitive because we expect the mobility patterns of an aggregation group to resemble those of the population. Thus, we use the empirical marginals directly as our estimates if the released aggregate is unmodified. In what follows, statements about the empirical marginal apply to both the space and time marginals.
Suppressed Aggregate: However, if the data collector applies SSC with a threshold, then this will systematically bias the empirical marginal, because popular ROIs and epochs are more likely to evade suppression. Suppression therefore reduces the observed probabilities of less popular entries and boosts the probabilities of more popular entries.
To correct the bias, we flatten the empirical estimate by boosting low frequency counts and reducing high frequency counts. Based on the insight that the empirical marginal can be likened to an audio signal, we adapt the logarithmic compression technique used to reduce dynamic range (Müller, 2015; Miguel Alonso and Richard, 2004),

$$\Gamma_{\gamma}(v) = \log(1 + \gamma \cdot v), \tag{7}$$

where the scaling factor $\gamma$ regulates the compression level (Müller, 2015). In music signal processing, $v$ corresponds to the intensity of a given frequency. In our case, $v$ corresponds to probabilities within the empirical marginal. We choose
(8) |
![Refer to caption](x5.png)
to automatically parameterize the scaling factor based on the smallest observed non-zero probability. We therefore use the log-compressed empirical marginal as our estimate, where we omit the normalization constant.
We do not argue that our choice of method and parameter is optimal. However, the debiasing substantially improves the estimate, as shown in Figure 4, and it is done without additional information.
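The flattening effect of logarithmic compression can be illustrated with a short sketch. The scaling factor `gamma` is left as a free parameter here; the value used in the toy call is an arbitrary assumption and does not reproduce the automatic choice of Eq. (8). The function name `log_compress` is also illustrative.

```python
import math


def log_compress(marginal, gamma):
    """Apply log(1 + gamma * p) to each probability, then renormalize to sum to 1."""
    compressed = [math.log(1.0 + gamma * p) for p in marginal]
    total = sum(compressed)
    return [c / total for c in compressed]


# Toy usage with an assumed gamma of 10.
biased = [0.7, 0.2, 0.1]
flattened = log_compress(biased, gamma=10.0)
```

Popular entries lose probability mass and unpopular entries gain it, while entries with zero probability remain at zero.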
DP Aggregate. If $\varepsilon$-DP noise is added to each entry in the aggregate release, then the noise will overpower the signal in the computation of the empirical marginals (Eq. 6). This follows from the fact that location aggregates are high-dimensional sparse matrices (Pyrgelis et al., 2020). Therefore, conversely to the SSC case, the probabilities within the empirical marginals are compressed, since each probability is characterized mostly by thousands of independent noise samples. This effect is visualized for several different noise scales in Figure 17(a) of the Appendix. We also prove in Theorem A.4 of the Appendix that, under strong sparsity assumptions, the empirical marginal converges to a discrete uniform distribution as the number of epochs in the observation period tends to infinity.
To correct the low variance of the observed probabilities, we propose a power transformation that raises each probability to a power greater than $1$, followed by renormalization. This increases the variance, since the probabilities are in $[0,1]$. Automatically calibrating the power is a delicate matter. To do so, we start with a small power and increase it gradually until the transformed distribution achieves the target variance. Without any prior knowledge, we consider the case where each probability is randomly drawn. Equivalently, each probability is sampled at random, and then renormalized so that the total probability is $1$. The target variance is the variance of these normalized probabilities around their mean. For the space marginal, the variance is
(9) |
![Refer to caption](x6.png)
For the variance from the time marginal, we replace the number of ROIs with the number of epochs in the above equation. The algorithm for selecting the power is given in Algorithm 4 of the Appendix. Figure 5 shows that our automatically parameterized power transformation significantly improves the estimate. In contrast, the exponential transformation, which is the inverse of the log compression function, fails to denoise because the inverse is inapplicable after considering normalization.
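The calibration loop for the power transformation can be sketched as below. This is an illustration rather than Algorithm 4: the starting power of 1, the step size, the cap `max_power`, and the function names are all assumptions, and the target variance is supplied by the caller instead of being computed from Eq. (9).

```python
def variance(probabilities):
    """Population variance of a list of probabilities around their mean."""
    mean = sum(probabilities) / len(probabilities)
    return sum((p - mean) ** 2 for p in probabilities) / len(probabilities)


def power_transform(marginal, power):
    """Raise each probability to `power`, then renormalize to sum to 1."""
    raised = [p ** power for p in marginal]
    total = sum(raised)
    return [r / total for r in raised]


def calibrate_power(marginal, target_variance, step=0.05, max_power=50.0):
    """Increase the power gradually until the transformed variance reaches the target."""
    power = 1.0  # assumed starting point: the identity transformation
    while (variance(power_transform(marginal, power)) < target_variance
           and power < max_power):
        power += step
    return power


# Toy usage: a nearly uniform (noise-compressed) marginal and an assumed target.
noisy_marginal = [0.26, 0.25, 0.25, 0.24]
power = calibrate_power(noisy_marginal, target_variance=0.01)
denoised = power_transform(noisy_marginal, power)
```

The transformation preserves the ranking of the entries and only stretches the gaps between them.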
4.3.2. Estimating the activity marginal
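The released aggregate does not leak granular information about the activity marginal. However, the adversary may obtain the empirical mean number of visits per user according to the released aggregate $\tilde{A}$ over the group $U$,

$$\hat{\mu} = \frac{1}{|U|} \sum_{s \in S} \sum_{t \in T} \tilde{A}[s,t]. \tag{10}$$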
If the aggregate is raw, then we expect to be a strong estimate due to well-known regularity results about population-wide mobility activity (Schneider et al.,, 2013; Seshadri et al.,, 2008; Farzanehfar et al.,, 2021). In this case, sets .
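In assumed notation (the symbols below are illustrative and may differ from those used elsewhere in the paper), the empirical mean described above is the total released count divided by the number of users in the aggregate. Writing $A[s,t]$ for the released count at ROI $s \in S$ and epoch $t \in T$, $m$ for the aggregate size, and $\hat{\mu}$ for the estimate:

```latex
\hat{\mu} \;=\; \frac{1}{m} \sum_{s \in S} \sum_{t \in T} A[s,t]
```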
However, if either SSC or -DP is applied, then the estimate would fail. Algorithm 3 in the Appendix describes how can obtain a better estimate . Given an aggregate release of size , can use and to iteratively improve their estimate, starting with . Given guess , creates a synthetic aggregate by generating synthetic traces, parameterized by and . may then apply the same privacy measures that were applied on . is obtained by increasing or decreasing relative to the difference in counts with .
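The iterative refinement can be sketched as a simple fixed-point loop. This is not Algorithm 3 itself: the multiplicative update rule, the stopping tolerance, and the callback `simulate_total` are assumptions for illustration. The callback is hypothetical and stands for the whole pipeline of generating synthetic traces with the guessed mean, aggregating them, and applying the same privacy measures as the release.

```python
def refine_mean_visits(released_total, group_size, simulate_total,
                       n_iters=20, tol=1e-6):
    """Adjust the guessed mean number of visits per user until a synthetic,
    privacy-processed aggregate matches the total count of the release.

    simulate_total(mean_visits, group_size) is a hypothetical callback that
    returns the total count of a synthetic aggregate built from `group_size`
    synthetic traces, after the same privacy measures have been applied.
    """
    # Naive starting point: total released count per user.
    guess = released_total / group_size
    for _ in range(n_iters):
        synthetic_total = simulate_total(guess, group_size)
        if synthetic_total <= 0:
            # Everything was suppressed: the guess is far too small.
            guess *= 2.0
            continue
        ratio = released_total / synthetic_total
        if abs(ratio - 1.0) < tol:
            break
        # Assumed update rule: scale the guess by the count mismatch.
        guess *= ratio
    return guess


# Toy usage: pretend the privacy measure removes 40% of all visits.
def toy_simulator(mean_visits, group_size):
    return 0.6 * mean_visits * group_size


estimate = refine_mean_visits(released_total=720.0, group_size=100,
                              simulate_total=toy_simulator)
```

In the toy usage the naive estimate is 7.2 visits per user, while the loop recovers a value close to 12, the mean that would produce the released total once the simulated suppression is accounted for.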
Once is obtained, can simply pick , such that each synthetic trace has visits. However, it is well known that human mobility activity follows a heavy-tailed distribution, i.e., a heavier tail than the exponential distribution. The best approximations are lognormal, beta, or power-law distributions (Farzanehfar et al.,, 2021; Schneider et al.,, 2013; Seshadri et al.,, 2008).
It would be reasonable for to use a heavy-tailed distribution with mean , but these distributions require a second parameter, e.g. skewness, to determine the distribution shape. can use well-known parameters from other cities’ datasets to complete the estimate (Schneider et al.,, 2013; Seshadri et al.,, 2008), but to ensure that does not use additional knowledge, we assume that they use the sub-optimal estimate , as shown in Figure 6.
4.4. Paired Sampling for Training
When MIAs target high-dimensional aggregate data, such as location data, the membership classifier must handle noise arising from thousands of entries, which are unrelated to the target record. For example, many of the IN training aggregates may coincidentally have high counts in entries that are absent in the target record. This would skew the decision boundary of the membership classifier, which may lead to false positives when testing. We may similarly obtain false negatives due to spurious patterns within the OUT training aggregates. These challenges are compounded by the implementation of privacy measures. For example, DP would add noise to each entry. Given the dimensionality and nature of the aggregate data, spurious patterns will likely skew the decision boundary, even when hundreds or thousands of training aggregates are sampled. Given a fixed number of training samples, we demonstrate that the way in which the training set is sampled strongly influences the performance of the MIA. In particular, the sampling technique can guide the convergence of the decision boundary in order to prevent misclassification due to noise.
To the best of our knowledge, all previous MIAs on aggregate location data sampled their training set via independent random sampling (Pyrgelis et al.,, 2017; Oehmichen et al.,, 2019; Zhang et al.,, 2020; Pyrgelis et al.,, 2020). Training aggregates are created by independently sampling groups of users from the population and labeling them according to the target ’s presence.
On the one hand, independent sampling discourages overfitting to the training data by exposing the classifier to a wide variation of samples. On the other hand, independent sampling does nothing to prevent spurious patterns from distorting the decision boundary.
We propose a paired sampling technique to guide the convergence of the decision boundary. The idea is to use sampling to help the classifier identify the differential impact of the target record at the aggregate level. Paired sampling independently samples groups of users from . Then, an IN sample is created by adding as the group’s last member, and an OUT sample is created by adding another randomly selected user. The training set is therefore characterized by a set of IN/OUT pairs, which differ in exactly one record (the target’s). If noise is added to aggregates prior to release, then must inject the same noise sample to each paired sample, and , to ensure that the target’s differential impact is preserved between each IN/OUT pair.
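A minimal sketch of paired sampling is given below. The trace representation (a list of `(roi, epoch)` visits), the function names, and the `noise_fn` hook are assumptions made for the example; the reference list is assumed not to contain the target.

```python
import random


def aggregate(traces, n_rois, n_epochs):
    """Count visits per (ROI, epoch) entry over a group of traces."""
    counts = [[0] * n_epochs for _ in range(n_rois)]
    for trace in traces:
        for roi, epoch in trace:
            counts[roi][epoch] += 1
    return counts


def paired_samples(reference, target, group_size, n_pairs,
                   n_rois, n_epochs, noise_fn=None, seed=0):
    """Build IN/OUT training pairs that differ in exactly one record."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        members = rng.sample(reference, group_size)
        shared, substitute = members[:-1], members[-1]
        # IN: shared users plus the target as the last member.
        agg_in = aggregate(shared + [target], n_rois, n_epochs)
        # OUT: same shared users plus another randomly selected user.
        agg_out = aggregate(shared + [substitute], n_rois, n_epochs)
        if noise_fn is not None:
            # One noise sample per entry, reused for both aggregates so the
            # target's differential impact survives the noise.
            noise = [[noise_fn(rng) for _ in range(n_epochs)]
                     for _ in range(n_rois)]
            agg_in = [[c + z for c, z in zip(row, zrow)]
                      for row, zrow in zip(agg_in, noise)]
            agg_out = [[c + z for c, z in zip(row, zrow)]
                       for row, zrow in zip(agg_out, noise)]
        pairs.append((agg_in, agg_out))
    return pairs
```

Independent sampling would instead draw a fresh group (and fresh noise) for every training aggregate, so two aggregates with different labels could differ in every entry rather than only in those touched by one record.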
![Refer to caption](x7.png)
![Refer to caption](x8.png)
Paired sampling therefore actively encourages the membership decision boundary to be formed based on relevant criteria related to the target. It also discourages spurious decision boundaries because of the high degree of similarity between IN/OUT pairs. An extreme value in an aggregate entry from an IN sample will be matched with a similar value from its paired OUT sample, with high probability. However, we note that using paired sampling effectively halves the training variation compared to independent sampling. Our experiments in Section 6.4 demonstrate that paired sampling outperforms independent sampling across all tested settings of -DP noise addition. Hence, guiding the decision boundary towards relevant membership criteria often takes precedence over maximizing training variation. For completeness, we note that we developed, studied, and named paired sampling independently. We later found that a similar idea was used in Bauer and Bindschaedler, (2020); to the best of our knowledge, however, it had not been compared to independent sampling or used elsewhere in the literature.
5. Experimental Setup
To evaluate the efficacy of our ZK MIA, we compare it against the state-of-the-art Knock-Knock (KK) MIA (Pyrgelis et al.,, 2017, 2020) using aggregated location data from two different datasets.
5.1. Datasets
In this section, we describe the two location datasets used for evaluating the MIAs, and discuss ethical considerations of the data collection and usage.
CDR: The first dataset, which we refer to as "CDR", is a private dataset, shared with us by Flowminder (flo,, 2024) for the purpose of this research. The raw dataset comprises timestamped and geo-tagged call records of approximately mobile phone users within a Latin American metropolitan area. The observation period is June 2021, with epochs defined by the hourly timeslots. The ROIs are defined by the service regions of approximately cellular antenna towers within the metropolitan area, which spans . The users were selected such that they registered at least one visit per week, to omit users who changed SIM cards, and such that the majority of their visits are within the region, to ensure that they are residents. target users for the MIAs were randomly selected by Flowminder. A histogram of the number of visits over the target traces is plotted in Figure 14(a) in the Appendix.
Milan: The second dataset is the Milan Social Pulse dataset (SpazioDati and di Milano,, 2015), made publicly available as part of the Telecom Italia Big Data Challenge (Barlacchi et al.,, 2015). This dataset comprises timestamped and geo-tagged tweets from mobile phone users within the Milano region. The ROIs are defined by a grid of points, each with an approximate area of 256 . We consider the location data from the first week of data, yielding hourly epochs. We do not delete any users from the dataset prior to aggregation. We randomly select targets among users who tweeted at least times during the observation period. A histogram of the number of visits over the target traces is plotted in Figure 14(b) in the Appendix.
Ethical Considerations: Because of the sensitivity of location data, we did not access raw individual-level data, and instead collaborated with Flowminder (FM) to develop a privacy-preserving data-sharing pipeline for the purpose of this research (de Montjoye et al.,, 2018). More specifically, data sharing was restricted to pre-computed aggregate matrices (labeled according to target membership, computed by FM on the data provider server) and target traces randomly chosen by FM. To further mitigate the privacy risk, the ROI and epoch indices were randomly permuted in the shared aggregate and target trace matrices, according to a mapping known only by FM. This random permutation relabeled the space and time indices, enabling us to test the MIAs without knowing the true times or locations. The graphs of the marginal statistics (see Figures 4, 5, 6) were plotted by FM and shared with us. All the data shared by FM with us is subject to a research contract between FM and our institution and was kept on our segregated server. The Milan dataset, derived from geo-tagged tweets, remains publicly available, and was only used for the purpose of testing the MIAs.
5.2. MIA Implementation
We perform a fair comparison between KK MIA and ZK MIA by training the binary membership classifier using the same parameters and architecture. This also helps us isolate the effect of removing auxiliary data on performance. We use a Logistic Regression binary classifier with default hyperparameters and regularization, implemented with . The number of training groups matches previous implementations (Pyrgelis et al.,, 2017, 2020), and the groups are selected using paired sampling, unless specified otherwise. We additionally fine tune the decision boundary using balanced independently sampled validation groups. We flatten the aggregates and feed them directly into the classifier as a vector, without any processing, such as PCA or feature extraction. Finally, as done in Pyrgelis et al., (2017), applies the same privacy measures to training and validation aggregates, if the released aggregate is privacy-aware.
Knock-Knock. To implement the Knock-Knock MIA, we provide with a reference group of real user traces (including the target trace ) when attacking the larger CDR dataset. We set for the Milan dataset. We note that this significantly surpasses previous reference sizes () implemented by Pyrgelis et al., (2017).
![Refer to caption](x9.png)
![Refer to caption](x10.png)
![Refer to caption](x11.png)
![Refer to caption](x12.png)
Zero Auxiliary Knowledge. Our Zero Auxiliary Knowledge MIA is structurally identical to KK (PS), but ZK has a synthetic reference , rather than a set of real traces. This reference is created according to the methodology detailed in Section 4. Setting allows for a direct comparison of the functionality of synthetic traces to real ones for the purpose of the MIA. However, we remark that capping the number of synthetic traces at is an artificial restriction, since may generate arbitrarily more synthetic traces and achieve better performance, as shown in Figure 12 of the Appendix. By default, we assume full access to the target trace . This assumption is relaxed for the experiment in Section 6.3.
5.3. Evaluation
Default Experimental Parameters. We randomly select targets from the dataset for evaluation. These targets are re-used for each experiment. Furthermore, in each experiment, independently sampled balanced test aggregates are created for each target, and shared across both MIAs to ensure that the test sets are identical. As done in (Pyrgelis et al.,, 2017), the test aggregate user groups are sampled from a disjoint set of users to the Knock-Knock adversarial reference , plus the target . This corresponds to roughly user traces for CDR and user traces for Milan.
We perform all experiments on size aggregates, which matches the largest aggregate size tested in Pyrgelis et al., (2017) and exceeds the largest aggregate size tested in Zhang et al., (2020) (800). We do not vary since the relationship between aggregate group size and MIA effectiveness has already been documented extensively (Pyrgelis et al.,, 2017, 2020; Zhang et al.,, 2020). We perform the Knock-Knock and Zero Auxiliary Knowledge MIAs in this setting under different privacy measures. We also perform experiments such that only knows a fraction of the target trace , but by default, we assume that they know the full trace, i.e. .
Evaluation Metrics. In the past, MIAs on aggregate location data have been primarily evaluated using the area under the ROC curve (AUC score) as a metric (Pyrgelis et al.,, 2017; Zhang et al.,, 2020). For this reason, and its suitability for assessing the strength of a binary classifier, we use the mean AUC over all targets as our primary metric. However, we also include the mean attack accuracy over all targets as a secondary metric, listing these scores in Section C of the Appendix.
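For concreteness, the primary metric can be sketched as follows. The function names and the input layout (per target, one list of classifier scores on IN test aggregates and one on OUT test aggregates) are assumptions for the example. The AUC is computed through its pairwise-ranking interpretation, with ties counted as half.

```python
def auc(scores_in, scores_out):
    """Probability that a random IN aggregate receives a higher score
    than a random OUT aggregate (ties count as one half)."""
    wins = 0.0
    for s_in in scores_in:
        for s_out in scores_out:
            if s_in > s_out:
                wins += 1.0
            elif s_in == s_out:
                wins += 0.5
    return wins / (len(scores_in) * len(scores_out))


def mean_auc(per_target_scores):
    """Average the per-target AUC over all targets."""
    values = [auc(s_in, s_out) for s_in, s_out in per_target_scores]
    return sum(values) / len(values)


# Usage with two hypothetical targets.
scores = [
    ([0.9, 0.8], [0.1, 0.85]),   # AUC 0.75
    ([0.7, 0.6], [0.2, 0.3]),    # AUC 1.0
]
print(mean_auc(scores))  # 0.875
```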
6. Experimental Results
6.1. Against Suppression of Small Counts
We first compare the performances of KK and ZK on aggregates whose counts have been suppressed according to threshold . We apply SSC with thresholds on test aggregates of size . We note that the case corresponds to releasing a raw aggregate. We remark that there is a trivial rule that sufficiently determines non-membership in this special case.
Rule (): If visits ROI during epoch and no users in the aggregation group visit , then cannot be in , i.e.,
![Refer to caption](x13.png)
![Refer to caption](x14.png)
We therefore incorporate this rule when , such that both MIAs first check if the released aggregate elicits the contradiction. If so, we immediately predict OUT. Otherwise, we train, validate, and test the classifier as usual. The rule is invalid for , since it would predict OUT whenever has a visit to a suppressed entry.
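The rule and its interaction with suppression can be sketched as below. The function names, the `(roi, epoch)` visit representation, and the exact suppression semantics (counts strictly below the threshold are zeroed) are assumptions for illustration.

```python
def suppress_small_counts(aggregate, threshold):
    """Zero every count below `threshold` (assumed SSC semantics)."""
    return [[c if c >= threshold else 0 for c in row] for row in aggregate]


def contradicts_raw_aggregate(target_visits, aggregate):
    """True if the target visited an entry that nobody in the group visited."""
    return any(aggregate[roi][epoch] == 0 for roi, epoch in target_visits)


def predict_membership(target_visits, aggregate, classifier, raw_release):
    """Apply the rule first on raw releases, then fall back to the classifier."""
    if raw_release and contradicts_raw_aggregate(target_visits, aggregate):
        return "OUT"
    return classifier(aggregate)


# Usage: the target is a member, with a single visit to entry (0, 1).
raw = [[3, 1],
       [0, 2]]
target = [(0, 1)]
print(contradicts_raw_aggregate(target, raw))         # False: rule is safe
suppressed = suppress_small_counts(raw, threshold=2)
print(contradicts_raw_aggregate(target, suppressed))  # True: rule misfires
```

The second call illustrates why the rule is only used on raw releases: after suppression, a zero entry no longer proves that nobody in the group visited it.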
Results. Figure 7 shows that membership inference is a trivial task when applied to raw aggregates (). Both ZK and KK achieve near perfect AUC () on both datasets. This implies that aggregation is not an effective privacy mechanism in itself to protect high-dimensional location data from MIAs by weak or strong adversaries.
The results for reveal two general patterns. First, ZK compares closely with KK across different levels of SSC. On the CDR dataset, ZK’s AUC stays within of KK for each . We observe slightly worse results on the Milan dataset, but ZK still stays within AUC of KK for each . Second, there is a monotonic decrease in performance when the threshold is increased. For , the AUC is always less than . This follows from the fact that suppression reduces the amount of available information, and of all nonzero entries are suppressed by , as shown in Figure 15 of the Appendix. Therefore, although SSC eventually mitigates the MIAs, it may come at the cost of destroying virtually all utility.
Both MIAs perform worse on the Milan dataset compared to the CDR dataset. This is expected because we observe times less data per target in the Milan dataset, as shown in Figure 14. Indeed, people generally tweet less often than they text and call. ZK MIA is more affected by dataset sparsity, given its dependence on marginal distribution estimates, which become less reliable in sparser datasets.
6.2. Against -DP Noise Addition
Informed by practical applications of DP (Desfontaines,, 2021), we consider event level and user-day level to be the privacy units, and we vary the privacy budget for each unit.
Event Level DP. An event is equivalent to a visit by a user to . To offer privacy protection over an event, we set the global sensitivity . -DP is then ensured by adding noise to each count in the aggregate matrix.
User-day Level DP. In order to protect each user’s daily contributions without adding excessive noise, it is common to restrict user contributions prior to aggregation to achieve a smaller global sensitivity (Aktay et al.,, 2020; Herdağdelen and Dow,, 2021). We analysed daily activity distributions, and had them preprocessed such that a user may only contribute up to visits in any given day for CDR, and visits in any given day for Milan. -DP at the user-day level is then ensured by adding noise to each count in the aggregate matrix.
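The two privacy units can be sketched as follows. Several points are assumptions made for the example: the noise is taken to be Laplace with scale equal to sensitivity divided by the budget, the event-level sensitivity is taken to be 1, the user-day sensitivity is taken to be the daily cap, and the capping step keeps the earliest visits of each day (how excess visits are chosen is not specified here). The 24 epochs per day follow from the hourly epochs.

```python
import random


def laplace(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def add_laplace_noise(aggregate, sensitivity, epsilon, rng):
    """Add independent Laplace(sensitivity / epsilon) noise to each count."""
    scale = sensitivity / epsilon
    return [[c + laplace(scale, rng) for c in row] for row in aggregate]


def cap_daily_visits(visits, max_per_day, epochs_per_day=24):
    """Keep at most `max_per_day` visits per day (earliest first, assumed)."""
    kept, per_day = [], {}
    for roi, epoch in sorted(visits, key=lambda v: v[1]):
        day = epoch // epochs_per_day
        if per_day.get(day, 0) < max_per_day:
            kept.append((roi, epoch))
            per_day[day] = per_day.get(day, 0) + 1
    return kept


rng = random.Random(0)
counts = [[5, 0], [2, 7]]

# Event level: one visit changes one count by 1 (assumed sensitivity 1).
event_level = add_laplace_noise(counts, sensitivity=1, epsilon=1.0, rng=rng)

# User-day level: cap contributions first, then scale noise by the cap.
trace = [(0, 0), (1, 1), (2, 2), (0, 3), (1, 4), (3, 30)]
capped = cap_daily_visits(trace, max_per_day=2)  # [(0, 0), (1, 1), (3, 30)]
user_day = add_laplace_noise(counts, sensitivity=2, epsilon=1.0, rng=rng)
```

For the same budget, the user-day setting adds noise that is larger by a factor equal to the cap, which is consistent with the stronger mitigation reported for that setting.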
Results. Figure 8 shows that ZK MIA matches KK MIA across all tested DP settings. Indeed, ZK maintained a mean AUC within of KK (PS) across each of the privacy settings for both datasets. KK and ZK notably succeeded for many of the tested privacy budgets , particularly in the event level setting. Indeed, we observed for both MIAs whenever the noise scale for the CDR dataset, and for the Milan dataset. These settings are in line with many real-life applications (Kohli et al.,, 2023; Desfontaines,, 2021). Conversely, user-day level DP with privacy budget effectively reduced both MIAs to an AUC below . We discuss the significance of these results with respect to practical mitigations in Section 7.
6.3. Partial Knowledge of the Target Trace
We now relax the assumption that knows the full target trace . This is in line with our ZK threat model, and we expand KK MIA for this setting to be able to compare methods. To simulate a weaker adversary, we suppose that only knows a subset of the target ’s visits. We assume that only knows a random fraction of the trace . The number of retained visits is rounded up to the next integer to prevent cases where knows visits. For example, if a target has visits and , then this would correspond to knowing visit. This partial trace is used instead of the full trace when creating IN training and validation aggregates. The full trace is still used for IN test aggregates.
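The partial-trace construction can be sketched as below. The function name and the list-of-visits representation are assumptions for the example, and the trace is assumed to be non-empty. The small rounding step guards against floating-point artifacts (for example, `0.1 * 30` evaluating slightly above 3) before rounding up.

```python
import math
import random


def partial_trace(visits, fraction, rng):
    """Keep a random fraction of the target's visits, rounded up so that
    at least one visit is always known."""
    exact = round(fraction * len(visits), 9)  # absorb float error
    n_known = min(len(visits), max(1, math.ceil(exact)))
    return rng.sample(visits, n_known)


rng = random.Random(0)
trace = [(roi, epoch) for roi in range(3) for epoch in range(10)]  # 30 visits
known = partial_trace(trace, 0.1, rng)       # 3 visits
single = partial_trace(trace[:7], 0.1, rng)  # ceil(0.7) = 1 visit
```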
We perform this experiment in the setting where the data collector applies event-level DP with , followed by suppression. We choose this setting for two reasons. First, to study the degradation of the MIAs with decreasing information about the target, we choose a setting where would succeed given the full target trace. Previous experiments revealed that -DP at the event level and suppression were not effective in preventing the MIAs by themselves, as the MIAs achieved AUC on the CDR dataset and AUC for Milan. Second, we combine the two defense mechanisms to see if suppression has an observable mitigation effect when applied following DP noise addition. By the post-processing property of DP, this would not alter the theoretical performance bound. However, zeroing all counts might add a layer of complexity that affects MIAs in practice.
Results. First, we note that applying SSC on top of event level DP has an insignificant effect on the MIAs. For the full target trace, we continue to observe AUC on the CDR dataset and AUC on the Milan dataset. The only MIA with a noticeable decline was KK MIA on the Milan dataset, which dropped from AUC to AUC.
Although decreasing the fraction of the target trace known to the adversary from to decreases the performance of the MIAs, the corresponding degradation is relatively gradual. All AUCs are captured within a range of on the CDR dataset, and within a range of on the Milan dataset. Even the lowest observed AUC by ZK MIA on the CDR dataset ( when of the target trace is known) achieves high discrimination. We note that the targets in both datasets have a wide variation in trace size, as shown in Figure 14 of the Appendix. For some targets, will only know one of the target’s visits, whereas for others, they will still know dozens and be able to infer membership easily. Interestingly, we note that knowing a single visit from a target trace can still train a classifier that is better than random. For one CDR target with visits in their full trace, we observed ZK achieve an AUC of across aggregates when only random visit was known. Although far from perfect, we found this surprising, as it shows that even a single visit by the target can inform an MIA against a noisy aggregate over users.
![Refer to caption](x15.png)
![Refer to caption](x16.png)
6.4. Paired Sampling vs. Independent Sampling
In this section, we study the performance of KK MIA and ZK MIA when we vary the sampling technique used for creating their training aggregates, i.e., paired sampling (PS) or independent sampling (IS). To test this, we consider the implementation of user-day on the Milan dataset across the privacy budgets for all four possible MIAs: KK (PS), KK (IS), ZK (PS), ZK (IS).
Results. From Figure 10, we see that the paired sampling MIA always outperforms its independent sampling equivalent across all privacy budgets . Paired sampling provides the largest boost when the inference task is challenging but not intractable. In particular, we notice a few striking examples of ZK (PS) drastically outperforming ZK (IS) in the middle of the graph. For , ZK (IS) is basically random (), yet simply switching to paired sampling enables the classifier to achieve an AUC of .
The improvement achieved by switching from independent sampling to paired sampling is less significant for KK in this experiment. This suggests that using training aggregates sampled from a reference dataset of real traces may introduce less randomness to the membership classifier’s decision boundary, compared to when we use a synthetic reference. This is intuitive because of ZK’s probabilistic generation method, which relies on sampling from three different estimated distributions. However, the MIAs have indistinguishable performance when both attacks use paired sampling, with the difference in AUC always staying within of one another across the privacy settings. This suggests that paired sampling effectively eliminates the noise contributed by coincidental patterns in random entries, and enables the membership classifier to form a suitable decision boundary.
7. Discussion
We first provide a critical analysis of the experimental results, followed by a discussion of mitigation strategies and their practicality. We then consider limitations in our methods and evaluations, before discussing how our methodology may be generalized to MIAs beyond the setting of aggregate location data.
7.1. Analysis of Results: Practical Risk of MIA
In Section 6, we observed that our ZK MIA achieves approximately the same performance as the KK MIA across all experiments, and that both MIAs performed effectively across a range of common privacy settings. This has several important implications.
First, the ZK MIA significantly increases the attack surface of aggregate location data, since no auxiliary dataset is needed for the MIA. Although previous MIAs on aggregate location data have been successful, the strong assumption of a large auxiliary dataset prevents these attackers from attempting the MIA in most real-life cases. The auxiliary dataset comprises sensitive user-level information collected from the same dataset that is being aggregated. However, aggregation is applied to prevent the release of personal information. Moreover, would be restricted by the size of their auxiliary dataset, since they would not be able to perform MIAs on aggregates computed over more users than there are in their reference. In contrast, we demonstrated that our Zero Auxiliary Knowledge can create an arbitrary number of synthetic traces upon seeing the released aggregate, without an auxiliary dataset. This offers the flexibility to attack aggregates of any size. Section D.1 in the Appendix shows results for KK and ZK MIA across aggregates of size . Figure 12 of the Appendix also shows that the ZK can boost their own performance up to diminishing marginal returns, simply by generating more traces.
We have also shown that MIAs on aggregate location data are more powerful than previously known. By incorporating paired sampling for training, we have demonstrated more effective MIA results on size aggregates than previously reported (Pyrgelis et al.,, 2020), particularly when protected by differential privacy. Our results therefore demonstrate that MIAs on aggregate location data are easily performed without auxiliary data, more effective than previously believed, and that common privacy measures fail to protect against the risk.
7.2. Proposed Mitigations
Our results show that aggregated location data requires more stringent privacy safeguards to protect against MIAs. This is an inherently challenging task because our ZK MIA was able to succeed by using the aggregate to estimate where the population moves (space marginal), when the population moves (time marginal), and how frequently the population moves (activity marginal). However, location aggregates naturally leak this information. In fact, much of its utility is derived from these marginal statistics. Therefore, while one can mitigate the ZK MIA by perturbing the aggregate to the point where the basic mobility patterns of its population are unrecoverable, doing so may also destroy the aggregates’ utility.
In light of these results, we advise that data practitioners be mindful of the parameters that they select for DP, because DP does not guarantee sufficient protection from MIAs if the parameters are chosen too loosely. Since data practitioners often prioritize utility, it is common to pick more relaxed parameters for the privacy unit (e.g. event or user-day instead of user) and budget (e.g. ). For example, Kohli et al., (2023) studied at the event level in the context of aggregate O-D mobility matrices, Facebook used at the event level when collecting data about URLs shared on the site, and Apple uses between and at the user-day level when collecting iOS data (Desfontaines,, 2021). Recall that ZK and KK achieved on the CDR dataset whenever the noise scale was or less. This corresponds to for event level DP and for user-day level DP with up to daily visits. ZK and KK therefore both achieved high discrimination on privacy settings that are in line with many real-life applications.
However, we do observe DP mitigating the MIAs when we pick sufficiently strict parameters. For example, no MIA achieved better than random performance for in the user-day setting. We also note that we did not evaluate using the user level setting, which would achieve the strongest privacy protection. We note that the suitability of privacy parameters depends on the desired utility and sensitivity of the dataset. A stricter parameter choice is particularly relevant if the aggregate is publicly released and/or pertains to sensitive data.
Although -DP always offers privacy guarantees, our experimental results emphasize the importance of picking appropriate parameters. In particular, we observed that event-level DP was largely ineffective in preventing MIAs from both strong and weak adversaries. We instead encourage the use of user-day or user-level DP with carefully selected privacy budgets to mitigate the practical threat of an MIA.
7.3. Limitations
We have so far taken the Knock-Knock MIA to refer to the Subset of Locations setting (Pyrgelis et al.,, 2017). We now address why we do not consider the Knock-Knock Participation in Past Groups (Pyrgelis et al.,, 2017) threat model in this paper. Under the Participation in Past Groups setting, the adversary has access to a set of past aggregates , collected over the same ROIs as the released aggregate . Moreover, is assumed to know the membership status of the target in each of these aggregates. That is, knows whether or not for all . This last assumption is crucial because directly uses as their training data for the membership classifier in this setting. This is unrealistic for multiple reasons. First, to train an effective membership classifier, there would need to be hundreds of labeled aggregates to have sufficient training data. More importantly, there would be no reason for the membership status of an individual within an aggregate to be released in practice. We argue that the only plausible scenario in which would know the membership status of each aggregate is if they created the aggregates themselves. This reduces to the Subset of Locations setting that we have assumed in this paper.
In terms of limitations for our ZK MIA, recall that the Delaunay triangulation of the ROIs, , is the only non-probabilistic parameter used to generate synthetic traces for the ZK MIA. The triangulation only depends on the locations of the ROIs, which we have so far assumed to be shared as part of the aggregate release (Section 2.5). We believe this to be a realistic assumption, as omitting the locations of the ROIs would strongly diminish the utility of aggregate location data. Nonetheless, there might exist cases where released location aggregates do not relay the positions of the ROIs. For example, Google binned ROIs into categories, e.g. restaurants, parks, and hospitals, when publicly releasing their mobility report during COVID (https://www.google.com/covid19/mobility/). In this setting, the adversary would proceed without knowing where the ROIs are situated with respect to one another. The privacy risk under this setting is not known, and we identify it as an area of future research. Similarly, there might exist cases where the adversary knows the ROIs that were visited by the target (e.g. home and work), but not the visitation times. We show in Appendix D.3 that only knowing the visited ROIs substantially reduces the effectiveness of both MIAs.
ZK MIA also requires that we estimate statistical parameters from the released aggregate. It may be difficult to estimate these precisely if the aggregate size is small or if the collected location data is not regular. However, we still observe strong performance by ZK MIA on both datasets for small aggregate sizes (see Appendix D.1).
Furthermore, aggregate location data collected over large metropolitan populations are known to obey high regularity across different cities and time periods. These patterns include log-normal activity distributions (Schneider et al.,, 2013; Seshadri et al.,, 2008; Farzanehfar et al.,, 2021) and periodic "circadian rhythm" time marginals (Csáji et al.,, 2013; Song et al.,, 2010; Seshadri et al.,, 2008; Farzanehfar et al.,, 2021). This suggests that our statistical parameter estimation should be highly transferable across sufficiently regular datasets. However, we acknowledge that there are scenarios where the observed population is not regular (e.g. taxi drivers).
7.4. Generalizations to MIAs in other Settings
In this paper, we have proposed a new methodology to perform membership inference attacks on aggregate data, by training the attack on synthetic records generated from the released aggregate. We believe that this approach can be adapted for MIAs in settings beyond aggregate location data. Our methodology can be broken down into two main steps: 1) extracting noiseless global statistics from the released aggregate, and 2) using these statistics to create individual-level records to train the MIA.
In the setting of location data, the relevant statistics pertain to the mobility trends of large-scale human populations (Schneider et al.,, 2013; Seshadri et al.,, 2008; Farzanehfar et al.,, 2021; Csáji et al.,, 2013; Song et al.,, 2010) and to individual location (Kulkarni and Garbinato,, 2017; Kulkarni et al.,, 2018; Ouyang et al.,, 2018; Karagiannis et al.,, 2007; Lee et al.,, 2009), both of which have been well established in the literature. This facilitates both steps of our methodology, as we know in advance what location data should look like at both the global and individual level.
Although the trends will be distinct from aggregate location data, aggregate releases for other types of data will generally reveal global statistics. For instance, categorical tabular data is modeled by discrete random variables, whereas location data is modeled by continuous random variables, and approximated by high-dimensional discrete data. Our methods for denoising and debiasing statistics from differentially private and suppressed aggregates are however not specific to location data, and should generalize to other data releases. Regarding the second step, using the statistics to create individual records for training, the probabilistic method used for our ZK MIA, drawing from the Delaunay triangulation and the relevant marginal distributions, is partially specific to location data. One would thus need to carefully consider the statistical properties of the type of data to create high quality individual records.
8. Conclusion
Aggregate location data is widely shared and used by governments (Savage,, 2021; Hope,, 2021; Oli,, 2021), companies (Apple,, 2017; Aktay et al.,, 2020; Herdağdelen and Dow,, 2021), and researchers (Trasberg and Cheshire,, 2023; Jeffrey et al.,, 2020; Kohli et al.,, 2023) because of its insights into human behaviour and its presumed security against reidentification.
In this paper, we demonstrated that aggregate location data is susceptible to MIAs by realistic adversaries, who only know some of their target’s location history. With ZK MIA, we introduced the first MIA on aggregate location data that does not require an auxiliary dataset. We accomplished this by generating appropriate synthetic traces, using statistics that are estimated from the released aggregate. We also equipped our parameter estimation with techniques that automatically correct for bias and noise from popular privacy mechanisms like suppression of small counts and ε-DP noise.
We then showed that MIAs on aggregate location data are significantly improved by incorporating a paired sampling technique, which helps isolate the effect of the target trace within a high dimensional aggregate. Hence, the vulnerability of aggregate location data is further heightened by these improved attacks.
Our evaluations over two large datasets demonstrate that, despite the absence of an auxiliary dataset, ZK MIA performs as well as the state-of-the-art KK MIA, with both MIAs achieving high discrimination when commonly used privacy settings are applied. ZK MIA remains effective in realistic privacy settings, even when only a small fraction () of the target trace is known. These results emphasize the need for strict differential privacy guarantees on released aggregate location data.
Taken together, our findings show that membership inference attacks are not merely a theoretical privacy threat posed by unrealistically strong adversaries, but also a realistic threat to contend with in practice.
Acknowledgements.
Ana-Maria Cretu did most of her work while she was at Imperial College London and was partially funded by the Agence Française de Développement via the Flowminder Foundation. The authors would like to sincerely thank the Flowminder team for their support with this work, in particular Galina Veres and James Harrison for their help in designing a secure data-sharing pipeline to test the MIAs on the CDR dataset. The authors would like to further acknowledge Cyril Miras for his early work on MIAs on aggregate location data, including the implementation of the baseline rule MIA. Finally, the authors would like to thank the anonymous reviewers and shepherd for their feedback on the paper.
References
- wp2, (2014) (2014). Article 29 data protection working party. opinion 05/2014 on anonymisation techniques. https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf.
- flo, (2024) (2024). Flowminder website. https://www.flowminder.org/.
- Aktay et al., (2020) Aktay, A., Bavadekar, S., Cossoul, G., Davis, J., Desfontaines, D., Fabrikant, A., Gabrilovich, E., Gadepalli, K., Gipson, B., Guevara, M., et al. (2020). Google covid-19 community mobility reports: anonymization process description (version 1.1). arXiv preprint arXiv:2004.04145.
- Apple, (2017) Apple, D. (2017). Learning with privacy at scale. Apple Machine Learning Journal, 1(8).
- Barlacchi et al., (2015) Barlacchi, G., De Nadai, M., Larcher, R., Casella, A., Chitic, C., Torrisi, G., Antonelli, F., Vespignani, A., Pentland, A., and Lepri, B. (2015). A multi-source dataset of urban life in the city of milan and the province of trentino. Scientific data, 2(1):1–15.
- Bauer and Bindschaedler, (2020) Bauer, L. A. and Bindschaedler, V. (2020). Towards realistic membership inferences: The case of survey data. In Annual Computer Security Applications Conference, pages 116–128.
- Bishop and Nasrabadi, (2006) Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern recognition and machine learning, volume 4. Springer.
- Boorstein and Kelly, (2023) Boorstein, M. and Kelly, H. (2023). Colorado catholic group bought app data that tracked gay priests. The Washington Post.
- Chen et al., (2009) Chen, B.-C., Kifer, D., LeFevre, K., Machanavajjhala, A., et al. (2009). Privacy-preserving data publishing. Foundations and Trends® in Databases, 2(1–2):1–167.
- Creţu et al., (2021) Creţu, A.-M., Guépin, F., and de Montjoye, Y.-A. (2021). Correlation inference attacks against machine learning models. arXiv preprint arXiv:2112.08806.
- Cretu et al., (2022) Cretu, A.-M., Houssiau, F., Cully, A., and de Montjoye, Y.-A. (2022). Querysnout: Automating the discovery of attribute inference attacks against query-based systems. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 623–637.
- Csáji et al., (2013) Csáji, B. C., Browet, A., Traag, V. A., Delvenne, J.-C., Huens, E., Van Dooren, P., Smoreda, Z., and Blondel, V. D. (2013). Exploring the mobility of mobile phone users. Physica A: statistical mechanics and its applications, 392(6):1459–1473.
- de Montjoye et al., (2018) de Montjoye, Y.-A., Gambs, S., Blondel, V., Canright, G., De Cordes, N., Deletaille, S., Engø-Monsen, K., Garcia-Herranz, M., Kendall, J., Kerry, C., et al. (2018). On the privacy-conscientious use of mobile phone data. Scientific data, 5(1):1–6.
- de Montjoye et al., (2013) de Montjoye, Y.-A., Hidalgo, C. A., Verleysen, M., and Blondel, V. D. (2013). Unique in the crowd: The privacy bounds of human mobility. Scientific reports, 3(1):1–5.
- Delaunay et al., (1934) Delaunay, B. et al. (1934). Sur la sphere vide. Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk, 7(793-800):1–2.
- Desfontaines, (2021) Desfontaines, D. (2021). A list of real-world uses of differential privacy.
- Dwork et al., (2019) Dwork, C., Kohli, N., and Mulligan, D. (2019). Differential privacy in practice: Expose your epsilons! Journal of Privacy and Confidentiality, 9(2).
- Dwork et al., (2006) Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer.
- Farzanehfar et al., (2021) Farzanehfar, A., Houssiau, F., and de Montjoye, Y.-A. (2021). The risk of re-identification remains high even in country-scale location datasets. Patterns, 2(3):100204.
- Gadotti et al., (2019) Gadotti, A., Houssiau, F., Rocher, L., Livshits, B., and De Montjoye, Y.-A. (2019). When the signal is in the noise: Exploiting diffix’s sticky noise. In 28th USENIX Security Symposium (USENIX Security 19), pages 1081–1098.
- Ge and Fukuda, (2016) Ge, Q. and Fukuda, D. (2016). Updating origin–destination matrices with aggregated data of gps traces. Transportation Research Part C: Emerging Technologies, 69:291–312.
- Georgiadou et al., (2019) Georgiadou, Y., de By, R. A., and Kounadi, O. (2019). Location privacy in the wake of the gdpr. ISPRS international journal of geo-information, 8(3):157.
- Grantz et al., (2020) Grantz, K. H., Meredith, H. R., Cummings, D. A., Metcalf, C. J. E., Grenfell, B. T., Giles, J. R., Mehta, S., Solomon, S., Labrique, A., Kishore, N., et al. (2020). The use of mobile phone data to inform analysis of covid-19 pandemic epidemiology. Nature communications, 11(1):4961.
- Guépin et al., (2023) Guépin, F., Meeus, M., Cretu, A.-M., and de Montjoye, Y.-A. (2023). Synthetic is all you need: removing the auxiliary data assumption for membership inference attacks against synthetic data. arXiv preprint arXiv:2307.01701.
- Hara and Yamaguchi, (2021) Hara, Y. and Yamaguchi, H. (2021). Japanese travel behavior trends and change under covid-19 state-of-emergency declaration: Nationwide observation by mobile phone location data. Transportation Research Interdisciplinary Perspectives, 9:100288.
- Herdağdelen and Dow, (2021) Herdağdelen, A. and Dow, A. (2021). Protecting privacy in facebook mobility data during the covid-19 response (2020). https://research.fb.com/blog/2020/06/protecting-privacy-in-facebook-mobility-data-during-the-covid-19-response.
- Holmes et al., (2013) Holmes, A., Byrne, A., and Rowley, J. (2013). Mobile shopping behaviour: insights into attitudes, shopping process involvement and location. International Journal of Retail & Distribution Management, 42(1):25–39.
- Homer et al., (2008) Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W., Muehling, J., Pearson, J. V., Stephan, D. A., Nelson, S. F., and Craig, D. W. (2008). Resolving individuals contributing trace amounts of dna to highly complex mixtures using high-density snp genotyping microarrays. PLoS genetics, 4(8):e1000167.
- Hope, (2021) Hope, C. (2021). Millions ’unwittingly tracked’ by phone after vaccination to see if movements changed.
- Houssiau et al., (2022) Houssiau, F., Jordon, J., Cohen, S. N., Daniel, O., Elliott, A., Geddes, J., Mole, C., Rangel-Smith, C., and Szpruch, L. (2022). Tapas: A toolbox for adversarial privacy auditing of synthetic data. arXiv preprint arXiv:2211.06550.
- Humphries et al., (2023) Humphries, T., Oya, S., Tulloch, L., Rafuse, M., Goldberg, I., Hengartner, U., and Kerschbaum, F. (2023). Investigating membership inference attacks under data dependencies. In 2023 IEEE 36th Computer Security Foundations Symposium (CSF), pages 473–488. IEEE.
- Jagielski et al., (2020) Jagielski, M., Ullman, J., and Oprea, A. (2020). Auditing differentially private machine learning: How private is private sgd? Advances in Neural Information Processing Systems, 33:22205–22216.
- Jahromi et al., (2016) Jahromi, K. K., Zignani, M., Gaito, S., and Rossi, G. P. (2016). Simulating human mobility patterns in urban areas. Simulation Modelling Practice and Theory, 62:137–156.
- Jayaraman and Evans, (2019) Jayaraman, B. and Evans, D. (2019). Evaluating differentially private machine learning in practice. In 28th USENIX Security Symposium (USENIX Security 19), pages 1895–1912.
- Jeffrey et al., (2020) Jeffrey, B., Walters, C. E., Ainslie, K. E., Eales, O., Ciavarella, C., Bhatia, S., Hayes, S., Baguelin, M., Boonyasiri, A., Brazeau, N. F., et al. (2020). Anonymised and aggregated crowd level mobility data from mobile phones suggests that initial compliance with covid-19 social distancing interventions was high and geographically consistent across the uk. Wellcome Open Research, 5.
- Kakakhel, (2022) Kakakhel, S. (2022). Optimising urban planning with location intelligence. Quadrant Blog. Accessed: 2024-03-07.
- Karagiannis et al., (2007) Karagiannis, T., Le Boudec, J.-Y., and Vojnović, M. (2007). Power law and exponential decay of inter contact times between mobile devices. In Proceedings of the 13th annual ACM international conference on Mobile computing and networking, pages 183–194.
- Kohli et al., (2023) Kohli, N., Aiken, E., and Blumenstock, J. (2023). Privacy guarantees for personal mobility data in humanitarian response. arXiv preprint arXiv:2306.09471.
- Kulkarni and Garbinato, (2017) Kulkarni, V. and Garbinato, B. (2017). Generating synthetic mobility traffic using rnns. In Proceedings of the 1st Workshop on Artificial Intelligence and Deep Learning for Geographic Knowledge Discovery, pages 1–4.
- Kulkarni et al., (2018) Kulkarni, V., Tagasovska, N., Vatter, T., and Garbinato, B. (2018). Generative models for simulating mobility trajectories. arXiv preprint arXiv:1811.12801.
- Lee et al., (2009) Lee, K., Hong, S., Kim, S. J., Rhee, I., and Chong, S. (2009). Slaw: A new mobility model for human walks. In IEEE INFOCOM 2009, pages 855–863. IEEE.
- Li et al., (2013) Li, N., Qardaji, W., Su, D., Wu, Y., and Yang, W. (2013). Membership privacy: A unifying framework for privacy definitions. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 889–900.
- Martínez-Durive et al., (2023) Martínez-Durive, O. E., Mishra, S., Ziemlicki, C., Rubrichi, S., Smoreda, Z., and Fiore, M. (2023). The netmob23 dataset: A high-resolution multi-region service-level mobile data traffic cartography. arXiv preprint arXiv:2305.06933.
- Meeus et al., (2023) Meeus, M., Guepin, F., Creţu, A.-M., and de Montjoye, Y.-A. (2023). Achilles’ heels: vulnerable record identification in synthetic data publishing. In European Symposium on Research in Computer Security, pages 380–399. Springer.
- Miguel Alonso and Richard, (2004) Miguel Alonso, B. D. and Richard, G. (2004). Tempo and beat estimation of musical signals. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), Barcelona, Spain.
- Morgan and Lovelace, (2021) Morgan, M. and Lovelace, R. (2021). Travel flow aggregation: Nationally scalable methods for interactive and online visualisation of transport behaviour at the road network level. Environment and Planning B: Urban Analytics and City Science, 48(6):1684–1696.
- Müller, (2015) Müller, M. (2015). Logarithmic compression. https://www.audiolabs-erlangen.de/resources/MIR/FMP/C3/C3S1_LogCompression.html.
- Nasr et al., (2019) Nasr, M., Shokri, R., and Houmansadr, A. (2019). Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE symposium on security and privacy (SP), pages 739–753. IEEE.
- Nasr et al., (2021) Nasr, M., Song, S., Thakurta, A., Papernot, N., and Carlini, N. (2021). Adversary instantiation: Lower bounds for differentially private machine learning. In 2021 IEEE Symposium on security and privacy (SP), pages 866–882. IEEE.
- O2, (2019) O2 (2019). O2 transport smart steps product sheet. https://static-www.o2.co.uk/sites/default/files/2019-04/o2-transport-smart-steps-product-sheet.pdf. [Online].
- Oehmichen et al., (2019) Oehmichen, A., Jain, S., Gadotti, A., and de Montjoye, Y.-A. (2019). Opal: High performance platform for large-scale privacy-preserving location data analytics. In 2019 IEEE International Conference on Big Data (Big Data), pages 1332–1342. IEEE.
- Office of the Privacy Commissioner of Canada, (2023) Office of the Privacy Commissioner of Canada (2023). Investigation into the collection and use of de-identified mobility data in the course of the covid-19 pandemic. Accessed: 2023-09-14.
- Oli, (2021) Oli, S. (2021). Canada’s public health agency admits it tracked 33 million mobile devices during lockdown. National Post, 24.
- Ouyang et al., (2018) Ouyang, K., Shokri, R., Rosenblum, D. S., and Yang, W. (2018). A non-parametric generative model for human trajectories. In IJCAI, volume 18, pages 3812–3817.
- Popa et al., (2011) Popa, R. A., Blumberg, A. J., Balakrishnan, H., and Li, F. H. (2011). Privacy and accountability for location-based aggregate statistics. In Proceedings of the 18th ACM conference on Computer and communications security, pages 653–666.
- Precisely, (2024) Precisely (2024). Placeiq movement. https://www.precisely.com/product/precisely-placeiq/placeiq-movement. Accessed: 2024-02-13.
- Pyrgelis et al., (2017) Pyrgelis, A., Troncoso, C., and De Cristofaro, E. (2017). Knock knock, who’s there? membership inference on aggregate location data. arXiv preprint arXiv:1708.06145.
- Pyrgelis et al., (2020) Pyrgelis, A., Troncoso, C., and De Cristofaro, E. (2020). Measuring membership privacy on aggregate location time-series. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 4(2):1–28.
- SafeGraph, (2024) SafeGraph (2024). Enrich pois with aggregated transaction data. https://www.safegraph.com/products/spend. Accessed: 2024-02-13.
- Salem et al., (2018) Salem, A., Zhang, Y., Humbert, M., Berrang, P., Fritz, M., and Backes, M. (2018). Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models. arXiv preprint arXiv:1806.01246.
- Sankararaman et al., (2009) Sankararaman, S., Obozinski, G., Jordan, M. I., and Halperin, E. (2009). Genomic privacy and limits of individual detection in a pool. Nature genetics, 41(9):965–967.
- Savage, (2021) Savage, C. (2021). Intelligence analysts use u.s. smartphone location data without warrants, memo says. The New York Times. Available at https://www.nytimes.com/2021/01/22/us/politics/dia-surveillance-data.html.
- Schneider et al., (2013) Schneider, C. M., Belik, V., Couronné, T., Smoreda, Z., and González, M. C. (2013). Unravelling daily human mobility motifs. Journal of The Royal Society Interface, 10(84):20130246.
- Seshadri et al., (2008) Seshadri, M., Machiraju, S., Sridharan, A., Bolot, J., Faloutsos, C., and Leskovec, J. (2008). Mobile call graphs: beyond power-law and lognormal distributions. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 596–604.
- Shokri et al., (2017) Shokri, R., Stronati, M., Song, C., and Shmatikov, V. (2017). Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pages 3–18. IEEE.
- Song et al., (2010) Song, C., Qu, Z., Blumm, N., and Barabási, A.-L. (2010). Limits of predictability in human mobility. Science, 327(5968):1018–1021.
- SpazioDati and di Milano, (2015) SpazioDati and di Milano, D. P. (2015). Social Pulse - Milano.
- Stadler et al., (2022) Stadler, T., Oprisanu, B., and Troncoso, C. (2022). Synthetic data–anonymisation groundhog day. In 31st USENIX Security Symposium (USENIX Security 22), pages 1451–1468.
- Telus, (2024) Telus (2024). Telus Insights Location API. https://docs.insights.telus.com/. [Online].
- Tournier and de Montjoye, (2022) Tournier, A. J. and de Montjoye, Y.-A. (2022). Expanding the attack surface: Robust profiling attacks threaten the privacy of sparse behavioral data. Science Advances, 8(33):eabl6464.
- Trasberg and Cheshire, (2023) Trasberg, T. and Cheshire, J. (2023). Spatial and social disparities in the decline of activities during the covid-19 lockdown in greater london. Urban Studies, 60(8):1427–1447.
- Truex et al., (2019) Truex, S., Liu, L., Gursoy, M. E., Yu, L., and Wei, W. (2019). Demystifying membership inference attacks in machine learning as a service. IEEE Transactions on Services Computing, 14(6):2073–2089.
- Van Zoonen, (2016) Van Zoonen, L. (2016). Privacy concerns in smart cities. Government Information Quarterly, 33(3):472–480.
- Xu et al., (2015) Xu, Y., Shaw, S.-L., Zhao, Z., Yin, L., Fang, Z., and Li, Q. (2015). Understanding aggregate human mobility patterns using passive mobile phone location data: a home-based approach. Transportation, 42:625–646.
- Yabe et al., (2022) Yabe, T., Jones, N. K., Rao, P. S. C., Gonzalez, M. C., and Ukkusuri, S. V. (2022). Mobile phone location data for disasters: A review from natural hazards and epidemics. Computers, Environment and Urban Systems, 94:101777.
- Yeom et al., (2018) Yeom, S., Giacomelli, I., Fredrikson, M., and Jha, S. (2018). Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st computer security foundations symposium (CSF), pages 268–282. IEEE.
- Zang and Bolot, (2011) Zang, H. and Bolot, J. (2011). Anonymization of location data does not work: A large-scale measurement study. In Proceedings of the 17th annual international conference on Mobile computing and networking, pages 145–156.
- Zhang et al., (2020) Zhang, G., Zhang, A., and Zhao, P. (2020). Locmia: Membership inference attacks against aggregated location data. IEEE Internet of Things Journal, 7(12):11778–11788.
- Zhou, (2017) Zhou, T. (2017). Understanding location-based services users’ privacy concern: An elaboration likelihood model perspective. Internet Research, 27(3):506–519.
- Zhu et al., (2022) Zhu, K., Fioretto, F., and Van Hentenryck, P. (2022). Post-processing of differentially private data: A fairness perspective. arXiv preprint arXiv:2201.09425.
Appendix
Notation | Definition |
---|---|
Set of regions of interest (ROIs) | |
Set of epochs in observation period | |
Set of all users in the dataset | |
Location trace of user over | |
Aggregation group of users sampled from | |
Raw aggregate count matrix in over users in | |
An -DP aggregate | |
An aggregate with counts suppressed | |
An -DP aggregate with counts suppressed | |
The released aggregate count matrix | |
Number of users in the aggregation group | |
Target drawn from full population, | |
Adversary performing MIA on |
Default value | Definition |
---|---|
Number of training aggregates | |
Number of validation aggregates | |
Number of test aggregates | |
Number of targets | |
Aggregate size | |
(CDR), | Traces in ’s real (KK) |
(Milan) | or synthetic (ZK) reference |
Fraction of known by |
Appendix A Supplementary Proofs
Definition A.1.
(Oracle average count) Given a raw aggregate , we define the oracle average count function as
(11) |
Letting corresponds to extending the observation period indefinitely. Thus, represents the expected number of users who visit ROI at a randomly selected epoch, given infinite location data over the ROIs .
Definition A.2.
(Strong sparsity) We say that is strongly sparse if
(12) |
Equivalently, . This is a strong assumption, as it implies that the visitation rate to each ROI decreases at a sublinear rate.
Lemma A.3.
Given a fixed geographic region in which location data is collected,
(13) |
Proof.
corresponds to the number of users who registered a visit during epoch . Letting corresponds to increasing creating finer regional partitions within the fixed geographic region. is invariant to increasing , since the same users are observed over the same time. It follows that . ∎
Theorem A.4.
(Convergence of empirical marginals to uniform distribution under -DP) Let be the global sensitivity and suppose that -DP is applied on an aggregate release with post-processed non-negative counts. If the original raw counts are strongly sparse, then the empirical space and time marginals, and , each converge to discrete uniform distributions:
-
•
in distribution as
-
•
in distribution as
Proof.
We first consider . It suffices to show that as , for each .
Let and let . Recall that ε-DP with post-processed non-negative counts is obtained by , where is the true number of visits by users in to and are i.i.d. Laplacian noise samples (Section 2.2.1). By definition,
We now express , for some , in order to apply Lemma A.5 later. Since , there are three cases:
We therefore have
(14) |
By sparsity, for each
(15) |
Also, by the Strong Law of Large Numbers, since are i.i.d., and , we have
By linearity,
Hence, in all three possible cases,
This allows us to simplify
Since for all , , Lemma A.5 implies . Hence, by the Strong Law of Large Numbers,
Finally, for any set of ROIs , and any ,
A symmetric argument proves in distribution as , using Lemma A.3 instead of strong sparsity. ∎
Remark. We note that strong sparsity is assumed in Eq. (14) to prove that . Although we expect the oracle average count to be very small for most , due to the sparsity of aggregate location data, it is unlikely to observe for real data. Substituting in place of in Eq. (14) will not yield the uniform probability , but it will be a close approximation, provided that and that the number of epochs is large.
In practice, fixed dimensions for and will prevent the empirical marginals from completely converging to the uniform distribution. This is demonstrated for different noise scales on the Milan dataset (which has and ) in Figure 17.
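The trend described by Theorem A.4 can be illustrated with a small simulation under fixed dimensions. The sketch below is not the experiment behind Figure 17: it assumes a toy sparse count matrix, Laplace noise with a hypothetical scale parameter `scale` (standing in for the sensitivity divided by the privacy budget), and clipping of negative noisy counts to zero. All function and variable names are illustrative.

```python
import random

def laplace(rng, scale):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_space_marginal(counts, scale, rng):
    """Add Laplace noise to every cell, clip at zero, normalise the row sums."""
    row_sums = [
        sum(max(0.0, c + laplace(rng, scale)) for c in row) for row in counts
    ]
    total = sum(row_sums)
    return [r / total for r in row_sums]

def tv_to_uniform(pmf):
    """Total variation distance between pmf and the discrete uniform pmf."""
    u = 1.0 / len(pmf)
    return 0.5 * sum(abs(p - u) for p in pmf)

rng = random.Random(0)
n_rois, n_epochs = 20, 500
# Toy sparse aggregate: each cell of ROI s is 1 with probability 0.3 / (s + 1).
counts = [
    [1 if rng.random() < 0.3 / (s + 1) else 0 for _ in range(n_epochs)]
    for s in range(n_rois)
]
distances = {
    scale: tv_to_uniform(noisy_space_marginal(counts, scale, rng))
    for scale in (0.1, 1.0, 10.0, 100.0)
}
for scale, d in distances.items():
    print(f"noise scale {scale:>6}: TV distance to uniform = {d:.3f}")
```

With small noise the marginal keeps the skew of the toy data, whereas with large noise the clipped noise dominates every row sum and the marginal moves close to uniform. Because the number of epochs is finite, the distance does not reach zero.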
Lemma A.5.
Suppose that , with . Then, has mean
Proof.
Let . Then, its probability density function (pdf) is given by
which is symmetric about . Hence, . It follows that has the pdf
where is the Dirac delta function representing the accumulated probability mass at zero. We then evaluate
∎
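As a numerical sanity check on the kind of quantity Lemma A.5 concerns, the sketch below estimates the mean of a zero-centered Laplace variable clipped at zero and compares it with the standard closed form, which is half the scale parameter. It assumes the lemma's setting is exactly this clipped zero-centered Laplace variable; the names `scale` and `clipped_laplace_mean` are hypothetical and do not reproduce the paper's notation.

```python
import random

def clipped_laplace_mean(scale, n_samples, rng):
    """Monte Carlo estimate of E[max(0, X)] for X ~ Laplace(0, scale)."""
    total = 0.0
    for _ in range(n_samples):
        # The difference of two i.i.d. exponentials is Laplace(0, scale).
        x = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        total += max(0.0, x)
    return total / n_samples

rng = random.Random(1)
scale = 2.0
estimate = clipped_laplace_mean(scale, 200_000, rng)
# Integral of x * exp(-x / scale) / (2 * scale) over x > 0 equals scale / 2.
closed_form = scale / 2.0
print(f"Monte Carlo: {estimate:.3f}, closed form: {closed_form:.3f}")
```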
Appendix B Algorithms
In this section, we present the main algorithms required to generate synthetic traces from the released aggregate for our ZK MIA.
Algorithm 1 describes how we adapted the unicity model from Farzanehfar et al., (2021) to generate synthetic traces for ZK MIA. We note that the procedure for generating a synthetic trace can also be interpreted as running a Markov chain over the state space of spatiotemporal pairs with transition probabilities to proportional to the product of the pmfs .
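To make the sampling idea concrete, the following is a deliberately simplified sketch and not Algorithm 1 itself. It assumes that the number of visits is drawn from the activity marginal and that each visit is then drawn independently from the product of the space and time marginals. The spatial structure used by the adapted unicity model (for instance, the Delaunay triangulation) is omitted, and all names are hypothetical.

```python
import random

def sample_synthetic_trace(space_pmf, time_pmf, activity_pmf, rng):
    """Return a set of (roi, epoch) visits drawn from the given marginals.

    Simplification: visits are independent draws from the product of the
    space and time pmfs, so dependence between consecutive visits is ignored.
    activity_pmf[n] is the probability of drawing n visits.
    """
    n_visits = rng.choices(range(len(activity_pmf)), weights=activity_pmf, k=1)[0]
    rois = rng.choices(range(len(space_pmf)), weights=space_pmf, k=n_visits)
    epochs = rng.choices(range(len(time_pmf)), weights=time_pmf, k=n_visits)
    # Repeated (roi, epoch) pairs collapse, so a trace may hold fewer visits.
    return set(zip(rois, epochs))

rng = random.Random(2)
space_pmf = [0.5, 0.3, 0.2]            # toy marginal over 3 ROIs
time_pmf = [0.25, 0.25, 0.25, 0.25]    # toy marginal over 4 epochs
activity_pmf = [0.0, 0.2, 0.5, 0.3]    # 1, 2 or 3 visits per user
traces = [
    sample_synthetic_trace(space_pmf, time_pmf, activity_pmf, rng)
    for _ in range(1000)
]
print(traces[:3])
```

A reference set built this way would then play the role of the individual traces used to train the attack.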
Algorithm 2 estimates the three marginal probability distributions required to run Algorithm 1: the space marginal , the time marginal , and the activity marginal from an aggregate release . We estimate the marginals via our denoising and debiasing techniques (from Section 4.3.1), depending on the application of privacy measures on .
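For a raw aggregate, the empirical space and time marginals can be read off as normalised row and column sums. The sketch below shows only this baseline step, assuming a count matrix indexed as ROIs by epochs. It does not include the denoising and debiasing corrections of Section 4.3.1 or the activity marginal, and the function name is hypothetical.

```python
def empirical_marginals(aggregate):
    """Normalised row and column sums of a count matrix (ROIs x epochs)."""
    total = sum(sum(row) for row in aggregate)
    if total == 0:
        raise ValueError("aggregate contains no visits")
    space = [sum(row) / total for row in aggregate]
    time = [sum(col) / total for col in zip(*aggregate)]
    return space, time

# Toy aggregate with 3 ROIs (rows) and 3 epochs (columns).
aggregate = [
    [3, 0, 1],
    [0, 2, 0],
    [1, 1, 2],
]
space_marginal, time_marginal = empirical_marginals(aggregate)
print(space_marginal, time_marginal)
```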
Algorithm 3 describes our procedure for achieving an estimate for the mean number of visits per user in the dataset given a privacy-aware aggregate release. Recall that is set to . Algorithm 4 describes our procedure for computing which degree will work best in the power transformation, to correct the empirical marginal obtained directly from a -DP aggregate release.
Appendix C Accuracy Results
In this section, we present the accuracy scores of ZK MIA and KK MIA for the experiments on suppression of small counts and ε-DP noise addition.
Table 3 presents the accuracy scores obtained by ZK and KK from the experiments on suppression of small counts from Section 6.1. Table 4 presents the accuracy scores obtained by ZK and KK from the experiments on event-level ε-DP from Section 6.2. Table 5 presents the accuracy scores obtained by ZK and KK from the experiments on user-day-level ε-DP.
We observe that the accuracy scores of KK and ZK are close in each experiment, as observed already with the AUC metric in the main text.
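For readers comparing the two metrics, the sketch below computes both from the same attack scores on toy data. It assumes that the attack outputs a membership score per test aggregate and that accuracy is measured at a fixed threshold of 0.5; the threshold, the data, and the function names are illustrative assumptions rather than details of our evaluation.

```python
def auc(scores, labels):
    """Probability that a random member outscores a random non-member
    (ties count one half): the Mann-Whitney estimate of the AUC."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold=0.5):
    """Fraction of correct predictions when scores >= threshold mean 'member'."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy scores: members (label 1) always rank above non-members (label 0),
# but one member falls below the 0.5 threshold.
scores = [0.9, 0.8, 0.45, 0.4, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(scores, labels)
toy_accuracy = accuracy(scores, labels)
print(f"AUC = {toy_auc:.3f}, accuracy = {toy_accuracy:.3f}")
```

AUC depends only on the ranking of the scores, whereas accuracy also depends on the chosen threshold, which is why the two metrics can differ for the same attack.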
CDR dataset | Milan dataset | |||
---|---|---|---|---|
KK | ZK | KK | ZK | |
0 | ||||
1 | ||||
2 | ||||
3 | ||||
4 | ||||
5 |
CDR dataset | Milan dataset | |||
---|---|---|---|---|
KK | ZK | KK | ZK | |
0.1 | ||||
0.5 | ||||
1.0 | ||||
5.0 | ||||
10.0 |
CDR dataset | Milan dataset | |||
---|---|---|---|---|
KK | ZK | KK | ZK | |
0.1 | ||||
0.5 | ||||
1.0 | ||||
5.0 | ||||
10.0 |
Appendix D Additional Experiments
D.1. Varying the size of the aggregate
Since ZK MIA requires the estimation of statistics from the aggregate, there may be concerns about its performance when the aggregate size is small. However, like previous MIAs, ZK MIA is more effective on smaller aggregates than on larger ones. This is shown in Figure 11 for aggregate sizes and different privacy budgets .
To further understand how MIA performance scales with aggregate size , we also consider in this experiment. To this end, we vary and compare the performance of KK MIA and ZK MIA on raw () and suppressed () aggregates. Results on the CDR dataset are reported in Tables 6 and 8 and results on the Milan dataset are reported in Tables 7 and 9. was not run on the Milan dataset due to size limitations.
In these settings with mild privacy protection, the attacks always succeed regardless of the value of . We also observe a few intuitive trends. First, when raw aggregates () are attacked, increasing the size of the aggregates slowly decreases the performance of the attack. On the CDR dataset, KK and ZK attain AUCs and for , which decrease to and for . Second, when we apply suppression , the attacks initially perform poorly when the aggregate size is small. We hypothesize that this is due to a larger percentage of entries being suppressed when fewer traces are aggregated, leaving less information in the release. This effect gradually decreases as the aggregate size increases. It is then counterbalanced by the first effect, namely that larger aggregates slowly decrease the performance of the attack. This is visible for in the CDR dataset. For the Milan dataset, however, the AUC still increases monotonically even beyond , as the dataset is more sensitive to suppression: the average user has approximately times fewer visits, as shown in Table 14(b).
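The hypothesis that small aggregates lose a larger share of their entries to suppression can be illustrated on toy data. The sketch below assumes that suppression sets every count at or below a threshold `k` to zero and that users visit ROIs with skewed popularity. Neither the exact suppression rule nor the toy mobility model is taken from our experiments, and all names are hypothetical.

```python
import random

def suppress(counts, k):
    """Zero out every count that is at most k (assumed suppression rule)."""
    return [[c if c > k else 0 for c in row] for row in counts]

def suppressed_fraction(n_users, k, rng, n_rois=10, n_epochs=24, visits=20):
    """Fraction of non-zero cells removed by suppression in a toy aggregate."""
    counts = [[0] * n_epochs for _ in range(n_rois)]
    weights = [1.0 / (s + 1) for s in range(n_rois)]  # skewed ROI popularity
    for _ in range(n_users):
        # Each toy user makes `visits` draws; a repeated cell counts once per user.
        cells = set()
        for _ in range(visits):
            s = rng.choices(range(n_rois), weights=weights)[0]
            t = rng.randrange(n_epochs)
            cells.add((s, t))
        for s, t in cells:
            counts[s][t] += 1
    released = suppress(counts, k)
    nonzero = sum(c > 0 for row in counts for c in row)
    kept = sum(c > 0 for row in released for c in row)
    return (nonzero - kept) / nonzero

rng = random.Random(4)
fractions = {m: suppressed_fraction(m, k=1, rng=rng) for m in (10, 100, 1000)}
for m, frac in fractions.items():
    print(f"aggregate size {m:>4}: {frac:.1%} of non-zero entries suppressed")
```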
KK | ZK | |
---|---|---|
100 | ||
500 | ||
1000 | ||
2000 | ||
3000 |
KK | ZK | |
---|---|---|
100 | ||
500 | ||
1000 | ||
2000 |
KK | ZK | |
---|---|---|
100 | ||
500 | ||
1000 | ||
2000 | ||
3000 |
KK | ZK | |
---|---|---|
100 | ||
500 | ||
1000 | ||
2000 |
![Refer to caption](x17.png)
D.2. Increasing the size of ZK synthetic reference
Figure 12 illustrates how increasing the number of synthetic traces available to the attacker improves the MIA’s performance, with diminishing returns.
![Refer to caption](x18.png)
D.3. No time information
In this experiment, we assume that the adversary only has access to some of the locations that the target has visited, without knowing the epochs during which the visits occurred. For example, the adversary may know the target’s home and work. To model this attack setting, we suppose that the adversary either knows the target’s top- most visited ROIs, for , or the full set of the target’s visited ROIs during the observation period. In one implementation, which we call “greedy”, the adversary assumes that the target visits each known ROI during every epoch in the observation period. This ensures that the visits to these ROIs are reflected in the target trace, but it also introduces many incorrect visits. Results are presented in Table 11. In our second implementation, which we call “random sampling”, the adversary distributes the target’s visits uniformly across the known ROIs. For example, if the adversary knows the top- ROIs, and their estimate for the mean number of visits per user is , then they would sample visits for each of the top- ROIs. The corresponding epochs for each visit are sampled from the estimated time marginal. For simplicity, we assume that is the true mean number of visits and that the estimated time marginal is the true one. Results on raw aggregates of size are presented in Table 10.
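The two ways of approximating the target trace can be sketched as follows. This is an illustrative reading of the description above, not our implementation: traces are represented as sets of (ROI, epoch) pairs, and the random sampling variant assumes that the mean number of visits is divided evenly among the known ROIs and rounded to an integer. The function names and the rounding rule are assumptions.

```python
import random

def greedy_trace(known_rois, n_epochs):
    """Assume the target visits every known ROI during every epoch."""
    return {(s, t) for s in known_rois for t in range(n_epochs)}

def random_sampling_trace(known_rois, mean_visits, time_pmf, rng):
    """Split about mean_visits visits evenly over the known ROIs and draw
    the epoch of each visit from the time marginal."""
    per_roi = max(1, round(mean_visits / len(known_rois)))  # assumed rounding
    trace = set()
    for s in known_rois:
        epochs = rng.choices(range(len(time_pmf)), weights=time_pmf, k=per_roi)
        # Repeated epochs collapse, so an ROI may receive fewer than per_roi visits.
        trace.update((s, t) for t in epochs)
    return trace

rng = random.Random(3)
known_rois = [4, 7]                              # hypothetical top-2 ROIs
time_pmf = [0.1, 0.1, 0.2, 0.3, 0.2, 0.1]        # toy time marginal, 6 epochs
greedy = greedy_trace(known_rois, n_epochs=len(time_pmf))
sampled = random_sampling_trace(known_rois, mean_visits=6, time_pmf=time_pmf, rng=rng)
print(len(greedy), "greedy visits;", len(sampled), "sampled visits")
```

The greedy trace is guaranteed to contain every true visit to a known ROI at the cost of many spurious visits, while the sampled trace is sparser but may miss the epochs at which the target was actually present.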
Table 10 shows that both MIAs perform poorly () when the adversary uses random sampling to approximate the target trace. This suggests that random sampling fails to estimate the target trace, due to the omission of true target visits, and the inclusion of incorrect visits.
In contrast, Table 11 shows that KK was able to perform significantly better than random when the adversary knew more than of the target’s most visited ROIs and used the greedy implementation (e.g., on Milan when knowing all visited ROIs). This suggests that, although the greedy implementation includes many incorrect visits, the guaranteed inclusion of some of the target’s actual visits enables membership inference to an extent. ZK, on the other hand, fails to attain AUC. Since ZK already replaces real individual traces with synthetic traces, we hypothesize that membership inference becomes too difficult if the estimated target trace contains significantly incorrect information.
We note, however, that our current implementation for sampling the visits under this prior knowledge might be suboptimal and that better implementations might exist. For example, Zhang et al., (2020) construct a synthetic target trace using social network information and the traces of the target’s friends. We leave this exploration for future work.
Dataset | Knock Knock | Zero Auxiliary Knowledge | ||||||
---|---|---|---|---|---|---|---|---|
Top 1 | Top 2 | Top 3 | All | Top 1 | Top 2 | Top 3 | All | |
CDR | ||||||||
Milan |
Dataset | Knock Knock | Zero Auxiliary Knowledge | ||||||
---|---|---|---|---|---|---|---|---|
Top 1 | Top 2 | Top 3 | All | Top 1 | Top 2 | Top 3 | All | |
CDR | ||||||||
Milan |
Appendix E Additional Plots
We present additional figures demonstrating statistics related to the location datasets.
![Refer to caption](x19.png)
![Refer to caption](x20.png)
![Refer to caption](x21.png)
![Refer to caption](x22.png)
![Refer to caption](x23.png)
![Refer to caption](x24.png)
![Refer to caption](x25.png)
![Refer to caption](x26.png)
![Refer to caption](x27.png)
![Refer to caption](x28.png)
![Refer to caption](x29.png)