
A Zero Auxiliary Knowledge Membership Inference Attack on Aggregate Location Data

Vincent Guan (ORCID 0009-0005-5218-9233), Imperial College London; Florent Guépin (ORCID 0009-0008-5098-0963), Imperial College London; Ana-Maria Cretu (ORCID 0000-0002-9009-7381), EPFL; and Yves-Alexandre de Montjoye (ORCID 0000-0002-2559-5616), Imperial College London
Abstract.

Location data is frequently collected from populations and shared in aggregate form to guide policy and decision making. However, the prevalence of aggregated data also raises the privacy concern of membership inference attacks (MIAs). MIAs infer whether an individual’s data contributed to the aggregate release. Although effective MIAs have been developed for aggregate location data, these require access to an extensive auxiliary dataset of individual traces over the same locations, collected from a similar population. This assumption is often impractical given common privacy practices surrounding location data. To measure the risk of an MIA performed by a realistic adversary, we develop the first Zero Auxiliary Knowledge (ZK) MIA on aggregate location data, which eliminates the need for an auxiliary dataset of real individual traces. Instead, we develop a novel synthetic approach, in which suitable synthetic traces are generated from the released aggregate. We also develop methods to correct for bias and noise, to show that our synthetic-based attack is still applicable when privacy mechanisms are applied prior to release. Using two large-scale location datasets, we demonstrate that our ZK MIA matches the state-of-the-art Knock-Knock (KK) MIA across a wide range of settings, including popular implementations of differential privacy (DP) and suppression of small counts. Furthermore, we show that ZK MIA remains highly effective even when the adversary only knows a small fraction (10%) of their target’s location history. This demonstrates that effective MIAs can be performed by realistic adversaries, highlighting the need for strong DP protection.

Location data, Membership Inference Attack, Synthetic Data
Footnote 1. These authors contributed equally to this work.

1. Introduction

Human mobility and location data are widely used across many important domains, such as epidemiology (Hara and Yamaguchi, 2021; Grantz et al., 2020), humanitarian response (Yabe et al., 2022), and finance (Holmes et al., 2013), as they offer insights into movement and density patterns. However, many people are concerned about the extensive collection of personal location data (Van Zoonen, 2016; Zhou, 2017; Hope, 2021), particularly since this data may provide information regarding a person’s social, economic, and political life (Georgiadou et al., 2019).

Individual-level location datasets have been shown to be highly vulnerable to re-identification attacks, due to the unicity and temporal consistency of people’s mobility patterns (Zang and Bolot, 2011; de Montjoye et al., 2013; Tournier and de Montjoye, 2022). To address these privacy concerns, data practitioners commonly use aggregate statistics, instead of individual-level records (Aktay et al., 2020; Popa et al., 2011; Xu et al., 2015). For example, the Public Health Agency of Canada studied citizens’ movement during the COVID-19 pandemic, using aggregate location data from millions of mobile devices, provided by TELUS (Office of the Privacy Commissioner of Canada, 2023; Oli, 2021). British researchers conducted similar COVID-19 mobility analysis (Jeffrey et al., 2020; Trasberg and Cheshire, 2023), using aggregate location data obtained from O2 and Facebook. Because aggregate location data is often considered to be sufficiently de-identified (Office of the Privacy Commissioner of Canada, 2023), it is commonly sold by data brokers to interested parties (Savage, 2021; Boorstein and Kelly, 2023). Notably, the U.S. government has been criticized for using commercial aggregate location data for law enforcement purposes and military intelligence (Savage, 2021). Aggregate location data is also used in other sectors, such as urban design, to optimize public transit networks (Morgan and Lovelace, 2021; Kakakhel, 2022; O2, 2019), and finance, to understand consumer behaviour (SafeGraph, 2024; Precisely, 2024).

Figure 1. Adversary’s prior knowledge in the previous work, Knock-Knock MIA (Pyrgelis et al., 2017) (left), and our work, Zero Auxiliary Knowledge MIA (right). The ZK adversary does not require knowledge of location traces of real people to run the MIA.

Motivation. As outlined in the E.U. Article 29 Working Party’s guidance on anonymization techniques, aggregation reduces the risk of re-identification but does not eliminate all privacy risks (wp2, 2014). In particular, aggregates may still be vulnerable to membership inference attacks (MIAs), whose goal is to infer if an individual’s data was included in the data release, e.g. aggregate data. MIAs have become the de facto standard in privacy auditing due to their practical threat model and theoretical properties. From a practical perspective, a successful MIA is a direct privacy violation whenever participation in the data release is sensitive (Li et al., 2013). Furthermore, MIAs can be used as building blocks for other attacks, by first inferring a user’s participation and then inferring their sensitive attributes. From a theoretical perspective, the success rate of an MIA is upper bounded following the application of differential privacy (DP) (Dwork et al., 2006; Yeom et al., 2018; Humphries et al., 2023). Hence, MIAs can be used as an auditing tool for DP implementations (Jagielski et al., 2020; Nasr et al., 2021). Today, MIAs are widely used to assess the privacy risk of a broad range of data releases, including aggregate genetic data (Homer et al., 2008; Sankararaman et al., 2009), aggregate survey data (Bauer and Bindschaedler, 2020), aggregate location data (Pyrgelis et al., 2017; Oehmichen et al., 2019), machine learning models (Shokri et al., 2017; Jayaraman and Evans, 2019; Nasr et al., 2021) and synthetic data releases (Stadler et al., 2022; Houssiau et al., 2022; Meeus et al., 2023; Guépin et al., 2023).

MIAs pose an especially strong privacy threat to aggregate location data, since location data is often processed alongside sensitive attributes, such as socioeconomic status (Trasberg and Cheshire, 2023) and vaccination status (Hope, 2021). In a notable example, a high-ranking priest resigned after being outed as homosexual by a radical group that matched his smartphone data with location data from Grindr, a popular dating app among the LGBTQ+ community (Boorstein and Kelly, 2023). It is therefore important to understand the practical risk that MIAs pose to aggregate location data, particularly when performed by a realistic adversary, who only possesses information about their target.

The first and most prominent MIA on aggregate location data was proposed by Pyrgelis et al. (2017). Their “Knock-Knock” (KK) MIA works by training a binary classifier on a set of aggregates, wherein the adversary includes the target trace half of the time, and labels the aggregates accordingly. However, in addition to knowing the target trace, KK MIA requires the adversary to have access to a large auxiliary dataset of individual-level traces over the same locations and from a similar population as in the aggregate release. For location data, this is a very strong assumption. This reliance on a strong adversary has led companies and practitioners to dismiss the risk posed by MIAs on location data. To the best of our knowledge, all previous works studying MIAs on aggregate location data require a similar auxiliary dataset (Pyrgelis et al., 2020; Zhang et al., 2020; Oehmichen et al., 2019).

Contributions. To assess the realistic privacy risk of releasing aggregate location data, we introduce the Zero Auxiliary Knowledge (ZK) MIA. ZK MIA is the first MIA on aggregate location data that does not require the adversary to have access to an auxiliary dataset. To remove this strong assumption, we develop a novel synthetic data-based approach, in which the adversary generates a reference dataset of synthetic traces, using only statistical parameters estimated from the aggregate. Training aggregates are then created using the synthetic reference. To account for privacy mechanisms applied to the release, we develop techniques to correct the parameter estimation for bias and noise, which enables ZK MIA to effectively attack privacy-aware aggregates as well. We also demonstrate that a paired sampling technique further improves MIA performance by isolating the contribution of the target trace within the high-dimensional aggregate. In the setting of $\varepsilon$-DP aggregate location data, we show that paired sampling enables MIAs to approach the worst-case $\varepsilon$-DP bound, offering a significant increase in performance over previous implementations.

We evaluate our Zero Auxiliary Knowledge (ZK) MIA against the state-of-the-art Knock-Knock (KK) MIA from Pyrgelis et al. (2017) using two location datasets: i) a large-scale call detail record (CDR) dataset, and ii) the Milan Twitter dataset (SpazioDati and di Milano, 2015) from the Telecom Italia Big Data Challenge (Barlacchi et al., 2015). We apply the MIAs on raw and privacy-aware aggregates computed over $1000$ users. Our results show that our ZK MIA closely matches the performance of KK MIA, without depending on extensive prior knowledge. On raw aggregates, both MIAs achieve $0.99$ AUC on both datasets, suggesting that aggregation in itself is an ineffective safeguard. Both MIAs also surpass $0.9$ AUC on both datasets under common privacy settings, including $\varepsilon = 1$ event-level DP noise addition.

We further relax assumptions and show that the adversary does not need the full target trace for ZK MIA to succeed. Indeed, ZK MIA still achieved $0.84$ AUC on the CDR dataset, with $\varepsilon = 1$ event-level DP in place, when the adversary only knew a random 10% of the target trace.

After extensive evaluations across different privacy mechanisms, namely the suppression of small counts (Chen et al., 2009) and $\varepsilon$-DP noise addition (Dwork et al., 2006), we argue that the commonly used $\varepsilon$-DP implementations on aggregate location data (Desfontaines, 2021) do not protect against realistic privacy threats, such as our ZK MIA. We conclude that the only effective mitigation is the application of strong user-level DP or user-day level DP guarantees, which is not yet a common practice (Telus, 2024; Martínez-Durive et al., 2023; O2, 2019; SafeGraph, 2024; Precisely, 2024).

2. Definitions and Threat Model

We formally define location traces and aggregates in Section 2.1 and give an overview of aggregate-level privacy measures in Section 2.2. In Sections 2.3 and 2.4, we outline the membership inference problem on aggregate location data and introduce the concept of a membership classifier. We present the threat model for our Zero Auxiliary Knowledge MIA and compare it against previous threat models for MIAs on location aggregates in Section 2.5. Table 1 of the Appendix contains a glossary of common terms.

2.1. Location Traces and Aggregates

Let $\mathcal{S} = \{s_1, \ldots, s_{|\mathcal{S}|}\}$ represent the set of all regions of interest (ROIs) where location data is collected. Similarly, $\mathcal{T} = \{t_1, \ldots, t_{|\mathcal{T}|}\}$ denotes the set of time intervals, also known as epochs, during which data collection occurs. In this paper, we assume that the geographic positions (i.e. approximate longitude and latitude) of the ROIs are known. For example, $\mathcal{S}$ may represent a set of square regions that partition a city into a grid, and $\mathcal{T}$ may represent contiguous hours over one month.

We focus on the scenario where location data of a set of users $\Omega$ is collected over the ROIs $\mathcal{S}$ and the epochs $\mathcal{T}$. We define the location trace $L^u$ of a user $u \in \Omega$ as the set of geo-tagged and time-stamped visits $(s,t)$ that $u$ made within $\mathcal{S}$ during $\mathcal{T}$. We formally represent a user’s location trace as the binary matrix

$$L^{u}_{s,t} = \begin{cases} 1 & \text{if user $u$ visited ROI $s$ during epoch $t$} \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$
Figure 2. Example of how suppression of small counts and differential privacy may be applied to an aggregate with 3 ROIs (rows) and 3 epochs (columns).

Let $\mathcal{U} \subset \Omega$ be a group of $m$ users whose location data is aggregated. We define an aggregate $A^{\mathcal{U}}$ to be the aggregate count statistics for $\mathcal{U}$ over $\mathcal{S} \times \mathcal{T}$. Formally, this is defined by the sum

$$A^{\mathcal{U}} = \sum_{u \in \mathcal{U}} L^{u}. \tag{2}$$

The entry $A^{\mathcal{U}}_{s,t}$ therefore corresponds to the number of users in $\mathcal{U}$ who visited ROI $s$ during epoch $t$.
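To make Equations (1) and (2) concrete, the following is a minimal Python sketch of how binary traces and their aggregate could be computed. The helper names `trace_matrix` and `aggregate`, the nested-list representation, and the toy visits are illustrative assumptions, not part of the original work.

```python
from typing import List, Set, Tuple

Visit = Tuple[int, int]   # (ROI index s, epoch index t), both 0-based (assumption)
Matrix = List[List[int]]  # |S| x |T| matrix stored as nested lists


def trace_matrix(visits: Set[Visit], n_rois: int, n_epochs: int) -> Matrix:
    """Binary trace of Equation (1): entry [s][t] is 1 iff the user visited s during t."""
    trace = [[0] * n_epochs for _ in range(n_rois)]
    for s, t in visits:
        trace[s][t] = 1
    return trace


def aggregate(traces: List[Matrix]) -> Matrix:
    """Aggregate of Equation (2): entry-wise sum of the users' binary traces."""
    n_rois, n_epochs = len(traces[0]), len(traces[0][0])
    total = [[0] * n_epochs for _ in range(n_rois)]
    for trace in traces:
        for s in range(n_rois):
            for t in range(n_epochs):
                total[s][t] += trace[s][t]
    return total


# Toy example with made-up data: 3 users, 2 ROIs, 3 epochs.
toy_visits = [{(0, 0), (1, 2)}, {(0, 0), (0, 1)}, {(1, 2)}]
toy_traces = [trace_matrix(v, n_rois=2, n_epochs=3) for v in toy_visits]
toy_aggregate = aggregate(toy_traces)
print(toy_aggregate)  # [[2, 1, 0], [0, 0, 2]]
```

In this toy aggregate, the entry for ROI 0 and epoch 0 equals 2 because two of the three users visited that ROI during that epoch.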

2.2. Privacy Measures on Location Aggregates

The data collector may be wary of the privacy risks of releasing the raw aggregate $A^{\mathcal{U}}$, and therefore apply privacy measures before releasing it.

2.2.1. Differential Privacy

Differential privacy (DP) (Dwork et al.,, 2006) is considered the gold standard for releasing information while protecting the privacy of individuals with formal guarantees. In essence, DP requires that the output of a computation over a dataset should not depend too much on the inclusion of any one record.

Definition 0 ($\varepsilon$-DP (Dwork et al., 2006)).

A randomized algorithm $M$ satisfies $\varepsilon$-DP if for all neighbouring datasets $D_1 \sim D_2$ (i.e., differing in exactly one record), and all possible outputs $S \subset \mathrm{Range}(M)$:

$$\Pr(M(D_1) \in S) \leq e^{\varepsilon} \Pr(M(D_2) \in S) \tag{3}$$

Thus, $\varepsilon$-DP limits the amount of information that can be inferred about individual records in the dataset, according to the privacy budget $\varepsilon$ (Dwork et al., 2006). However, the privacy protection depends on what one considers as a “record”, or privacy unit, when defining the neighbouring datasets. The most common definitions, in increasing level of privacy protection, are event-level, user-day level, and user-level DP (see Desfontaines (2021) for an overview). The privacy unit for event-level DP is an individual data entry by any given user. For aggregate location data, this would be a single visit by a user to a ROI $s$ during an epoch $t$. The privacy unit for user-day level would be all visits registered by any given user over a day. Finally, for user-level, the unit would be all visits in any given user’s trace.

Randomised $\varepsilon$-DP mechanisms can be designed by adding noise sampled from the Laplace distribution (Dwork et al., 2006). In particular, $A_{DP}^{\mathcal{U}}(\varepsilon) = A^{\mathcal{U}} + \mathrm{Lap}\left(\frac{\Delta}{\varepsilon}\right)$ satisfies (3) and is an $\varepsilon$-DP aggregation mechanism, where $\Delta$ is the global sensitivity, determined by the privacy unit.

In this paper, we assume the common practice of post-processing to ensure legitimate aggregate counts (Ge and Fukuda, 2016; Pyrgelis et al., 2017, 2020; Zhu et al., 2022). Negative counts are set to $0$, counts exceeding the group size $m$ are set to $m$, and counts are rounded down to the nearest integer. These transformations will preserve $\varepsilon$-DP due to the post-processing theorem (Dwork et al., 2019). We note that the adversary can always apply these transformations themselves if the data collector does not do so already.
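As a rough illustration of the Laplace noise addition and post-processing described above, the sketch below perturbs each count and then clips and floors it. The function names, the seed handling, and the default `sensitivity=1.0` (one plausible choice for event-level DP when neighbouring datasets differ by adding or removing a single visit) are assumptions made for this example, not the authors' implementation.

```python
import math
import random
from typing import List, Optional

Matrix = List[List[int]]


def laplace_noise(scale: float, rng: random.Random) -> float:
    """One Laplace(0, scale) sample, drawn as the difference of two Exp(1) samples."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def dp_aggregate(agg: Matrix, epsilon: float, m: int,
                 sensitivity: float = 1.0, seed: Optional[int] = None) -> Matrix:
    """Add Lap(sensitivity / epsilon) noise to every count, then post-process.

    Post-processing follows the text: negative counts become 0, counts above
    the group size m become m, and counts are rounded down to an integer.
    """
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    released = []
    for row in agg:
        out_row = []
        for count in row:
            noisy = count + laplace_noise(scale, rng)
            clipped = min(max(noisy, 0.0), float(m))
            out_row.append(math.floor(clipped))
        released.append(out_row)
    return released


# Toy aggregate over m = 3 users (made-up values).
noisy_release = dp_aggregate([[2, 1, 0], [0, 0, 2]], epsilon=1.0, m=3, seed=0)
print(noisy_release)
```

Because clipping and flooring only transform the already-noised counts, they do not consume additional privacy budget, which matches the post-processing argument in the text.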

2.2.2. Suppression of Small Counts (SSC)

SSC is a privacy mechanism that aims to protect user privacy by hiding rare values. It has been frequently used across different types of datasets (Cretu et al., 2022; Gadotti et al., 2019; Pyrgelis et al., 2020), including mobility datasets (Kohli et al., 2023; Aktay et al., 2020). Instead of releasing the raw aggregate $A^{\mathcal{U}}$, the data collector may choose a threshold $k \in \mathbb{N} \cup \{0\}$, and release the suppressed aggregate

$$A_{SSC}^{\mathcal{U}}(k)_{s,t} = \begin{cases} A^{\mathcal{U}}_{s,t} & \text{if } A^{\mathcal{U}}_{s,t} > k \\ 0 & \text{if } A^{\mathcal{U}}_{s,t} \leq k. \end{cases} \tag{4}$$

$A_{SSC}^{\mathcal{U}}(k)$ therefore contains the true count of users who visited a ROI $s$ during epoch $t$ as long as the count exceeds $k$. Less-visited pairs $(s,t)$ that record $k$ or fewer visits are reported as $0$ instead. Suppression can also be applied following $\varepsilon$-DP noise addition, such that (4) is applied on a noisy aggregate $A_{DP}^{\mathcal{U}}(\varepsilon)$. This produces $A_{DP,SSC}^{\mathcal{U}}(\varepsilon,k)$, an $\varepsilon$-DP aggregate whose final counts have been suppressed with threshold $k$. This transformation would preserve $\varepsilon$-DP due to post-processing (Dwork et al., 2019), and may add a layer of complexity that mitigates attacks in practice.
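The suppression rule of Equation (4) is simple enough to state in a few lines of Python. This is only an illustrative sketch; the function name `suppress_small_counts` and the toy counts are assumptions introduced here.

```python
from typing import List

Matrix = List[List[int]]


def suppress_small_counts(agg: Matrix, k: int) -> Matrix:
    """Equation (4): keep counts strictly greater than k, report the rest as 0."""
    return [[count if count > k else 0 for count in row] for row in agg]


# Toy aggregate (made-up values) suppressed with threshold k = 2.
toy_counts = [[5, 1, 0], [2, 3, 7]]
suppressed = suppress_small_counts(toy_counts, k=2)
print(suppressed)  # [[5, 0, 0], [0, 3, 7]]
```

Note that a count equal to the threshold (here, the 2 in the second row) is suppressed as well, since only counts strictly above $k$ are released. Applying the same function to an already-noised aggregate would give the combined DP and SSC release described above.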

2.3. Problem Formulation

We assume that the data collector releases aggregate count statistics $\overline{A}^{\mathcal{U}}$ over the ROIs $\mathcal{S}$ and the epochs $\mathcal{T}$, for the $m$ users in the group $\mathcal{U}$. There are various cases depending on the privacy measures applied prior to release:

$$\overline{A}^{\mathcal{U}} = \begin{cases} A^{\mathcal{U}}, & \text{if the raw aggregate counts are released,} \\ A_{DP}^{\mathcal{U}}(\varepsilon), & \text{if only $\varepsilon$-DP is applied,} \\ A_{SSC}^{\mathcal{U}}(k), & \text{if only threshold $k$ SSC is applied,} \\ A_{DP,SSC}^{\mathcal{U}}(\varepsilon,k), & \text{if threshold $k$ SSC is applied after $\varepsilon$-DP.} \end{cases}$$

The goal of an adversary $Adv$ performing an MIA on $\overline{A}^{\mathcal{U}}$ is to determine whether their target $u^*$ contributed to $\overline{A}^{\mathcal{U}}$, inferring IN for $u^* \in \mathcal{U}$ and OUT for $u^* \notin \mathcal{U}$.

2.4. Membership Classifier

Given an aggregate release $\overline{A}^{\mathcal{U}}$ over $m$ users, an adversary infers membership of the target $u^*$ within the aggregation group $\mathcal{U}$ by using a binary membership classifier. Classifiers are commonly instantiated as machine learning models (Pyrgelis et al., 2017, 2020; Zhang et al., 2020), but statistical models, like the log-likelihood function, have been applied as well (Homer et al., 2008; Bauer and Bindschaedler, 2020). To train the classifier, $Adv$ typically creates a balanced set of labeled training aggregates of size $m$ (Pyrgelis et al., 2017; Zhang et al., 2020; Pyrgelis et al., 2020; Oehmichen et al., 2019). Half of the aggregates include the target trace $L^{u^*}$ and are labeled IN, and the other half exclude it and are labeled OUT. Training the classifier will create a decision boundary in the underlying space of aggregate releases (Bishop and Nasrabadi, 2006). In the case of aggregate location data over ROIs $\mathcal{S}$ and epochs $\mathcal{T}$, the decision boundary is characterized by a hypersurface that partitions the matrix space $\mathbb{R}^{|\mathcal{S}| \times |\mathcal{T}|}$ into two sets.
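The construction of a balanced training set can be sketched as follows, assuming the adversary holds some pool of reference traces (real traces in prior work, synthetic ones in ZK MIA). The function `make_training_set`, the alternating labeling scheme, and the assumption that the reference pool excludes the target are illustrative choices for this example, and the classifier itself is omitted.

```python
import random
from typing import List, Optional, Tuple

Matrix = List[List[int]]


def sum_traces(traces: List[Matrix]) -> Matrix:
    """Entry-wise sum of binary traces, i.e. a size-len(traces) aggregate."""
    n_rois, n_epochs = len(traces[0]), len(traces[0][0])
    return [[sum(tr[s][t] for tr in traces) for t in range(n_epochs)]
            for s in range(n_rois)]


def make_training_set(target: Matrix, reference: List[Matrix], m: int,
                      n_samples: int, seed: Optional[int] = None
                      ) -> List[Tuple[Matrix, int]]:
    """Build labeled size-m aggregates: label 1 = IN (target included), 0 = OUT.

    Assumes `reference` holds at least m traces and does not contain the target.
    """
    rng = random.Random(seed)
    data = []
    for i in range(n_samples):
        label = i % 2  # alternate labels so the set is balanced for even n_samples
        if label == 1:
            group = rng.sample(reference, m - 1) + [target]
        else:
            group = rng.sample(reference, m)
        data.append((sum_traces(group), label))
    return data


# Toy setup with made-up data: 10 random reference traces over 2 ROIs and 3 epochs.
demo_rng = random.Random(0)
reference = [[[demo_rng.randint(0, 1) for _ in range(3)] for _ in range(2)]
             for _ in range(10)]
target = [[1, 0, 0], [0, 0, 1]]
training = make_training_set(target, reference, m=4, n_samples=6, seed=1)
print([label for _, label in training])  # [0, 1, 0, 1, 0, 1]
```

Each pair in `training` could then be flattened into a feature vector and passed to any binary classifier. Every IN aggregate contains the target's visits, so its counts at those entries are at least 1.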

2.5. Threat Model

In this section, we present our Zero Auxiliary Knowledge (ZK) MIA threat model. It is commonly assumed in MIAs across various domains that the adversary has access to an auxiliary dataset and complete knowledge of the target record (Pyrgelis et al., 2017; Nasr et al., 2019; Salem et al., 2018; Truex et al., 2019; Yeom et al., 2018; Shokri et al., 2017). Our ZK threat model relaxes both assumptions by eliminating the need for an auxiliary dataset and allowing for only partial knowledge of the target trace.

For context, we also describe threat models of previous MIAs on aggregate location data. All threat models consider an adversary $Adv$, whose goal is to determine whether a specific target user $u^*$ is included in the released aggregate $\overline{A}^{\mathcal{U}}$. The aggregate is computed across $m$ users over ROIs $\mathcal{S}$ and epochs $\mathcal{T}$. We assume that the locations of the ROIs $\mathcal{S}$ are known.

Knock-Knock (Pyrgelis et al., 2017): The adversary has an auxiliary dataset $Ref = \{L^{u_1}, \ldots, L^{u_{|Ref|}}\}$ of user traces, over the same locations and a similar population as the released aggregate. $Ref$ has at least $m$ traces, including the full target trace $L^{u^*}$.

LocMIA (Zhang et al., 2020): The adversary knows $u^*$’s social network and has an auxiliary dataset $Ref = \{L^{u_1}, \ldots, L^{u_{|Ref|}}\}$ of user traces, over the same locations and a similar population as the released aggregate $\overline{A}^{\mathcal{U}}$. $Ref$ has at least $m$ traces, including the traces of $u^*$’s friends, but not $L^{u^*}$.

Zero Auxiliary Knowledge (ours): The adversary knows a subset of the target $u^*$’s visits. Equivalently, the adversary knows a partial target trace $\tilde{L}^{u^*}$, such that $\mathrm{supp}(\tilde{L}^{u^*}) \subset \mathrm{supp}(L^{u^*})$.

KK MIA and LocMIA rely on the adversary’s access to an extensive auxiliary dataset $Ref$. In particular, $Adv$ samples individual traces from $Ref$ to create training aggregates. These traces are also assumed to range over the same locations, and belong to a similar population as the traces aggregated in the release $\overline{A}^{\mathcal{U}}$, in order to properly train the membership classifier (Section 2.4). However, individual traces are known to be sensitive (de Montjoye et al., 2013), and are unlikely to be made available, particularly when the data is aggregated as a privacy measure. Furthermore, because $Ref$ must contain at least $m$ traces, this assumption is impractical for even moderately sized aggregates. Although LocMIA removes prior knowledge about the target trace $L^{u^*}$, it must assume knowledge of $u^*$’s friends’ traces to create a suitable proxy. More importantly, LocMIA still requires a large auxiliary dataset of individual traces.

In contrast, our Zero Auxiliary Knowledge threat model only requires that the adversary has some knowledge about the target’s location history. We emphasize that the adversary does not need to know the full trace $L^{u^*}$. Our threat model encompasses the case where only a few of the target’s visits are known to the adversary. For example, the adversary may infer some of $u^*$’s visits from social media activity or direct observation.
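To illustrate what partial knowledge of the target trace means, the sketch below keeps a random fraction of the visits in a full binary trace, so that the support of the result is contained in the support of the original. The function name `partial_trace`, the rounding rule, and the choice to keep at least one visit are assumptions for this example only.

```python
import random
from typing import List, Optional

Matrix = List[List[int]]


def partial_trace(full_trace: Matrix, fraction: float,
                  seed: Optional[int] = None) -> Matrix:
    """Keep a uniformly random `fraction` of the visits (1-entries) of a binary trace."""
    rng = random.Random(seed)
    visits = [(s, t) for s, row in enumerate(full_trace)
              for t, value in enumerate(row) if value == 1]
    # Assumption: round to the nearest count and keep at least one visit if any exist.
    n_known = max(1, round(fraction * len(visits))) if visits else 0
    known = rng.sample(visits, n_known)
    partial = [[0] * len(row) for row in full_trace]
    for s, t in known:
        partial[s][t] = 1
    return partial


# Toy full trace with 6 visits (made-up data); the adversary knows half of them.
demo_full = [[1, 0, 1, 1, 0], [0, 1, 1, 0, 1]]
demo_partial = partial_trace(demo_full, fraction=0.5, seed=0)
print(sum(map(sum, demo_partial)))  # 3
```

Every 1-entry of `demo_partial` is also a 1-entry of `demo_full`, which is the support condition stated in the threat model.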

3. Related Work

MIAs on Aggregate Location Data. MIAs on aggregate location data have been shown to be successful on multiple location datasets (Pyrgelis et al., 2017, 2020; Zhang et al., 2020), using a binary classifier to perform the inference task. The performance of MIAs on small aggregates ($<500$) is especially well-studied, as the influence of the target is easier to distinguish (Pyrgelis et al., 2017). For example, the KK MIA by Pyrgelis et al. (2017) achieved $AUC > 0.83$ when attacking size $100$ aggregates across two different mobility datasets. LocMIA (Zhang et al., 2020) is another MIA on aggregate location data, which removes prior knowledge about the target trace. Instead, LocMIA assumes access to social network information, and the traces of the target’s friends, in order to construct a proxy for the target’s real trace. However, both KK MIA and LocMIA crucially require the adversary to have access to a large auxiliary dataset to train the binary classifier. In contrast, our ZK MIA does not require any auxiliary dataset, and only requires partial information about the target trace (e.g. a random $10\%$ of their visits). ZK MIA therefore addresses the research gap of the MIA risk posed by a less knowledgeable attacker. The distinctions in prior knowledge are discussed in depth in Section 2.5. Our ZK MIA also features a novel approach, being the first MIA on aggregate location data to use synthetic trace generation.

Generation of Synthetic Location Traces. There are many techniques for generating synthetic location traces that capture realistic human mobility patterns (Kulkarni and Garbinato, 2017; Kulkarni et al., 2018; Ouyang et al., 2018; Karagiannis et al., 2007; Jahromi et al., 2016; Lee et al., 2009). However, since our ZK MIA requires generating suitable synthetic traces without using additional information, this heavily limits the scope of applicable techniques. RNNs, GANs, and copulas have been used to generate synthetic traces that collectively approximate a real mobility dataset (Kulkarni and Garbinato, 2017; Kulkarni et al., 2018; Ouyang et al., 2018). However, these techniques require real traces to train the model, which the ZK adversary does not have. Many of the state-of-the-art mobility models are also unsuitable because they simulate small-scale continuous trajectories (e.g. walks on campus) (Karagiannis et al., 2007; Lee et al., 2009). In contrast, location aggregates typically comprise discrete traces over a metropolitan region. We therefore identified a probabilistic unicity model by Farzanehfar et al. (2021), which requires only four statistical parameters to guide the synthetic generation. We demonstrate that we can non-trivially adapt this model for the ZK MIA. In particular, we develop methods to precisely estimate these parameters from the aggregates in order to produce realistic synthetic location aggregates.

MIAs with Reduced Auxiliary Data. Previous attempts have been made (Shokri et al., 2017; Salem et al., 2018; Truex et al., 2019; Yeom et al., 2018; Creţu et al., 2021; Guépin et al., 2023) to relax the standard assumption of an adversary’s access to an auxiliary dataset that has high statistical similarity with the attacked dataset, e.g. sampled from the same distribution (Shokri et al., 2017; Nasr et al., 2019). In the setting of machine learning (ML) models, where an MIA infers whether a record was a part of the ML model’s training set, Shokri et al. (2017) proposed an MIA without auxiliary data. Instead, they use synthetic data, which they generate using the ML model’s confidence scores. Similarly, Salem et al. (2018) trained an MIA using unrelated data, e.g., training on text data to attack an image model. These approaches require access to the ML model, and they train on features that are specific to ML models, such as the top-$k$ confidence scores, which may be shared across ML models pertaining to different types of data.

Figure 3. ZK MIA architecture: $Adv$ first creates synthetic traces, then uses them with the partial target trace to train the membership classifier, before predicting membership.

In contrast, the features used to train an MIA on aggregate location data explicitly depend on the specific regions, times, and population over which the aggregates are computed. In the context of synthetic generators, Guépin et al. (2023) performed an MIA against synthetic data using the released synthetic dataset as the auxiliary dataset. However, in the case of aggregate location data, the release cannot be directly used as a reference dataset, since it does not contain individual records.

4. Methodology

We dedicate Sections 4.1-4.3 to explaining the synthetic-based methodology of our Zero Auxiliary Knowledge MIA. Section 4.4 explains the paired sampling mechanism on training aggregates, which boosts the performance of ZK MIA and KK MIA, as shown in Section 6.4.

4.1. Zero Auxiliary Knowledge MIA Framework

We implement the Zero Auxiliary Knowledge MIA as a binary classifier. However, whereas $Adv$ uses their auxiliary dataset as the reference for creating training aggregates in KK MIA and LocMIA, in ZK MIA $Adv$ instead uses as reference the synthetic traces that they created from the released aggregate. Furthermore, if the full target trace $L^{u^*}$ is not known, $Adv$ may instead use a partial trace when creating the $IN$ training aggregates. Figure 3 illustrates ZK MIA’s overall attack architecture.

4.2. Generating Synthetic Traces from Aggregate Location Data

In order to generate synthetic traces for our Zero Auxiliary Knowledge MIA, we adapt a probabilistic mobility model that Farzanehfar et al. (2021) developed to reproduce unicity patterns in large populations. The model requires four statistical parameters, described below and illustrated in Figure 16 of the Appendix. Recall that for a discrete random variable $X$ taking values in $x_1, \ldots, x_N$, its probability mass function (p.m.f.) $\mathcal{P}: \{x_1, \ldots, x_N\} \to [0,1]$ maps each possible value to its corresponding probability.

  1. The marginal space distribution $\mathcal{P}_S: \mathcal{S} \to [0,1]$ is a p.m.f. such that $\mathcal{P}_S(s)$ is proportional to the number of visits to ROI $s$ by users in $\Omega$ across all epochs in $\mathcal{T}$.

  2. The marginal time distribution $\mathcal{P}_T: \mathcal{T} \to [0,1]$ is a p.m.f. such that $\mathcal{P}_T(t)$ is proportional to the number of visits during epoch $t$ by users in $\Omega$ across all ROIs in $\mathcal{S}$.

  3. The marginal activity distribution $\mathcal{P}_A$ models the total number of visits recorded within $\mathcal{S}$ during $\mathcal{T}$ by a user drawn from $\Omega$.

  4. The Delaunay triangulation, denoted $DT(\mathcal{S})$, is a triangulation with vertices corresponding to the set of positions (longitude and latitude) of ROIs in $\mathcal{S}$. $DT(\mathcal{S})$ has the property that no vertex lies inside the circumcircle of any triangle in $DT(\mathcal{S})$ (Delaunay et al., 1934).

We note that the Delaunay triangulation $DT(\mathcal{S})$ is determined by the locations of the ROIs. Since the locations of the ROIs are assumed to be known (Section 2.5), $DT(\mathcal{S})$ can be immediately obtained from the release. We explain how the other statistical inputs, the three marginal distributions, can be approximated from the released aggregate in Section 4.3.

We now describe our procedure, adapted from Farzanehfar et al. (2021), for generating synthetic traces using the four inputs. For each synthetic trace $L^{syn}$, we first sample the number of visits $n_{visits}$ from the activity marginal $\mathcal{P}_A$. This determines the number of nonzero entries in the $\mathcal{S} \times \mathcal{T}$ matrix $L^{syn}_{s,t}$. Second, we sample an origin ROI $s_0$ from the space marginal $\mathcal{P}_S$, and a connected sub-graph $C(s_0)$ from the Delaunay triangulation $DT(\mathcal{S})$, such that $s_0 \in C(s_0)$. $C(s_0)$ will correspond to the set of ROIs that may be visited in $L^{syn}$. This is done to emulate the natural tendency to move to and from the same proximate locations (e.g. home and work). Finally, we sample $n_{visits}$ spatiotemporal visits $(s,t)$ for which we set $L^{syn}_{s,t} = 1$. For each visit, $s$ is sampled from the space marginal $\mathcal{P}_S$ restricted to $C(s_0)$, and $t$ is sampled from the time marginal $\mathcal{P}_T$. All the sampling steps are independent.

We make two modifications to the original algorithm (Farzanehfar et al., 2021). First, to avoid over-saturating unpopular regions, we sample the origin ROI $s_0$ according to the space marginal $\mathcal{P}_S$ rather than uniformly. Second, to allow users to visit multiple ROIs within the same epoch, we sample the epochs with replacement rather than without replacement. Our procedure is summarized in Algorithm 1 of the Appendix.
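The generation procedure described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure of Algorithm 1: the function names, the `adjacency` input (neighbour lists assumed to be precomputed from $DT(\mathcal{S})$), the `subgraph_size` parameter, and the breadth-first rule used to grow $C(s_0)$ are all hypothetical. Because visits are drawn independently, two draws may coincide on the same $(s,t)$ entry, in which case the trace has slightly fewer than $n_{visits}$ nonzero entries.

```python
import random
from collections import deque


def sample_connected_subgraph(adjacency, origin, size, rng):
    """Grow a connected set of ROIs around `origin` (hypothetical rule:
    randomized breadth-first expansion until `size` ROIs are collected)."""
    chosen = {origin}
    frontier = deque([origin])
    while frontier and len(chosen) < size:
        node = frontier.popleft()
        neighbours = list(adjacency[node])
        rng.shuffle(neighbours)
        for neighbour in neighbours:
            if neighbour not in chosen and len(chosen) < size:
                chosen.add(neighbour)
                frontier.append(neighbour)
    return sorted(chosen)


def generate_trace(p_space, p_time, sample_activity, adjacency, subgraph_size, rng):
    """Return one synthetic trace as a |S| x |T| matrix of 0/1 entries."""
    n_rois, n_epochs = len(p_space), len(p_time)
    # Step 1: number of visits, drawn from the activity marginal.
    n_visits = sample_activity(rng)
    # Step 2: origin ROI drawn from the space marginal, then a connected
    # neighbourhood C(s0) taken from the triangulation's adjacency lists.
    origin = rng.choices(range(n_rois), weights=p_space)[0]
    candidate_rois = sample_connected_subgraph(adjacency, origin, subgraph_size, rng)
    local_weights = [p_space[s] for s in candidate_rois]
    # Step 3: independent (s, t) draws; epochs are sampled with replacement.
    trace = [[0] * n_epochs for _ in range(n_rois)]
    for _ in range(n_visits):
        s = rng.choices(candidate_rois, weights=local_weights)[0]
        t = rng.choices(range(n_epochs), weights=p_time)[0]
        trace[s][t] = 1
    return trace
```

Passing `sample_activity` as a callable keeps the sketch agnostic to how the activity marginal is modeled.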

4.3. Obtaining Accurate Marginals

In order to generate suitable synthetic traces for ZK MIA, $Adv$ must estimate the marginal distributions $\mathcal{P}_S, \mathcal{P}_T, \mathcal{P}_A$, computed over the full population $\Omega$, using the aggregate release $\overline{A}^{\mathcal{U}}$. In this section, we motivate and justify the techniques that we developed to obtain strong estimates $\widehat{\mathcal{P}}_S, \widehat{\mathcal{P}}_T, \widehat{\mathcal{P}}_A$. This task is especially challenging when privacy measures distort the aggregate data. We develop separate techniques to correct for bias in the case of SSC, and to correct for noise in the case of DP. The effects are shown in Figures 4 and 5 respectively. Algorithm 2 in the Appendix summarizes how we approximate all three marginals from $\overline{A}^{\mathcal{U}}$.

4.3.1. Estimating Space and Time Marginals

Figure 4. Log compression for SSC aggregates: SSC biases the estimate obtained from the aggregate by creating more extreme values. The true time marginal from the CDR dataset (plotted for the first week) is better approximated after the empirical estimate from the aggregate ($m=1000$, $k=1$) undergoes log compression $\log(1+\gamma x)$ with $\gamma$ chosen as in (8).

Suppose that the data collector releases the aggregate $\overline{A}^{\mathcal{U}}$, which may or may not have privacy measures. $Adv$ can directly compute the empirical space and time marginals, which we denote $\widehat{\mathcal{P}}_S^0$ and $\widehat{\mathcal{P}}_T^0$, from the released aggregate matrix $\overline{A}^{\mathcal{U}}$.

(5) $\displaystyle \widehat{\mathcal{P}}^{0}_{S}(s) = \frac{1}{\lVert \overline{A}^{\mathcal{U}} \rVert_{1}} \sum_{t=1}^{|\mathcal{T}|} \overline{A}^{\mathcal{U}}_{s,t}$
(6) $\displaystyle \widehat{\mathcal{P}}^{0}_{T}(t) = \frac{1}{\lVert \overline{A}^{\mathcal{U}} \rVert_{1}} \sum_{s=1}^{|\mathcal{S}|} \overline{A}^{\mathcal{U}}_{s,t}$
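Equations (5) and (6) amount to row and column sums of the released matrix, divided by its entrywise $L_1$ norm. As a minimal sketch, assuming the aggregate is stored as a list of rows indexed by ROI with one column per epoch (the function name and data layout are illustrative choices):

```python
def empirical_marginals(aggregate):
    """Empirical space and time marginals of an |S| x |T| aggregate matrix."""
    # Entrywise L1 norm of the released aggregate.
    l1_norm = sum(abs(count) for row in aggregate for count in row)
    if l1_norm == 0:
        raise ValueError("the aggregate contains no visits")
    # Eq. (5): sum over epochs for each ROI.
    p_space = [sum(row) / l1_norm for row in aggregate]
    # Eq. (6): sum over ROIs for each epoch.
    p_time = [sum(column) / l1_norm for column in zip(*aggregate)]
    return p_space, p_time
```

For a raw or suppressed aggregate all counts are nonnegative, so both outputs sum to one. If noisy counts can be negative, the outputs are no longer guaranteed to be valid probabilities and would need further processing.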

Raw Aggregate: In the case where the released aggregate provides the raw counts, i.e. $\overline{A}^{\mathcal{U}} = A^{\mathcal{U}}$, the empirical marginals tend to be highly accurate. An example is shown in Figure 13 in the Appendix. The accuracy of these estimates is intuitive because we expect the mobility patterns of an aggregation group $\mathcal{U}$ to resemble those of the population $\Omega$. Thus, we set $\widehat{\mathcal{P}}_{\_} = \widehat{\mathcal{P}}^{0}_{\_}$ if the released aggregate is unmodified. We use the subscript $\_$ to indicate generality for both the space $S$ and time $T$ marginals.

Suppressed Aggregate: However, if the data collector applies SSC with threshold $k$, i.e. $\overline{A}^{\mathcal{U}} = A_{SSC}^{\mathcal{U}}(k)$, then this will systematically bias the empirical marginal $\widehat{\mathcal{P}}_{\_}^{0}$, because popular ROIs and epochs are more likely to evade suppression. It is therefore easy to see that suppression will reduce the observed probabilities of less popular entries and boost the probabilities of more popular entries.

To correct the bias, we flatten the empirical estimate $\widehat{\mathcal{P}}_{\_}^{0}$ by boosting low frequency counts and reducing high frequency counts. Upon the insight that $\widehat{\mathcal{P}}_{\_}^{0}$ can be likened to an audio signal, we adapt the logarithmic compression technique used to reduce dynamic range (Müller, 2015; Miguel Alonso and Richard, 2004)

(7) $\displaystyle x \to \log(1 + \gamma x), \quad x \geq 0,$

where the scaling factor $\gamma \geq 0$ regulates the compression level (Müller, 2015). In music signal processing, $x \geq 0$ corresponds to the intensity of a given frequency. In our case, $x \geq 0$ corresponds to probabilities within the empirical marginal $\widehat{\mathcal{P}}_{\_}^{0}$. We choose

(8) $\displaystyle \gamma(\widehat{\mathcal{P}}_{\_}^{0}) = \max\limits_{x \in \widehat{\mathcal{P}}_{\_}^{0} : x > 0} \frac{1}{x}$

to automatically parameterize $\gamma$ based on the smallest observed non-zero probability. We therefore estimate $\widehat{\mathcal{P}}_{\_} = \log(1 + \gamma\,\widehat{\mathcal{P}}_{\_}^{0})$ with $\gamma = \gamma(\widehat{\mathcal{P}}_{\_}^{0})$, where the logarithm is applied to each probability and we omit the normalization constant.

Figure 5. Power transformation for DP aggregates: DP noise compresses the estimate obtained from the aggregate. The true space marginal from the CDR dataset (organized by popularity) is better approximated after the empirical estimate from the aggregate ($m=1000$, $\frac{\Delta}{\varepsilon}=1$) undergoes power transformation $x^p$ with $p$ selected according to Algorithm 4.

We do not argue that our choice of method and parameter γ𝛾\gammaitalic_γ is optimal. However, the debiasing substantially improves the estimate, as shown in Figure 4, and it is done without additional information.
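The debiasing step for suppressed aggregates can be illustrated as follows. This is a minimal sketch, assuming the empirical marginal is given as a list of nonnegative probabilities with at least one positive entry. The function name is hypothetical, and the final renormalization is made explicit here although the text omits the constant.

```python
import math


def log_compress(marginal):
    """Flatten an empirical marginal with log(1 + gamma * x), then renormalize."""
    positive = [x for x in marginal if x > 0]
    # Eq. (8): gamma is the reciprocal of the smallest non-zero probability.
    gamma = max(1.0 / x for x in positive)
    # Eq. (7), applied to each probability.
    compressed = [math.log(1.0 + gamma * x) for x in marginal]
    total = sum(compressed)
    return [value / total for value in compressed]
```

For example, the marginal `[0.0, 0.1, 0.9]` gives $\gamma = 10$, and after renormalization the low probability rises above $0.1$ while the high probability falls below $0.9$. Entries that were fully suppressed stay at zero.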

DP Aggregate: If $\varepsilon$-DP noise is added to each entry in the aggregate release, i.e. $\overline{A}^{\mathcal{U}}_{s,t} = A_{DP}^{\mathcal{U}}(\varepsilon)_{s,t} = A^{\mathcal{U}}_{s,t} + Lap(\frac{\Delta}{\varepsilon})$, then the noise will overpower the signal in the computation of the empirical marginals (Eqs. 5 and 6). This follows from the fact that location aggregates are high-dimensional sparse matrices (Pyrgelis et al., 2020). Therefore, in contrast to the SSC case, the probabilities within the empirical marginals are compressed, since each probability is characterized mostly by thousands of independent noise samples. This effect is visualized for several different noise scales in Figure 17(a) of the Appendix. We also prove in Theorem A.4 of the Appendix that, under strong sparsity assumptions, $\widehat{\mathcal{P}}_S^0$ converges to the discrete uniform distribution on $\mathcal{S}$ as the number of epochs in the observation period $|\mathcal{T}| \to \infty$.

To correct the low variance of the observed probabilities, we propose the power transformation $x^p$ with $p > 1$, followed by renormalization. It is easy to see that this will increase the variance since the probabilities are in $[0,1]$. Automatically calibrating the power $p > 1$ is a delicate matter. To do so, we start with $p = 1$ and increase $p$ gradually until the transformed distribution achieves the target variance $\sigma^2$. To set this target without any prior knowledge, we consider the case where each probability is randomly drawn. Equivalently, each probability $x_i$ is sampled from $Unif(0,1)$, and then renormalized so that the total probability is $1$. Let $p_i$ denote the probabilities after normalization, and $\bar{p}$ be the mean of the normalized probabilities. For the space marginal, the variance is

(9) $\displaystyle \sigma^2 = \frac{\sum_{i=1}^{|S|} (p_i - \bar{p})^2}{|S|}.$

Figure 6. Exponential estimate for activity marginal: Once the mean number of visits is approximated, the activity marginal from the CDR dataset is best approximated by a lognormal distribution with an optimal skew parameter. However, it can also be approximated without additional parameters by an exponential distribution.

For the variance from the time marginal, we replace $|S|$ with $|T|$ in the above equation. The algorithm for selecting $p$ is given in Algorithm 4 of the Appendix. Figure 5 shows that our automatically parameterized power transformation significantly improves the estimate. In contrast, the exponential transformation $(e^x - 1)/\gamma$, which is the inverse of the log compression function $\log(1+\gamma x)$, fails to denoise because the inverse is inapplicable after considering normalization.
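The calibration of $p$ can be sketched as a simple search. The sketch below is not Algorithm 4, whose details are given in the Appendix. It assumes a fixed step size, a cap on $p$, clipping of negative noisy probabilities to zero, and a Monte Carlo average over uniform draws as the target variance. All of these choices, and the function names, are hypothetical.

```python
import random


def variance(probs):
    """Population variance of a list of probabilities, as in Eq. (9)."""
    mean = sum(probs) / len(probs)
    return sum((p - mean) ** 2 for p in probs) / len(probs)


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def target_variance(n, rng, draws=200):
    """Average variance of n Unif(0, 1) draws renormalized to sum to one."""
    samples = (normalize([rng.random() for _ in range(n)]) for _ in range(draws))
    return sum(variance(sample) for sample in samples) / draws


def power_transform(marginal, target, step=0.05, max_power=50.0):
    """Increase p from 1 until the renormalized x**p reaches the target variance."""
    clipped = [max(x, 0.0) for x in marginal]  # assumption: drop negative mass
    power = 1.0
    current = normalize(clipped)
    while variance(current) < target and power < max_power:
        power += step
        current = normalize([x ** power for x in clipped])
    return current, power
```

Raising to a power above one preserves the ranking of the entries while spreading them apart, which is why the search only needs to move $p$ in one direction.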

4.3.2. Estimating the Activity Marginal

The released aggregate $\overline{A}^{\mathcal{U}}$ does not leak granular information about the activity marginal $\mathcal{P}_A$. However, $Adv$ may obtain the empirical mean number of visits per user according to the released aggregate,

(10) $\displaystyle \widehat{\mu}_0 = \frac{1}{m} \sum_{s,t \in \mathcal{S} \times \mathcal{T}} \overline{A}^{\mathcal{U}}_{s,t}.$

If the aggregate is raw, then we expect $\widehat{\mu}_0$ to be a strong estimate due to well-known regularity results about population-wide mobility activity (Schneider et al., 2013; Seshadri et al., 2008; Farzanehfar et al., 2021). In this case, $Adv$ sets $\hat{\mu} = \widehat{\mu}_0$.

However, if either SSC or $\varepsilon$-DP is applied, then the estimate $\widehat{\mu}_0$ would fail. Algorithm 3 in the Appendix describes how $Adv$ can obtain a better estimate $\hat{\mu}$. Given an aggregate release $\overline{A}^{\mathcal{U}}$ of size $m$, $Adv$ can use $\widehat{\mathcal{P}}_S$ and $\widehat{\mathcal{P}}_T$ to iteratively improve their estimate, starting with $\widehat{\mu}_0$. Given guess $\widehat{\mu}_n$, $Adv$ creates a synthetic aggregate by generating $m$ synthetic traces, parameterized by $\widehat{\mathcal{P}}_S$, $\widehat{\mathcal{P}}_T$ and $\widehat{\mathcal{P}_A} \sim \widehat{\mu}_n$. $Adv$ may then apply the same privacy measures that were applied on $\overline{A}^{\mathcal{U}}$. $\widehat{\mu}_{n+1}$ is obtained by increasing or decreasing $\widehat{\mu}_n$ relative to the difference in counts with $\overline{A}^{\mathcal{U}}$.
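The iterative refinement of the mean can be sketched as a simulate-and-compare loop. This is a simplified illustration and not Algorithm 3. It assumes a multiplicative update based on the ratio of observed to simulated total counts, draws visits without the Delaunay restriction, and gives every trace close to $\widehat{\mu}_n$ visits through randomized rounding. The `privatize` callable stands for whichever privacy measure was applied to the release. All names are hypothetical.

```python
import random


def simulate_released_total(mu, m, p_space, p_time, privatize, rng):
    """Total count of a privatized synthetic aggregate built from m traces."""
    n_rois, n_epochs = len(p_space), len(p_time)
    aggregate = [[0] * n_epochs for _ in range(n_rois)]
    for _user in range(m):
        # Randomized rounding so that each trace has `mu` visits on average.
        n_visits = int(mu) + (1 if rng.random() < mu - int(mu) else 0)
        visits = set()
        for _visit in range(n_visits):
            s = rng.choices(range(n_rois), weights=p_space)[0]
            t = rng.choices(range(n_epochs), weights=p_time)[0]
            visits.add((s, t))
        for s, t in visits:
            aggregate[s][t] += 1
    return sum(sum(row) for row in privatize(aggregate))


def estimate_mean_visits(released, m, p_space, p_time, privatize, rng, n_iter=10):
    """Refine the mean number of visits per user by matching total counts."""
    observed = sum(sum(row) for row in released)
    mu = observed / m  # initial guess from Eq. (10)
    for _ in range(n_iter):
        simulated = simulate_released_total(mu, m, p_space, p_time, privatize, rng)
        # Assumed update rule: rescale by the observed-to-simulated ratio.
        mu = mu * observed / simulated if simulated > 0 else 2 * mu
    return mu
```

With an identity `privatize`, the loop stays close to the initial guess. With suppression, the simulated total falls short of the observed one until the guess has grown enough to compensate.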

Once $\hat{\mu}$ is obtained, $Adv$ can simply pick $\widehat{\mathcal{P}_A} \sim \widehat{\mu}$, such that each synthetic trace has $\hat{\mu}$ visits. However, it is well known that human mobility activity follows a heavy-tailed distribution, i.e., a heavier tail than the exponential distribution. The best approximations are lognormal, beta, or power-law distributions (Farzanehfar et al., 2021; Schneider et al., 2013; Seshadri et al., 2008).

It would be reasonable for $Adv$ to use a heavy-tailed distribution with mean $\hat{\mu}$, but these distributions require a second parameter, e.g. skewness, to determine the distribution shape. $Adv$ can use well-known parameters from other cities’ datasets to complete the estimate (Schneider et al., 2013; Seshadri et al., 2008), but to ensure that $Adv$ does not use additional knowledge, we assume that they use the sub-optimal estimate $\hat{\mathcal{P}}_A \sim Exp(\hat{\mu})$, as shown in Figure 6.
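Sampling from the exponential activity estimate needs only the mean. The sketch below reads $Exp(\hat{\mu})$ as an exponential distribution with mean $\hat{\mu}$, which is an interpretation of the notation. The rounding to a whole number of visits and the floor of one visit per trace are additional assumptions, as is the function name.

```python
import random


def sample_activity(mean_visits, rng):
    """Draw a visit count from an exponential distribution with the given mean."""
    # random.expovariate takes the rate, i.e. the reciprocal of the mean.
    draw = rng.expovariate(1.0 / mean_visits)
    # Assumed discretization: round to an integer and keep at least one visit.
    return max(1, round(draw))
```

The floor of one visit slightly inflates the sample mean when $\hat{\mu}$ is small, so a different discretization may be preferable in that regime.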

4.4. Paired Sampling for Training

When MIAs target high-dimensional aggregate data, such as location data, the membership classifier must handle noise arising from thousands of entries that are unrelated to the target record. For example, many of the IN training aggregates may coincidentally have high counts in entries that are absent in the target record. This would skew the decision boundary of the membership classifier, which may lead to false positives when testing. We may similarly obtain false negatives due to spurious patterns within the OUT training aggregates. These challenges are compounded by the implementation of privacy measures. For example, $\varepsilon$-DP would add noise to each entry. Given the dimensionality and nature of the aggregate data, spurious patterns will likely skew the decision boundary, even when hundreds or thousands of training aggregates are sampled. Given a fixed number of training samples, we demonstrate that the way in which the training set is sampled strongly influences the performance of the MIA. In particular, the sampling technique can guide the convergence of the decision boundary in order to prevent misclassification due to noise.

To the best of our knowledge, all previous MIAs on aggregate location data sampled their training set via independent random sampling (Pyrgelis et al., 2017; Oehmichen et al., 2019; Zhang et al., 2020; Pyrgelis et al., 2020). Training aggregates are created by independently sampling groups of $m$ users from the population $\Omega$ and labeling them according to the presence of the target $u^*$.

On the one hand, independent sampling discourages overfitting to the training data by exposing the classifier to a wide variation of samples. On the other hand, independent sampling does nothing to prevent spurious patterns from distorting the decision boundary.

We propose a paired sampling technique to guide the convergence of the decision boundary. The idea is to use sampling to help the classifier identify the differential impact of the target record at the aggregate level. Paired sampling independently samples groups of $m-1$ users from $\Omega \setminus \{u^*\}$. Then, an IN sample is created by adding $u^*$ as the group’s last member, and an OUT sample is created by adding another randomly selected user. The training set is therefore characterized by a set of IN/OUT pairs, which differ in exactly one record (the target’s). If noise is added to aggregates prior to release, then $Adv$ must inject the same noise sample $\varepsilon$ into both aggregates of each pair, $A^{\mathcal{U}^{IN}}$ and $A^{\mathcal{U}^{OUT}}$, to ensure that the target’s differential impact is preserved between each IN/OUT pair.
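The construction of IN/OUT pairs can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: users are represented by hashable identifiers, traces by sets of flattened cell indices, and the helper names `paired_groups`, `aggregate`, and `release_pair` are hypothetical.

```python
import random


def paired_groups(population, target, m, n_pairs, seed=0):
    """Build IN/OUT user groups that differ only in their last member."""
    rng = random.Random(seed)
    others = [u for u in population if u != target]
    pairs = []
    for _ in range(n_pairs):
        chosen = rng.sample(others, m)  # m - 1 shared users plus one stand-in
        base, stand_in = chosen[:-1], chosen[-1]
        pairs.append((base + [target], base + [stand_in]))
    return pairs


def aggregate(group, traces, n_cells):
    """Count visits per flattened (ROI, epoch) cell over a group of users."""
    counts = [0.0] * n_cells
    for user in group:
        for cell in traces[user]:
            counts[cell] += 1.0
    return counts


def release_pair(in_group, out_group, traces, n_cells, noise):
    """Add the SAME noise vector to both aggregates of a pair."""
    a_in = aggregate(in_group, traces, n_cells)
    a_out = aggregate(out_group, traces, n_cells)
    return ([a + e for a, e in zip(a_in, noise)],
            [a + e for a, e in zip(a_out, noise)])
```

Because the noise vector is shared, the entry-wise difference between the two noisy aggregates equals the difference between the target’s trace and the stand-in’s trace, which is the signal the classifier is meant to pick up.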

Figure 7. Mean AUC scores with standard error for ZK and KK on size $1000$ aggregates with various suppression thresholds. Panels: (a) CDR dataset; (b) Milan.

Paired sampling therefore actively encourages the membership decision boundary to be formed based on relevant criteria related to the target. It also discourages spurious decision boundaries because of the high degree of similarity between IN/OUT pairs. An extreme value in an aggregate entry from an IN sample will be matched with a similar value from its paired OUT sample, with high probability. However, we note that using paired sampling effectively halves the training variation compared to independent sampling. Our experiments in Section 6.4 demonstrate that paired sampling outperforms independent sampling across all tested settings of $\varepsilon$-DP noise addition. Hence, guiding the decision boundary towards relevant membership criteria often takes precedence over maximizing training variation. For completeness, we note that while we developed, studied, and named paired sampling independently, we later found that a similar idea was used in Bauer and Bindschaedler (2020). To the best of our knowledge, however, it had not been compared to independent sampling, nor used elsewhere in the literature.

5. Experimental Setup

To evaluate the efficacy of our ZK MIA, we compare it against the state-of-the-art Knock-Knock (KK) MIA (Pyrgelis et al., 2017, 2020) using aggregated location data from two different datasets.

5.1. Datasets.

In this section, we describe the two location datasets used for evaluating the MIAs, and discuss ethical considerations of the data collection and usage.

CDR: The first dataset, which we refer to as “CDR”, is a private dataset, shared with us by Flowminder (flo, 2024) for the purpose of this research. The raw dataset comprises timestamped and geo-tagged call records of approximately $11{,}000$ mobile phone users within a Latin American metropolitan area. The observation period is June 2021, with epochs defined by the $720$ hourly timeslots. The ROIs are defined by the service regions of approximately $500$ cellular antenna towers within the metropolitan area, which spans $\sim 150\ \text{km}^2$. The users were selected such that they registered at least one visit per week, to omit users who changed SIM cards, and such that the majority of their visits are within the region, to ensure that they are residents. $50$ target users for the MIAs were randomly selected by Flowminder. A histogram of the number of visits over the target traces is plotted in Figure 14(a) in the Appendix.

Milan: The second dataset is the Milan Social Pulse dataset (SpazioDati and di Milano, 2015), made publicly available as part of the Telecom Italia Big Data Challenge (Barlacchi et al., 2015). This dataset comprises timestamped and geo-tagged tweets from $4840$ mobile phone users within the Milano region. The ROIs are defined by a grid of $100$ points, each with an approximate area of $256\ \text{km}^2$. We consider the location data from the first week of data, yielding $168$ hourly epochs. We do not delete any users from the dataset prior to aggregation. We randomly select $50$ targets among users who tweeted at least $10$ times during the observation period. A histogram of the number of visits over the target traces is plotted in Figure 14(b) in the Appendix.

Ethical Considerations: Because of the sensitivity of location data, we did not access raw individual-level data, and instead collaborated with Flowminder (FM) to develop a privacy-preserving data-sharing pipeline for the purpose of this research (de Montjoye et al., 2018). More specifically, data sharing was restricted to pre-computed aggregate matrices (labeled according to target membership, computed by FM on the data provider server) and $50$ target traces randomly chosen by FM. To further mitigate the privacy risk, the $\sim 500$ ROI and $720$ epoch indices were randomly permuted in the shared aggregate and target trace matrices, according to a mapping known only by FM. This random permutation relabeled the space and time indices, enabling us to test the MIAs without knowing the true times or locations. The graphs of the marginal statistics (see Figures 4, 5, 6) were plotted by FM and shared with us. All the data shared by FM with us is subject to a research contract between FM and our institution and was kept on our segregated server. The Milan dataset, derived from geo-tagged tweets, remains publicly available, and was only used for the purpose of testing the MIAs.

5.2. MIA Implementation

We perform a fair comparison between KK MIA and ZK MIA by training the binary membership classifier using the same parameters and architecture. This also helps us isolate the effect of removing auxiliary data on performance. We use a Logistic Regression binary classifier with default hyperparameters and $L_1$ regularization, implemented with sklearn. The number of training groups $n_{train} = 400$ matches previous implementations (Pyrgelis et al., 2017, 2020), and the groups are selected using paired sampling, unless specified otherwise. We additionally fine-tune the decision boundary using $n_{val} = 100$ balanced, independently sampled validation groups. We flatten the aggregates and feed them directly into the classifier as a vector, without any processing, such as PCA or feature extraction. Finally, as done in Pyrgelis et al. (2017), $Adv$ applies the same privacy measures to training and validation aggregates, if the released aggregate is privacy-aware.

Knock-Knock. To implement the Knock-Knock MIA, we provide $Adv$ with a reference group $Ref$ of $5000$ real user traces (including the target trace $L^{u^*}$) when attacking the larger CDR dataset. We set $|Ref| = 2500$ for the Milan dataset. We note that this significantly surpasses previous reference sizes ($|Ref| = 1100$) implemented by Pyrgelis et al. (2017).

Figure 8. Mean AUC scores with standard error for ZK and KK on size $1000$ aggregates with various privacy units and budgets $\varepsilon$. Panels: (a) Event level DP: CDR dataset; (b) User-day level DP: CDR dataset; (c) Event level DP: Milan; (d) User-day level DP: Milan.

Zero Auxiliary Knowledge. Our Zero Auxiliary Knowledge MIA is structurally identical to KK (PS), but ZK has a synthetic reference $Ref$, rather than a set of real traces. This reference is created according to the methodology detailed in Section 4. Setting $|Ref| = 5000$ allows for a direct comparison of the functionality of synthetic traces to real ones for the purpose of the MIA. However, we remark that capping the number of synthetic traces at $5000$ is an artificial restriction, since $Adv$ may generate arbitrarily more synthetic traces and achieve better performance, as shown in Figure 12 of the Appendix. By default, we assume full access to the target trace $L^{u^*}$. This assumption is relaxed for the experiment in Section 6.3.

5.3. Evaluation

Default Experimental Parameters. We randomly select $n_{targets} = 50$ targets from the dataset for evaluation. These targets are re-used for each experiment. Furthermore, in each experiment, $n_{test} = 100$ independently sampled balanced test aggregates are created for each target, and shared across both MIAs to ensure that the test sets are identical. As done in Pyrgelis et al. (2017), the test aggregate user groups are sampled from a set of users disjoint from the Knock-Knock adversarial reference $Ref$, plus the target $u^*$. This corresponds to roughly $11{,}000 - 5000 = 6000$ user traces for CDR and $5000 - 2500 = 2500$ user traces for Milan.

We perform all experiments on size $m = 1000$ aggregates, which matches the largest aggregate size tested in Pyrgelis et al. (2017) and exceeds the largest aggregate size tested in Zhang et al. (2020) (800). We do not vary $m$ since the relationship between aggregate group size and MIA effectiveness has already been documented extensively (Pyrgelis et al., 2017, 2020; Zhang et al., 2020). We perform the Knock-Knock and Zero Auxiliary Knowledge MIAs in this setting under different privacy measures. We also perform experiments in which $Adv$ only knows a fraction $p_{u^*}$ of the target trace $L^{u^*}$, but by default, we assume that they know the full trace, i.e. $p_{u^*} = 1$.

Evaluation Metrics. In the past, MIAs on aggregate location data have been primarily evaluated using the area under the ROC curve (AUC score) as a metric (Pyrgelis et al., 2017; Zhang et al., 2020). For this reason, and because of its suitability for assessing the strength of a binary classifier, we use the mean AUC over all targets as our primary metric. However, we also include the mean attack accuracy over all targets as a secondary metric, listing these scores in Section C of the Appendix.
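For concreteness, the primary metric can be computed as in the sketch below. It uses the rank-based (Mann-Whitney) definition of AUC in plain Python; the function names `auc` and `mean_auc` and the input layout (one list of classifier scores and one list of 0/1 membership labels per target) are assumptions for illustration, not the evaluation code used in the experiments.

```python
def auc(scores, labels):
    """AUC as the probability that an IN score outranks an OUT score."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between an IN score and an OUT score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(per_target):
    """Average the per-target AUC over a list of (scores, labels) pairs."""
    return sum(auc(s, y) for s, y in per_target) / len(per_target)
```

Averaging per-target AUCs, rather than pooling all test aggregates, keeps each target’s classifier evaluated on its own score scale.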

6. Experimental Results

6.1. Against Suppression of Small Counts

We first compare the performances of KK and ZK on aggregates whose counts have been suppressed according to threshold $k$. We apply SSC with thresholds $k \in \{0, 1, 2, 3, 4, 5\}$ on test aggregates of size $m = 1000$. We note that the case $k = 0$ corresponds to releasing a raw aggregate. In this special case, there is a trivial rule that is sufficient to determine non-membership.

Rule ($k = 0$): If $u^*$ visits ROI $s$ during epoch $t$ and no users in the aggregation group $\mathcal{U}$ visit $(s, t)$, then $u^*$ cannot be in $\mathcal{U}$, i.e.,

\[
\exists s \in \mathcal{S}, t \in \mathcal{T}: A^{\mathcal{U}}_{s,t} = 0 \wedge L^{u^*}_{s,t} = 1 \implies u^* \notin \mathcal{U}
\]
Figure 9. Mean AUC scores with standard error for ZK and KK on size $1000$ aggregates with $\varepsilon = 1$ event DP protection and $k = 1$ suppression for varying fractions of the known target trace. Panels: (a) CDR dataset; (b) Milan.

We therefore incorporate this rule when $k = 0$, such that both MIAs first check if the released aggregate elicits the contradiction. If so, we immediately predict OUT. Otherwise, we train, validate, and test the classifier as usual. The rule is invalid for $k > 0$, since it would predict OUT whenever $u^*$ has a visit to a suppressed entry.
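The check for $k = 0$ is simple enough to state in code. In this sketch the aggregate is assumed to be a dictionary from (ROI, epoch) cells to raw counts, the target trace a set of visited cells, and `classifier` any callable returning "IN" or "OUT"; these representations and the function names are illustrative choices.

```python
def contradicts_membership(aggregate, target_trace):
    """True if the target visits a cell whose raw aggregate count is zero."""
    return any(aggregate.get(cell, 0) == 0 for cell in target_trace)


def predict_membership(aggregate, target_trace, classifier):
    """Apply the k = 0 rule first, then fall back to the trained classifier."""
    if contradicts_membership(aggregate, target_trace):
        return "OUT"
    return classifier(aggregate)
```

The rule only ever yields OUT: the absence of a contradiction does not prove membership, so the classifier is still needed in that case.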

Results. Figure 7 shows that membership inference is a trivial task when applied to raw aggregates ($k = 0$). Both ZK and KK achieve near perfect AUC ($\geq 0.99$) on both datasets. This implies that aggregation is not an effective privacy mechanism in itself to protect high-dimensional location data from MIAs by weak or strong adversaries.

The results for $k > 0$ reveal two general patterns. First, ZK compares closely with KK across different levels of SSC. On the CDR dataset, ZK’s AUC stays within $0.02$ of KK for each $k$. We observe slightly worse results on the Milan dataset, but ZK still stays within $0.9$ AUC of KK for each $k$. Second, there is a monotonic decrease in performance when the threshold $k$ is increased. For $k = 5$, the AUC is always less than $0.55$. This follows from the fact that suppression reduces the amount of available information, and $99\%$ of all nonzero entries are suppressed by $k = 5$, as shown in Figure 15 of the Appendix. Therefore, although SSC eventually mitigates the MIAs, it may come at the cost of destroying virtually all utility.

Both MIAs perform worse on the Milan dataset compared to the CDR dataset. This is expected because we observe $\sim 6$ times less data per target in the Milan dataset, as shown in Figure 14. Indeed, people generally tweet less often than they text and call. ZK MIA is more affected by dataset sparsity, given its dependence on marginal distribution estimates, which become less reliable in sparser datasets.

6.2. Against $\varepsilon$-DP Noise Addition

Informed by practical applications of DP (Desfontaines, 2021), we consider event level and user-day level to be the privacy units, and we vary the privacy budget $\varepsilon \in \{0.1, 0.5, 1.0, 5.0, 10.0\}$ for each unit.

Event Level DP. An event is equivalent to a visit by a user to $(s, t) \in \mathcal{S} \times \mathcal{T}$. To offer privacy protection over an event, we set the global sensitivity $\Delta = 1$. $\varepsilon$-DP is then ensured by adding $Lap(\frac{1}{\varepsilon})$ noise to each count in the aggregate matrix.

User-day Level DP. In order to protect each user’s daily contributions without adding excessive noise, it is common to restrict user contributions prior to aggregation to achieve a smaller global sensitivity $\Delta$ (Aktay et al., 2020; Herdağdelen and Dow, 2021). We analysed daily activity distributions, and had them preprocessed such that a user may only contribute up to $\Delta = 20$ visits in any given day for CDR, and $\Delta = 10$ visits in any given day for Milan. $\varepsilon$-DP at the user-day level is then ensured by adding $Lap(\frac{\Delta}{\varepsilon})$ noise to each count in the aggregate matrix.
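Both mechanisms reduce to bounding contributions and adding Laplace noise with scale $\frac{\Delta}{\varepsilon}$, as in the sketch below. The helper names are hypothetical, and the choice of which visits to keep when a user exceeds the daily cap (here, the earliest ones) is an assumption, since the text does not specify it.

```python
import random


def laplace(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def bound_daily_visits(trace, max_per_day, epochs_per_day=24):
    """Keep at most max_per_day visits per day (earliest visits are kept)."""
    kept, per_day = [], {}
    for roi, epoch in sorted(trace, key=lambda visit: visit[1]):
        day = epoch // epochs_per_day  # hourly epochs grouped into days
        if per_day.get(day, 0) < max_per_day:
            per_day[day] = per_day.get(day, 0) + 1
            kept.append((roi, epoch))
    return kept


def add_dp_noise(counts, epsilon, sensitivity, rng):
    """Add Laplace noise with scale sensitivity / epsilon to every count."""
    scale = sensitivity / epsilon
    return [c + laplace(scale, rng) for c in counts]
```

Event level DP corresponds to `sensitivity=1` with no bounding step, whereas user-day level DP first applies `bound_daily_visits` with the cap $\Delta$ and then uses `sensitivity=Δ`.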

Results. Figure 8 shows that ZK MIA matches KK MIA across all tested DP settings. Indeed, ZK maintained a mean AUC within $0.06$ of KK (PS) across each of the $10$ privacy settings for both datasets. KK and ZK notably succeeded for many of the tested privacy budgets $\varepsilon$, particularly in the event level setting. Indeed, we observed $AUC \geq 0.9$ for both MIAs whenever the noise scale $\frac{\Delta}{\varepsilon} \leq 2$ for the CDR dataset, and $\frac{\Delta}{\varepsilon} \leq 1$ for the Milan dataset. These settings are in line with many real-life applications (Kohli et al., 2023; Desfontaines, 2021). Conversely, user-day level DP with privacy budget $\varepsilon \leq 1.0$ effectively reduced both MIAs to an AUC below $0.55$. We discuss the significance of these results with respect to practical mitigations in Section 7.
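To make the noise-scale thresholds concrete, the following snippet lists which of the tested budgets satisfy $\frac{\Delta}{\varepsilon} \leq c$ for a given sensitivity. It only restates the thresholds reported above and adds no new measurement; the function name `budgets_within_scale` is illustrative.

```python
TESTED_BUDGETS = [0.1, 0.5, 1.0, 5.0, 10.0]


def budgets_within_scale(sensitivity, max_scale, budgets=TESTED_BUDGETS):
    """Return the tested budgets whose Laplace scale is at most max_scale."""
    return [eps for eps in budgets if sensitivity / eps <= max_scale]


# CDR (scale <= 2): event level uses sensitivity 1, user-day level uses 20.
cdr_event = budgets_within_scale(1, 2)
cdr_user_day = budgets_within_scale(20, 2)
# Milan (scale <= 1): event level uses sensitivity 1, user-day level uses 10.
milan_event = budgets_within_scale(1, 1)
milan_user_day = budgets_within_scale(10, 1)
```

Under these thresholds, only the largest tested budget ($\varepsilon = 10$) falls in the high-AUC regime at the user-day level, whereas most tested budgets do at the event level.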

6.3. Partial Knowledge of the Target Trace

We now relax the assumption that $Adv$ knows the full target trace $L^{u^*}$. This is in line with our ZK threat model, and we extend KK MIA to this setting in order to compare the two methods. To simulate a weaker adversary, we suppose that $Adv$ only knows a subset of the target $u^*$’s visits. We assume that $Adv$ only knows a random fraction $p_{u^*} \in \{0.1, 0.25, 0.5, 0.75, 1.0\}$ of the trace $L^{u^*}$. The number of retained visits is rounded up to the next integer to prevent cases where $Adv$ knows $0$ visits. For example, if a target has $4$ visits and $p_{u^*} = 0.1$, then this would correspond to $Adv$ knowing $1$ visit. This partial trace is used instead of the full trace when creating IN training and validation aggregates. The full trace $L^{u^*}$ is still used for IN test aggregates.
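The subsampling of the target trace can be sketched as follows. The function name `known_visits` and the representation of a trace as a collection of (ROI, epoch) tuples are assumptions; the rounding-up behaviour follows the description above.

```python
import math
import random


def known_visits(trace, fraction, seed=0):
    """Sample the visits known to the adversary, rounding the count up."""
    rng = random.Random(seed)
    visits = sorted(trace)
    # round() guards against float artefacts such as 0.1 * 30 = 3.0000000000000004
    n_known = math.ceil(round(fraction * len(visits), 9))
    return rng.sample(visits, n_known)
```

For a target with $4$ visits and a fraction of $0.1$, this returns exactly one visit, matching the example in the text.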

We perform this experiment in the setting where the data collector applies event-level DP with $\varepsilon = 1$, followed by $k = 1$ suppression. We choose this setting for two reasons. First, to study the degradation of the MIAs with decreasing information about the target, we choose a setting where $Adv$ would succeed given the full target trace. Previous experiments revealed that $\varepsilon = 1$ DP at the event level and $k = 1$ suppression were not effective in preventing the MIAs by themselves, as the MIAs achieved AUC $> 0.97$ on the CDR dataset and AUC $> 0.9$ for Milan. Second, we combine the two defense mechanisms to see if suppression has an observable mitigation effect when applied following DP noise addition. By the post-processing property of DP, this would not alter the theoretical performance bound. However, zeroing all counts $\leq 1$ might add a layer of complexity that affects MIAs in practice.
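The suppression step applied after noise addition can be written in one line. This sketch assumes that SSC with threshold $k$ zeroes every entry whose count is at most $k$, consistent with the description of $k = 1$ above; the function name is illustrative.

```python
def suppress_small_counts(counts, k):
    """Zero every entry whose (possibly noisy) count is at most k."""
    return [c if c > k else 0 for c in counts]
```

Since the function only reads the already-noised counts, it is a post-processing step and cannot weaken the DP guarantee of the noisy aggregate.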

Results. First, we note that applying $k = 1$ SSC on top of $\varepsilon = 1.0$ event level DP has an insignificant effect on the MIAs. For the full target trace, we continue to observe AUC $> 0.97$ on the CDR dataset and AUC $> 0.9$ on the Milan dataset. The only MIA with a noticeable decline was KK MIA on the Milan dataset, which dropped from $0.97$ AUC to $0.9$ AUC.

Although decreasing the fraction of the target trace known to the adversary from $1$ to $0.1$ decreases the performance of the MIAs, the corresponding degradation is relatively gradual. All AUCs are captured within a range of $0.13$ on the CDR dataset, and within a range of $0.22$ on the Milan dataset. Even the lowest observed AUC by ZK MIA on the CDR dataset ($0.84$ when $10\%$ of the target trace is known) achieves high discrimination. We note that the $50$ targets in both datasets have a wide variation in trace size, as shown in Figure 14 of the Appendix. For some targets, $Adv$ will only know one of the target’s visits, whereas for others, they will still know dozens and be able to infer membership easily. Interestingly, we note that knowing a single visit from a target trace can still train a classifier that is better than random. For one CDR target with $9$ visits in their full trace, we observed ZK achieve an AUC of $0.660$ across $100$ aggregates when only $1$ random visit was known. Although far from perfect, we found this surprising, as it shows that even a single visit by the target can inform an MIA against a noisy aggregate over $1000$ users.

Figure 10. Mean AUC scores with standard error for ZK and KK on size $1000$ aggregates from the Milan dataset across different privacy budgets $\varepsilon$ under user-day level $\varepsilon$-DP, for MIAs using independent sampling vs. paired sampling. Panels: (a) Zero Auxiliary Knowledge; (b) Knock-Knock.

6.4. Paired Sampling vs. Independent Sampling

In this section, we study the performance of KK MIA and ZK MIA when we vary the sampling technique used for creating their training aggregates, i.e., paired sampling (PS) or independent sampling (IS). To test this, we consider the implementation of user-day $\varepsilon$-DP on the Milan dataset across the privacy budgets $\varepsilon = 1, 2, 3, \ldots, 10$ for all four possible MIAs: KK (PS), KK (IS), ZK (PS), ZK (IS).

Results. From Figure 10, we see that the paired sampling MIA always outperforms its independent sampling equivalent across all privacy budgets $\varepsilon = 1, 2, 3, \ldots, 10$. Paired sampling provides the largest boost when the inference task is challenging but not intractable. In particular, we notice a few striking examples of ZK (PS) drastically outperforming ZK (IS) in the middle of the graph. For $\varepsilon = 4$, ZK (IS) is basically random ($AUC = 0.54$), yet simply switching to paired sampling enables the classifier to achieve an AUC of $0.83$.

The improvement achieved by switching from independent sampling to paired sampling is less significant for KK in this experiment. This suggests that using training aggregates sampled from a reference dataset of real traces may introduce less randomness to the membership classifier’s decision boundary, compared to when we use a synthetic reference. This is intuitive because of ZK’s probabilistic generation method, which relies on sampling from three different estimated distributions. However, the MIAs have indistinguishable performance when both attacks use paired sampling, with their AUCs always staying within $0.02$ of one another across the $10$ privacy settings. This suggests that paired sampling effectively eliminates the noise contributed by coincidental patterns in random entries, and enables the membership classifier to form a suitable decision boundary.

7. Discussion

We first provide a critical analysis of the experimental results, followed by a discussion of mitigation strategies and their practicality. We then consider limitations in our methods and evaluations, before discussing how our methodology may be generalized to MIAs beyond the setting of aggregate location data.

7.1. Analysis of Results: Practical Risk of MIA

In Section 6, we observed that our ZK MIA achieves approximately the same performance as the KK MIA across all experiments, and that both MIAs performed effectively across a range of common privacy settings. This has several important implications.

First, the ZK MIA significantly increases the attack surface of aggregate location data, since no auxiliary dataset is needed for the MIA. Although previous MIAs on aggregate location data have been successful, the strong assumption of a large auxiliary dataset prevents these attackers from attempting the MIA in most real-life cases. The auxiliary dataset comprises sensitive user-level information collected from the same dataset that is being aggregated. However, aggregation is applied to prevent the release of personal information. Moreover, $Adv$ would be restricted by the size of their auxiliary dataset, since they would not be able to perform MIAs on aggregates computed over more users than there are in their reference. In contrast, we demonstrated that our Zero Auxiliary Knowledge $Adv$ can create an arbitrary number of synthetic traces upon seeing the released aggregate, without an auxiliary dataset. This offers $Adv$ the flexibility to attack aggregates of any size. Section D.1 in the Appendix shows results for KK and ZK MIA across aggregates of size $m = 100, 500, 1000, 2000, 3000$. Figure 12 of the Appendix also shows that the ZK $Adv$ can boost their own performance up to diminishing marginal returns, simply by generating more traces.

We have also shown that MIAs on aggregate location data are more powerful than previously known. By incorporating paired sampling for training, we have demonstrated more effective MIA results on size $1000$ aggregates than previously reported (Pyrgelis et al., 2020), particularly when protected by differential privacy. Our results therefore demonstrate that MIAs on aggregate location data are easily performed without auxiliary data and are more effective than previously believed, and that common privacy measures fail to protect against the risk.

7.2. Proposed Mitigations

Our results show that aggregated location data requires more stringent privacy safeguards to protect against MIAs. This is an inherently challenging task because our ZK MIA was able to succeed by using the aggregate to estimate where the population moves (space marginal), when the population moves (time marginal), and how frequently the population moves (activity marginal). However, location aggregates naturally leak this information. In fact, much of their utility is derived from these marginal statistics. Therefore, while one can mitigate the ZK MIA by perturbing the aggregate to the point where the basic mobility patterns of its population are unrecoverable, doing so may also destroy the aggregate’s utility.

In light of these results, we advise that data practitioners be mindful of the parameters that they select for $\varepsilon$-DP, because DP does not guarantee sufficient protection from MIAs if the parameters are chosen too loosely. Since data practitioners often prioritize utility, it is common to pick more relaxed parameters for the privacy unit (e.g. event or user-day instead of user) and budget $\varepsilon$ (e.g. $\varepsilon > 1$). For example, Kohli et al. (2023) studied $\varepsilon \in \{0.1, 0.5, 1.0\}$ at the event level in the context of aggregate O-D mobility matrices, Facebook used $\varepsilon = 0.45$ at the event level when collecting data about URLs shared on the site, and Apple uses $\varepsilon$ between $2$ and $16$ at the user-day level when collecting iOS data (Desfontaines, 2021). Recall that ZK and KK achieved $AUC \geq 0.9$ on the CDR dataset whenever the noise scale $\frac{\Delta}{\varepsilon}$ was $2$ or less. This corresponds to $\varepsilon \geq 0.5$ for event level DP and $\varepsilon \geq 10$ for user-day level DP with up to $\Delta = 20$ daily visits. ZK and KK therefore both achieved high discrimination on privacy settings that are in line with many real-life applications.

However, we do observe that DP mitigates the MIAs when sufficiently strict parameters are chosen. For example, no MIA achieved better than random performance for $\epsilon = 0.5$ in the user-day setting. We did not evaluate the user-level setting, which would provide the strongest privacy protection. The suitability of privacy parameters ultimately depends on the desired utility and on the sensitivity of the dataset. A stricter parameter choice is particularly relevant if the aggregate is publicly released and/or pertains to sensitive data.

Although $\varepsilon$-DP always offers privacy guarantees, our experimental results emphasize the importance of picking appropriate parameters. In particular, we observed that event-level DP was largely ineffective in preventing MIAs from both strong and weak adversaries. We instead encourage the use of user-day or user-level DP with carefully selected privacy budgets $\varepsilon$ to mitigate the practical threat of an MIA.

7.3. Limitations

We have so far taken the Knock-Knock MIA to refer to the Subset of Locations setting (Pyrgelis et al., 2017). We now address why we do not consider the Knock-Knock Participation in Past Groups (Pyrgelis et al., 2017) threat model in this paper. Under the Participation in Past Groups setting, the adversary has access to a set of past aggregates $\{\overline{A}^{\tilde{\mathcal{U}_1}}, \ldots, \overline{A}^{\tilde{\mathcal{U}_N}}\}$, collected over the same ROIs $\mathcal{S}$ as the released aggregate $\overline{A}^{\mathcal{U}}$. Moreover, $Adv$ is assumed to know the membership status of the target $u^*$ in each of these aggregates. That is, $Adv$ knows whether or not $u^* \in \mathcal{U}_i$ for all $i = 1, \ldots, N$.
This last assumption is crucial because $Adv$ directly uses $\{\overline{A}^{\tilde{\mathcal{U}_1}}, \ldots, \overline{A}^{\tilde{\mathcal{U}_N}}\}$ as training data for the membership classifier in this setting. This is unrealistic for multiple reasons. First, training an effective membership classifier would require hundreds of labeled aggregates. More importantly, there would be no reason for the membership status of an individual within an aggregate to be released in practice. We argue that the only plausible scenario in which $Adv$ would know the target's membership status in each aggregate is if they created the aggregates themselves. This reduces to the Subset of Locations setting that we have assumed in this paper.

In terms of limitations for our ZK MIA, recall that the Delaunay triangulation of the ROIs, $DT(\mathcal{S})$, is the only non-probabilistic parameter used to generate synthetic traces for the ZK MIA. The triangulation only depends on the locations of the ROIs, which we have so far assumed to be shared as part of the aggregate release (Section 2.5). We believe this to be a realistic assumption, as omitting the locations of the ROIs would strongly diminish the utility of aggregate location data. Nonetheless, there might exist cases where released location aggregates do not relay the positions of the ROIs. For example, Google binned ROIs into categories, e.g. restaurants, parks, and hospitals, when publicly releasing their mobility reports during COVID (https://www.google.com/covid19/mobility/). In this setting, the adversary would proceed without knowing where the ROIs are situated with respect to one another. The privacy risk under this setting is not known, and we identify it as an area of future research. Similarly, there might exist cases where the adversary knows the ROIs that were visited by the target (e.g. home and work), but not the visitation times. We show in Appendix D.3 that only knowing the visited ROIs substantially reduces the effectiveness of both MIAs.

ZK MIA also requires that we estimate statistical parameters from the released aggregate. It may be difficult to estimate these precisely if the aggregate size is small or if the collected location data is not regular. However, we still observe strong performance by ZK MIA on both datasets for small aggregate sizes (see Appendix D.1).

Furthermore, aggregate location data collected over large metropolitan populations is known to exhibit high regularity across different cities and time periods. These patterns include log-normal activity distributions (Schneider et al., 2013; Seshadri et al., 2008; Farzanehfar et al., 2021) and periodic "circadian rhythm" time marginals (Csáji et al., 2013; Song et al., 2010; Seshadri et al., 2008; Farzanehfar et al., 2021). This suggests that our statistical parameter estimation should be highly transferable across sufficiently regular datasets. However, we acknowledge that there are scenarios where the observed population is not regular (e.g. taxi drivers).

7.4. Generalizations to MIAs in other Settings

In this paper, we have proposed a new methodology to perform membership inference attacks on aggregate data, by training the attack on synthetic records generated from the released aggregate. We believe that this approach can be adapted for MIAs in settings beyond aggregate location data. Our methodology can be broken down into two main steps: 1) extracting noiseless global statistics from the released aggregate, and 2) using these statistics to create individual-level records to train the MIA.
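The two steps can be illustrated with a deliberately simplified sketch. It is not the ZK MIA generator: it estimates only space and time marginals, replaces the denoising and debiasing procedures with a plain clipping of negative counts, and samples visits independently, omitting the Delaunay triangulation and the activity marginal. All function names and the toy aggregate are hypothetical.

```python
import random
from typing import List, Tuple

Aggregate = List[List[float]]  # rows: ROIs, columns: epochs


def estimate_marginals(aggregate: Aggregate) -> Tuple[List[float], List[float]]:
    """Step 1 (toy version): normalised space and time marginals.

    Negative entries, which noise addition can produce, are clipped to zero.
    This is a stand-in for proper denoising, not the paper's procedure.
    """
    clipped = [[max(count, 0.0) for count in row] for row in aggregate]
    total = sum(sum(row) for row in clipped)
    if total <= 0:
        raise ValueError("aggregate has no positive counts")
    space = [sum(row) / total for row in clipped]
    time = [sum(column) / total for column in zip(*clipped)]
    return space, time


def sample_record(
    space: List[float], time: List[float], n_visits: int, rng: random.Random
) -> List[Tuple[int, int]]:
    """Step 2 (toy version): draw n_visits independent (ROI, epoch) pairs."""
    rois = rng.choices(range(len(space)), weights=space, k=n_visits)
    epochs = rng.choices(range(len(time)), weights=time, k=n_visits)
    return list(zip(rois, epochs))


rng = random.Random(0)
toy_aggregate = [[5.0, 3.0, -1.0], [0.0, 2.0, 4.0]]  # 2 ROIs, 3 epochs
space, time = estimate_marginals(toy_aggregate)
records = [sample_record(space, time, n_visits=3, rng=rng) for _ in range(4)]
```

The resulting synthetic records would then serve as training data for the membership classifier. For another data type, only the two functions would need to be replaced by statistics and a sampler suited to that data.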

In the setting of location data, the relevant statistics pertain to the mobility trends of large-scale human populations (Schneider et al., 2013; Seshadri et al., 2008; Farzanehfar et al., 2021; Csáji et al., 2013; Song et al., 2010) and to individual location data (Kulkarni and Garbinato, 2017; Kulkarni et al., 2018; Ouyang et al., 2018; Karagiannis et al., 2007; Lee et al., 2009), both of which are well established in the literature. This facilitates both steps of our methodology, as we know in advance what location data should look like at both the global and individual level.

Although the trends will be distinct from aggregate location data, aggregate releases for other types of data will generally reveal global statistics. For instance, categorical tabular data is modeled by discrete random variables, whereas location data is modeled by continuous random variables, and approximated by high-dimensional discrete data. Our methods for denoising and debiasing statistics from differentially private and suppressed aggregates are however not specific to location data, and should generalize to other data releases. Regarding the second step, using the statistics to create individual records for training, the probabilistic method used for our ZK MIA, drawing from the Delaunay triangulation and the relevant marginal distributions, is partially specific to location data. One would thus need to carefully consider the statistical properties of the type of data to create high quality individual records.

8. Conclusion

Aggregate location data is widely shared and used by governments (Savage, 2021; Hope, 2021; Oli, 2021), companies (Apple, 2017; Aktay et al., 2020; Herdağdelen and Dow, 2021), and researchers (Trasberg and Cheshire, 2023; Jeffrey et al., 2020; Kohli et al., 2023) because of its insights into human behaviour and its presumed security against reidentification.

In this paper, we demonstrated that aggregate location data is susceptible to MIAs by realistic adversaries, who only know some of their target’s location history. With ZK MIA, we introduced the first MIA on aggregate location data that does not require an auxiliary dataset. We accomplished this by generating appropriate synthetic traces, using statistics that are estimated from the released aggregate. We also equipped our parameter estimation with techniques that automatically correct for bias and noise from popular privacy mechanisms like suppression of small counts and $\varepsilon$-DP noise.

We then showed that MIAs on aggregate location data are significantly improved by incorporating a paired sampling technique, which helps isolate the effect of the target trace within a high dimensional aggregate. Hence, the vulnerability of aggregate location data is further heightened by these improved attacks.

Our evaluations over two large datasets demonstrate that, despite the absence of an auxiliary dataset, ZK MIA performs as well as the state-of-the-art KK MIA, with both MIAs achieving high discrimination when commonly used privacy settings are applied. ZK MIA remains effective in realistic privacy settings, even when only a small fraction ($10\%$) of the target trace is known. These results emphasize the need for strict differential privacy guarantees on released aggregate location data.

Taken together, our findings show that membership inference attacks are not merely a theoretical privacy threat posed by unrealistically strong adversaries, but also a realistic threat to contend with in practice.

Acknowledgements.
Ana-Maria Cretu did most of her work while she was at Imperial College London and was partially funded by the Agence Française de Développement via the Flowminder Foundation. The authors would like to sincerely thank the Flowminder team for their support with this work, in particular Galina Veres and James Harrison for their help in designing a secure data-sharing pipeline to test the MIAs on the CDR dataset. The authors would like to further acknowledge Cyril Miras for his early work on MIAs on aggregate location data, including the implementation of the baseline rule MIA. Finally, the authors would like to thank the anonymous reviewers and shepherd for their feedback on the paper.

References

  • wp2, (2014) (2014). Article 29 data protection working party. opinion 05/2014 on anonymisation techniques. https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf.
  • flo, (2024) (2024). Flowminder website. https://www.flowminder.org/.
  • Aktay et al., (2020) Aktay, A., Bavadekar, S., Cossoul, G., Davis, J., Desfontaines, D., Fabrikant, A., Gabrilovich, E., Gadepalli, K., Gipson, B., Guevara, M., et al. (2020). Google covid-19 community mobility reports: anonymization process description (version 1.1). arXiv preprint arXiv:2004.04145.
  • Apple, (2017) Apple, D. (2017). Learning with privacy at scale. Apple Machine Learning Journal, 1(8).
  • Barlacchi et al., (2015) Barlacchi, G., De Nadai, M., Larcher, R., Casella, A., Chitic, C., Torrisi, G., Antonelli, F., Vespignani, A., Pentland, A., and Lepri, B. (2015). A multi-source dataset of urban life in the city of milan and the province of trentino. Scientific data, 2(1):1–15.
  • Bauer and Bindschaedler, (2020) Bauer, L. A. and Bindschaedler, V. (2020). Towards realistic membership inferences: The case of survey data. In Annual Computer Security Applications Conference, pages 116–128.
  • Bishop and Nasrabadi, (2006) Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern recognition and machine learning, volume 4. Springer.
  • Boorstein and Kelly, (2023) Boorstein, M. and Kelly, H. (2023). Colorado catholic group bought app data that tracked gay priests. The Washington Post.
  • Chen et al., (2009) Chen, B.-C., Kifer, D., LeFevre, K., Machanavajjhala, A., et al. (2009). Privacy-preserving data publishing. Foundations and Trends® in Databases, 2(1–2):1–167.
  • Creţu et al., (2021) Creţu, A.-M., Guépin, F., and de Montjoye, Y.-A. (2021). Correlation inference attacks against machine learning models. arXiv preprint arXiv:2112.08806.
  • Cretu et al., (2022) Cretu, A.-M., Houssiau, F., Cully, A., and de Montjoye, Y.-A. (2022). Querysnout: Automating the discovery of attribute inference attacks against query-based systems. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 623–637.
  • Csáji et al., (2013) Csáji, B. C., Browet, A., Traag, V. A., Delvenne, J.-C., Huens, E., Van Dooren, P., Smoreda, Z., and Blondel, V. D. (2013). Exploring the mobility of mobile phone users. Physica A: statistical mechanics and its applications, 392(6):1459–1473.
  • de Montjoye et al., (2018) de Montjoye, Y.-A., Gambs, S., Blondel, V., Canright, G., De Cordes, N., Deletaille, S., Engø-Monsen, K., Garcia-Herranz, M., Kendall, J., Kerry, C., et al. (2018). On the privacy-conscientious use of mobile phone data. Scientific data, 5(1):1–6.
  • de Montjoye et al., (2013) de Montjoye, Y.-A., Hidalgo, C. A., Verleysen, M., and Blondel, V. D. (2013). Unique in the crowd: The privacy bounds of human mobility. Scientific reports, 3(1):1–5.
  • Delaunay et al., (1934) Delaunay, B. et al. (1934). Sur la sphere vide. Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk, 7(793-800):1–2.
  • Desfontaines, (2021) Desfontaines, D. (2021). A list of real-world uses of differential privacy.
  • Dwork et al., (2019) Dwork, C., Kohli, N., and Mulligan, D. (2019). Differential privacy in practice: Expose your epsilons! Journal of Privacy and Confidentiality, 9(2).
  • Dwork et al., (2006) Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer.
  • Farzanehfar et al., (2021) Farzanehfar, A., Houssiau, F., and de Montjoye, Y.-A. (2021). The risk of re-identification remains high even in country-scale location datasets. Patterns, 2(3):100204.
  • Gadotti et al., (2019) Gadotti, A., Houssiau, F., Rocher, L., Livshits, B., and De Montjoye, Y.-A. (2019). When the signal is in the noise: Exploiting diffix’s sticky noise. In 28th USENIX Security Symposium (USENIX Security 19), pages 1081–1098.
  • Ge and Fukuda, (2016) Ge, Q. and Fukuda, D. (2016). Updating origin–destination matrices with aggregated data of gps traces. Transportation Research Part C: Emerging Technologies, 69:291–312.
  • Georgiadou et al., (2019) Georgiadou, Y., de By, R. A., and Kounadi, O. (2019). Location privacy in the wake of the gdpr. ISPRS international journal of geo-information, 8(3):157.
  • Grantz et al., (2020) Grantz, K. H., Meredith, H. R., Cummings, D. A., Metcalf, C. J. E., Grenfell, B. T., Giles, J. R., Mehta, S., Solomon, S., Labrique, A., Kishore, N., et al. (2020). The use of mobile phone data to inform analysis of covid-19 pandemic epidemiology. Nature communications, 11(1):4961.
  • Guépin et al., (2023) Guépin, F., Meeus, M., Cretu, A.-M., and de Montjoye, Y.-A. (2023). Synthetic is all you need: removing the auxiliary data assumption for membership inference attacks against synthetic data. arXiv preprint arXiv:2307.01701.
  • Hara and Yamaguchi, (2021) Hara, Y. and Yamaguchi, H. (2021). Japanese travel behavior trends and change under covid-19 state-of-emergency declaration: Nationwide observation by mobile phone location data. Transportation Research Interdisciplinary Perspectives, 9:100288.
  • Herdağdelen and Dow, (2021) Herdağdelen, A. and Dow, A. (2021). Protecting privacy in facebook mobility data during the covid-19 response (2020). URL https://research.fb.com/blog/2020/06/protecting-privacy-in-facebook-mobility-data-during-the-covid-19-response.
  • Holmes et al., (2013) Holmes, A., Byrne, A., and Rowley, J. (2013). Mobile shopping behaviour: insights into attitudes, shopping process involvement and location. International Journal of Retail & Distribution Management, 42(1):25–39.
  • Homer et al., (2008) Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W., Muehling, J., Pearson, J. V., Stephan, D. A., Nelson, S. F., and Craig, D. W. (2008). Resolving individuals contributing trace amounts of dna to highly complex mixtures using high-density snp genotyping microarrays. PLoS genetics, 4(8):e1000167.
  • Hope, (2021) Hope, C. (2021). Millions ’unwittingly tracked’ by phone after vaccination to see if movements changed.
  • Houssiau et al., (2022) Houssiau, F., Jordon, J., Cohen, S. N., Daniel, O., Elliott, A., Geddes, J., Mole, C., Rangel-Smith, C., and Szpruch, L. (2022). Tapas: A toolbox for adversarial privacy auditing of synthetic data. arXiv preprint arXiv:2211.06550.
  • Humphries et al., (2023) Humphries, T., Oya, S., Tulloch, L., Rafuse, M., Goldberg, I., Hengartner, U., and Kerschbaum, F. (2023). Investigating membership inference attacks under data dependencies. In 2023 IEEE 36th Computer Security Foundations Symposium (CSF), pages 473–488. IEEE.
  • Jagielski et al., (2020) Jagielski, M., Ullman, J., and Oprea, A. (2020). Auditing differentially private machine learning: How private is private sgd? Advances in Neural Information Processing Systems, 33:22205–22216.
  • Jahromi et al., (2016) Jahromi, K. K., Zignani, M., Gaito, S., and Rossi, G. P. (2016). Simulating human mobility patterns in urban areas. Simulation Modelling Practice and Theory, 62:137–156.
  • Jayaraman and Evans, (2019) Jayaraman, B. and Evans, D. (2019). Evaluating differentially private machine learning in practice. In 28th USENIX Security Symposium (USENIX Security 19), pages 1895–1912.
  • Jeffrey et al., (2020) Jeffrey, B., Walters, C. E., Ainslie, K. E., Eales, O., Ciavarella, C., Bhatia, S., Hayes, S., Baguelin, M., Boonyasiri, A., Brazeau, N. F., et al. (2020). Anonymised and aggregated crowd level mobility data from mobile phones suggests that initial compliance with covid-19 social distancing interventions was high and geographically consistent across the uk. Wellcome Open Research, 5.
  • Kakakhel, (2022) Kakakhel, S. (2022). Optimising urban planning with location intelligence. Quadrant Blog. Accessed: 2024-03-07.
  • Karagiannis et al., (2007) Karagiannis, T., Le Boudec, J.-Y., and Vojnović, M. (2007). Power law and exponential decay of inter contact times between mobile devices. In Proceedings of the 13th annual ACM international conference on Mobile computing and networking, pages 183–194.
  • Kohli et al., (2023) Kohli, N., Aiken, E., and Blumenstock, J. (2023). Privacy guarantees for personal mobility data in humanitarian response. arXiv preprint arXiv:2306.09471.
  • Kulkarni and Garbinato, (2017) Kulkarni, V. and Garbinato, B. (2017). Generating synthetic mobility traffic using rnns. In Proceedings of the 1st Workshop on Artificial Intelligence and Deep Learning for Geographic Knowledge Discovery, pages 1–4.
  • Kulkarni et al., (2018) Kulkarni, V., Tagasovska, N., Vatter, T., and Garbinato, B. (2018). Generative models for simulating mobility trajectories. arXiv preprint arXiv:1811.12801.
  • Lee et al., (2009) Lee, K., Hong, S., Kim, S. J., Rhee, I., and Chong, S. (2009). Slaw: A new mobility model for human walks. In IEEE INFOCOM 2009, pages 855–863. IEEE.
  • Li et al., (2013) Li, N., Qardaji, W., Su, D., Wu, Y., and Yang, W. (2013). Membership privacy: A unifying framework for privacy definitions. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 889–900.
  • Martínez-Durive et al., (2023) Martínez-Durive, O. E., Mishra, S., Ziemlicki, C., Rubrichi, S., Smoreda, Z., and Fiore, M. (2023). The netmob23 dataset: A high-resolution multi-region service-level mobile data traffic cartography. arXiv preprint arXiv:2305.06933.
  • Meeus et al., (2023) Meeus, M., Guepin, F., Creţu, A.-M., and de Montjoye, Y.-A. (2023). Achilles’ heels: vulnerable record identification in synthetic data publishing. In European Symposium on Research in Computer Security, pages 380–399. Springer.
  • Miguel Alonso and Richard, (2004) Miguel Alonso, B. D. and Richard, G. (2004). Tempo and beat estimation of musical signals. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), Barcelona, Spain.
  • Morgan and Lovelace, (2021) Morgan, M. and Lovelace, R. (2021). Travel flow aggregation: Nationally scalable methods for interactive and online visualisation of transport behaviour at the road network level. Environment and Planning B: Urban Analytics and City Science, 48(6):1684–1696.
  • Müller, (2015) Müller, M. (2015). Logarithmic compression. https://www.audiolabs-erlangen.de/resources/MIR/FMP/C3/C3S1_LogCompression.html.
  • Nasr et al., (2019) Nasr, M., Shokri, R., and Houmansadr, A. (2019). Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE symposium on security and privacy (SP), pages 739–753. IEEE.
  • Nasr et al., (2021) Nasr, M., Song, S., Thakurta, A., Papernot, N., and Carlini, N. (2021). Adversary instantiation: Lower bounds for differentially private machine learning. In 2021 IEEE Symposium on security and privacy (SP), pages 866–882. IEEE.
  • O2, (2019) O2 (2019). O2 transport smart steps product sheet. https://static-www.o2.co.uk/sites/default/files/2019-04/o2-transport-smart-steps-product-sheet.pdf. [Online].
  • Oehmichen et al., (2019) Oehmichen, A., Jain, S., Gadotti, A., and de Montjoye, Y.-A. (2019). Opal: High performance platform for large-scale privacy-preserving location data analytics. In 2019 IEEE International Conference on Big Data (Big Data), pages 1332–1342. IEEE.
  • Office of the Privacy Commissioner of Canada, (2023) Office of the Privacy Commissioner of Canada (2023). Investigation into the collection and use of de-identified mobility data in the course of the covid-19 pandemic. Accessed: 2023-09-14.
  • Oli, (2021) Oli, S. (2021). Canada’s public health agency admits it tracked 33 million mobile devices during lockdown. National Post, 24.
  • Ouyang et al., (2018) Ouyang, K., Shokri, R., Rosenblum, D. S., and Yang, W. (2018). A non-parametric generative model for human trajectories. In IJCAI, volume 18, pages 3812–3817.
  • Popa et al., (2011) Popa, R. A., Blumberg, A. J., Balakrishnan, H., and Li, F. H. (2011). Privacy and accountability for location-based aggregate statistics. In Proceedings of the 18th ACM conference on Computer and communications security, pages 653–666.
  • Precisely, (2024) Precisely (2024). Placeiq movement. https://www.precisely.com/product/precisely-placeiq/placeiq-movement. Accessed: 2024-02-13.
  • Pyrgelis et al., (2017) Pyrgelis, A., Troncoso, C., and De Cristofaro, E. (2017). Knock knock, who’s there? membership inference on aggregate location data. arXiv preprint arXiv:1708.06145.
  • Pyrgelis et al., (2020) Pyrgelis, A., Troncoso, C., and De Cristofaro, E. (2020). Measuring membership privacy on aggregate location time-series. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 4(2):1–28.
  • SafeGraph, (2024) SafeGraph (2024). Enrich pois with aggregated transaction data. https://www.safegraph.com/products/spend. Accessed: 2024-02-13.
  • Salem et al., (2018) Salem, A., Zhang, Y., Humbert, M., Berrang, P., Fritz, M., and Backes, M. (2018). Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models. arXiv preprint arXiv:1806.01246.
  • Sankararaman et al., (2009) Sankararaman, S., Obozinski, G., Jordan, M. I., and Halperin, E. (2009). Genomic privacy and limits of individual detection in a pool. Nature genetics, 41(9):965–967.
  • Savage, (2021) Savage, C. (2021). Intelligence analysts use u.s. smartphone location data without warrants, memo says. The New York Times. Available at https://www.nytimes.com/2021/01/22/us/politics/dia-surveillance-data.html.
  • Schneider et al., (2013) Schneider, C. M., Belik, V., Couronné, T., Smoreda, Z., and González, M. C. (2013). Unravelling daily human mobility motifs. Journal of The Royal Society Interface, 10(84):20130246.
  • Seshadri et al., (2008) Seshadri, M., Machiraju, S., Sridharan, A., Bolot, J., Faloutsos, C., and Leskovec, J. (2008). Mobile call graphs: beyond power-law and lognormal distributions. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 596–604.
  • Shokri et al., (2017) Shokri, R., Stronati, M., Song, C., and Shmatikov, V. (2017). Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pages 3–18. IEEE.
  • Song et al., (2010) Song, C., Qu, Z., Blumm, N., and Barabási, A.-L. (2010). Limits of predictability in human mobility. Science, 327(5968):1018–1021.
  • SpazioDati and di Milano, (2015) SpazioDati and di Milano, D. P. (2015). Social Pulse - Milano.
  • Stadler et al., (2022) Stadler, T., Oprisanu, B., and Troncoso, C. (2022). Synthetic data–anonymisation groundhog day. In 31st USENIX Security Symposium (USENIX Security 22), pages 1451–1468.
  • Telus, (2024) Telus (2024). Telus Insights Location API. https://docs.insights.telus.com/. [Online].
  • Tournier and de Montjoye, (2022) Tournier, A. J. and de Montjoye, Y.-A. (2022). Expanding the attack surface: Robust profiling attacks threaten the privacy of sparse behavioral data. Science Advances, 8(33):eabl6464.
  • Trasberg and Cheshire, (2023) Trasberg, T. and Cheshire, J. (2023). Spatial and social disparities in the decline of activities during the covid-19 lockdown in greater london. Urban Studies, 60(8):1427–1447.
  • Truex et al., (2019) Truex, S., Liu, L., Gursoy, M. E., Yu, L., and Wei, W. (2019). Demystifying membership inference attacks in machine learning as a service. IEEE Transactions on Services Computing, 14(6):2073–2089.
  • Van Zoonen, (2016) Van Zoonen, L. (2016). Privacy concerns in smart cities. Government Information Quarterly, 33(3):472–480.
  • Xu et al., (2015) Xu, Y., Shaw, S.-L., Zhao, Z., Yin, L., Fang, Z., and Li, Q. (2015). Understanding aggregate human mobility patterns using passive mobile phone location data: a home-based approach. Transportation, 42:625–646.
  • Yabe et al., (2022) Yabe, T., Jones, N. K., Rao, P. S. C., Gonzalez, M. C., and Ukkusuri, S. V. (2022). Mobile phone location data for disasters: A review from natural hazards and epidemics. Computers, Environment and Urban Systems, 94:101777.
  • Yeom et al., (2018) Yeom, S., Giacomelli, I., Fredrikson, M., and Jha, S. (2018). Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st computer security foundations symposium (CSF), pages 268–282. IEEE.
  • Zang and Bolot, (2011) Zang, H. and Bolot, J. (2011). Anonymization of location data does not work: A large-scale measurement study. In Proceedings of the 17th annual international conference on Mobile computing and networking, pages 145–156.
  • Zhang et al., (2020) Zhang, G., Zhang, A., and Zhao, P. (2020). Locmia: Membership inference attacks against aggregated location data. IEEE Internet of Things Journal, 7(12):11778–11788.
  • Zhou, (2017) Zhou, T. (2017). Understanding location-based services users’ privacy concern: An elaboration likelihood model perspective. Internet Research, 27(3):506–519.
  • Zhu et al., (2022) Zhu, K., Fioretto, F., and Van Hentenryck, P. (2022). Post-processing of differentially private data: A fairness perspective. arXiv preprint arXiv:2201.09425.

Appendix

Table 1. Glossary of notations.

| Notation | Definition |
|---|---|
| $\mathcal{S}$ | Set of regions of interest (ROIs) |
| $\mathcal{T}$ | Set of epochs in observation period |
| $\Omega$ | Set of all users in the dataset |
| $L^u$ | Location trace of user $u \in \Omega$ over $\mathcal{S} \times \mathcal{T}$ |
| $\mathcal{U}$ | Aggregation group of users sampled from $\Omega$ |
| $A^{\mathcal{U}}$ | Raw aggregate count matrix in $\mathcal{S} \times \mathcal{T}$ over users in $\mathcal{U}$ |
| $A_{DP}^{\mathcal{U}}(\varepsilon)$ | An $\varepsilon$-DP aggregate |
| $A_{SSC}^{\mathcal{U}}(k)$ | An aggregate with counts $\leq k$ suppressed |
| $A_{DP,SSC}^{\mathcal{U}}(\varepsilon, k)$ | An $\varepsilon$-DP aggregate with counts $\leq k$ suppressed |
| $\overline{A}^{\mathcal{U}}$ | The released aggregate count matrix |
| $m$ | Number of users in the aggregation group $\mathcal{U}$ |
| $u^*$ | Target drawn from full population, $u^* \in \Omega$ |
| $Adv$ | Adversary performing MIA on $u^*$ |

Table 2. Default experiment parameters.

| Default value | Definition |
|---|---|
| $n_{train} = 400$ | Number of training aggregates |
| $n_{val} = 100$ | Number of validation aggregates |
| $n_{test} = 100$ | Number of test aggregates |
| $n_{target} = 50$ | Number of targets |
| $m = 1000$ | Aggregate size |
| $\lvert Ref \rvert = 5000$ (CDR), $2500$ (Milan) | Traces in $Adv$’s real (KK) or synthetic (ZK) reference |
| $p_{u^*} = 1$ | Fraction of $L^{u^*}$ known by $Adv$ |
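The two privacy mechanisms named in Table 1 can be illustrated with a minimal sketch. It assumes the Laplace mechanism with scale $\frac{\Delta}{\varepsilon}$ for the $\varepsilon$-DP aggregate and assumes that suppressed counts are replaced by zero. The order of composition shown (noise first, suppression second) is also an assumption rather than something stated in this section, and all names are hypothetical.

```python
import random
from typing import List

Aggregate = List[List[float]]  # rows: ROIs, columns: epochs


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def add_dp_noise(
    aggregate: Aggregate, epsilon: float, sensitivity: float, rng: random.Random
) -> Aggregate:
    """Add independent Laplace noise of scale sensitivity / epsilon to every count."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scale = sensitivity / epsilon
    return [[count + laplace_noise(scale, rng) for count in row] for row in aggregate]


def suppress_small_counts(aggregate: Aggregate, k: float) -> Aggregate:
    """Replace every count <= k by 0.0 (the replacement value is an assumption)."""
    return [[0.0 if count <= k else count for count in row] for row in aggregate]


rng = random.Random(0)
raw = [[0.0, 1.0, 7.0], [12.0, 3.0, 0.0]]
noisy = add_dp_noise(raw, epsilon=1.0, sensitivity=1.0, rng=rng)
# Assumed composition order: noise first, then suppression of noisy counts <= k.
released = suppress_small_counts(noisy, k=1.0)
```

Both functions return a new matrix of the same shape, so either can be applied on its own to obtain the analogue of $A_{DP}^{\mathcal{U}}(\varepsilon)$ or $A_{SSC}^{\mathcal{U}}(k)$.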

Appendix A Supplementary Proofs

Definition 0.

(Oracle average count) Given a raw aggregate $A^{\mathcal{U}}$, we define the oracle average count function $\mu_S : \mathcal{S} \to \mathbb{R}^{+}$ as

$$\mu_S(s) = \lim_{|\mathcal{T}| \to \infty} \frac{\sum_{t=1}^{|\mathcal{T}|} A_{s,t}^{\mathcal{U}}}{|\mathcal{T}|}. \tag{11}$$

Letting $|\mathcal{T}| \to \infty$ corresponds to extending the observation period indefinitely. Thus, $\mu_S(s)$ represents the expected number of users in $\mathcal{U}$ who visit ROI $s$ at a randomly selected epoch, given infinite location data over the ROIs $\mathcal{S}$.
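On a finite release, the limit in Eq. (11) can only be approximated by the average over the observed epochs. The following is a minimal sketch of that finite-horizon average, assuming the aggregate is stored as a nested list indexed as `aggregate[s][t]`; the function name `average_count` and the toy values are hypothetical and not part of the original work.

```python
def average_count(aggregate, s):
    """Finite-horizon version of Eq. (11): mean count at ROI s over observed epochs."""
    row = aggregate[s]
    return sum(row) / len(row)


# Hypothetical toy aggregate: 2 ROIs observed over 4 epochs.
toy_aggregate = [
    [3, 1, 0, 4],  # ROI 0
    [0, 0, 1, 0],  # ROI 1
]

averages = [average_count(toy_aggregate, s) for s in range(len(toy_aggregate))]
print(averages)  # [2.0, 0.25]
```

The oracle quantity $\mu_S(s)$ is the value this average would approach if the observation period could be extended indefinitely.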

Definition A.2.

(Strong sparsity) We say that $A^{\mathcal{U}}$ is strongly sparse if

$$\mu_S(s) = 0 \quad \forall s \in \mathcal{S}. \tag{12}$$

Equivalently, $\sum_{t=1}^{|\mathcal{T}|} A_{s,t}^{\mathcal{U}} \in o(|\mathcal{T}|)$ for all $s \in \mathcal{S}$. This is a strong assumption, as it implies that the cumulative number of visits to each ROI grows sublinearly in the number of epochs, so that the average visitation rate per epoch vanishes.
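As a worked illustration of the definition, consider a hypothetical ROI $s$ that receives a single visit only at epochs that are perfect squares. This example is an assumption made for illustration and does not come from the datasets used in the paper.

```latex
% Hypothetical ROI s visited by one user only at perfect-square epochs.
A_{s,t}^{\mathcal{U}} =
\begin{cases}
1, & \text{if } t = k^2 \text{ for some } k \in \mathbb{N} \\
0, & \text{otherwise}
\end{cases}
\quad\Longrightarrow\quad
\frac{1}{|\mathcal{T}|}\sum_{t=1}^{|\mathcal{T}|} A_{s,t}^{\mathcal{U}}
= \frac{\lfloor \sqrt{|\mathcal{T}|} \rfloor}{|\mathcal{T}|}
\leq \frac{1}{\sqrt{|\mathcal{T}|}} \to 0 .
```

Such an ROI satisfies $\mu_S(s) = 0$. By contrast, an ROI that receives one visit in every epoch has $\mu_S(s) = 1$ and violates strong sparsity.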

Lemma A.3.

Given a fixed geographic region in which location data is collected,

$$\lim_{|\mathcal{S}| \to \infty} \frac{\sum_{s=1}^{|\mathcal{S}|} A_{s,t}^{\mathcal{U}}}{|\mathcal{S}|} = 0. \tag{13}$$

Proof.

The sum $\sum_{s=1}^{|\mathcal{S}|} A_{s,t}^{\mathcal{U}}$ corresponds to the number of users who registered a visit during epoch $t$. Letting $|\mathcal{S}| \to \infty$ corresponds to creating increasingly fine regional partitions within the fixed geographic region. The sum $\sum_{s=1}^{|\mathcal{S}|} A_{s,t}^{\mathcal{U}}$ is invariant to this refinement, since the same users are observed over the same time. As the numerator stays constant while the denominator grows without bound, it follows that $\lim_{|\mathcal{S}| \to \infty} \frac{\sum_{s=1}^{|\mathcal{S}|} A_{s,t}^{\mathcal{U}}}{|\mathcal{S}|} = 0$. ∎
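The argument of Lemma A.3 can be illustrated with a small sketch, assuming a hypothetical epoch in which 500 users each register one visit at a random position in the unit square, and ROIs defined by a $k \times k$ grid. The helper `counts_per_cell` and all values are illustrative assumptions.

```python
import random


def counts_per_cell(points, k):
    """Count points in each cell of a k-by-k grid over the unit square."""
    counts = [0] * (k * k)
    for x, y in points:
        col = min(int(x * k), k - 1)
        row = min(int(y * k), k - 1)
        counts[row * k + col] += 1
    return counts


rng = random.Random(0)
# Hypothetical epoch: 500 users, each registering one visit at a random position.
points = [(rng.random(), rng.random()) for _ in range(500)]

for k in (2, 10, 50):
    counts = counts_per_cell(points, k)
    total = sum(counts)             # unchanged by the refinement
    mean_per_roi = total / (k * k)  # shrinks as the partition gets finer
    print(k * k, total, mean_per_roi)
```

The total number of visits is the same for every grid resolution, while the mean count per ROI decreases as the number of ROIs grows.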

Theorem A.4.

(Convergence of empirical marginals to uniform distribution under $\varepsilon$-DP) Let $\Delta > 0$ be the global sensitivity and suppose that $\varepsilon$-DP is applied on an aggregate release $\overline{A}^{\mathcal{U}} = A_{DP}^{\mathcal{U}}(\varepsilon)$ with post-processed non-negative counts. If the original raw counts $A^{\mathcal{U}}$ are strongly sparse, then the empirical space and time marginals, $\widehat{\mathcal{P}}_S^0$ and $\widehat{\mathcal{P}}_T^0$, each converge to discrete uniform distributions:

  • $\widehat{\mathcal{P}}_S^0 \to Unif(\mathcal{S})$ in distribution as $|\mathcal{T}| \to \infty$

  • $\widehat{\mathcal{P}}_T^0 \to Unif(\mathcal{T})$ in distribution as $|\mathcal{S}| \to \infty$

Proof.

We first consider $\widehat{\mathcal{P}}_S^0$. It suffices to show that as $|\mathcal{T}| \to \infty$, $\widehat{\mathcal{P}}_S^0(s_0) \xrightarrow{\text{a.s.}} \frac{1}{|\mathcal{S}|}$ for each $s_0 \in \mathcal{S}$.

Let $\varepsilon > 0$ and let $b = \frac{\Delta}{\varepsilon}$. Recall that $\varepsilon$-DP with post-processed non-negative counts is obtained by $\overline{A}^{\mathcal{U}}_{s,t} = (A_{s,t}^{\mathcal{U}} + L_{s,t}(b)) \vee 0$, where $A_{s,t}^{\mathcal{U}}$ is the true number of visits by users in $\mathcal{U}$ to $(s,t)$ and $\{L_{s,t} \sim Lap(b) : s \in \mathcal{S}, t \in \mathcal{T}\}$ are i.i.d. Laplace noise samples (Section 2.2.1). By definition,

$$\begin{aligned}
\widehat{\mathcal{P}}_S^0(s_0) &= \frac{\sum_{t=1}^{|\mathcal{T}|} \overline{A}^{\mathcal{U}}_{s_0,t}}{\sum_{s=1}^{|\mathcal{S}|} \sum_{t=1}^{|\mathcal{T}|} \overline{A}^{\mathcal{U}}_{s,t}} \\
&= \frac{\sum_{t=1}^{|\mathcal{T}|} \left((A_{s_0,t}^{\mathcal{U}} + L_{s_0,t}(b)) \vee 0\right)}{\sum_{s=1}^{|\mathcal{S}|} \sum_{t=1}^{|\mathcal{T}|} \left((A_{s,t}^{\mathcal{U}} + L_{s,t}(b)) \vee 0\right)} \\
&= \frac{\frac{\sum_{t=1}^{|\mathcal{T}|} \left((A_{s_0,t}^{\mathcal{U}} + L_{s_0,t}) \vee 0\right)}{|\mathcal{T}|}}{\sum_{s=1}^{|\mathcal{S}|} \frac{\sum_{t=1}^{|\mathcal{T}|} \left((A_{s,t}^{\mathcal{U}} + L_{s,t}) \vee 0\right)}{|\mathcal{T}|}}.
\end{aligned}$$

We now express $(A_{s,t}^{\mathcal{U}} + L_{s,t}) \vee 0 = L_{s,t} \vee 0 + X_{s,t}$, for some $X_{s,t}$, in order to apply Lemma A.5 later. Since $A_{s,t}^{\mathcal{U}} \geq 0$, there are three cases:

$$X_{s,t} = \begin{cases}
A_{s,t}^{\mathcal{U}}, & \text{if } L_{s,t} \geq 0 \\
A_{s,t}^{\mathcal{U}} + L_{s,t}, & \text{if } L_{s,t} < 0 \text{ and } (A_{s,t}^{\mathcal{U}} + L_{s,t}) \vee 0 > 0 \\
0, & \text{if } L_{s,t} < 0 \text{ and } (A_{s,t}^{\mathcal{U}} + L_{s,t}) \vee 0 = 0
\end{cases}$$

We therefore have

$$\widehat{\mathcal{P}}_S^0(s_0) = \frac{\frac{\sum_{t=1}^{|\mathcal{T}|} X_{s_0,t}}{|\mathcal{T}|} + \frac{\sum_{t=1}^{|\mathcal{T}|} L_{s_0,t} \vee 0}{|\mathcal{T}|}}{\sum_{s=1}^{|\mathcal{S}|} \left( \frac{\sum_{t=1}^{|\mathcal{T}|} X_{s,t}}{|\mathcal{T}|} + \frac{\sum_{t=1}^{|\mathcal{T}|} L_{s,t} \vee 0}{|\mathcal{T}|} \right)}. \tag{14}$$

By strong sparsity, for each $s \in \mathcal{S}$,

$$\frac{\sum_{t=1}^{|\mathcal{T}|} A_{s,t}^{\mathcal{U}}}{|\mathcal{T}|} \to 0 \text{ as } |\mathcal{T}| \to \infty. \tag{15}$$

Also, by the Strong Law of Large Numbers, since $\{L_{s,t}\}$ are i.i.d. and $\mathbb{E}[Lap(b)] = 0$, we have

$$\frac{\sum_{t=1}^{|\mathcal{T}|} L_{s,t}}{|\mathcal{T}|} \xrightarrow{\text{a.s.}} 0 \text{ as } |\mathcal{T}| \to \infty.$$

By linearity,

$$\frac{\sum_{t=1}^{|\mathcal{T}|} \left(A_{s,t}^{\mathcal{U}} + L_{s,t}\right)}{|\mathcal{T}|} \xrightarrow{\text{a.s.}} 0 \text{ as } |\mathcal{T}| \to \infty.$$

Hence, since $0 \leq X_{s,t} \leq A_{s,t}^{\mathcal{U}}$ in all three possible cases,

$$\frac{\sum_{t=1}^{|\mathcal{T}|} X_{s,t}}{|\mathcal{T}|} \xrightarrow{\text{a.s.}} 0 \text{ as } |\mathcal{T}| \to \infty.$$

This allows us to simplify

$$\widehat{\mathcal{P}}_S^0(s_0) \xrightarrow{\text{a.s.}} \frac{\frac{\sum_{t=1}^{|\mathcal{T}|} L_{s_0,t} \vee 0}{|\mathcal{T}|}}{\sum_{s=1}^{|\mathcal{S}|} \frac{\sum_{t=1}^{|\mathcal{T}|} L_{s,t} \vee 0}{|\mathcal{T}|}}.$$

Since for all $s, t$, $L_{s,t} \vee 0 \sim Lap(b) \vee 0$, Lemma A.5 implies $\mathbb{E}[L_{s,t} \vee 0] = \frac{b}{2}$. Hence, by the Strong Law of Large Numbers,

$$\frac{\sum_{t=1}^{|\mathcal{T}|} L_{s,t} \vee 0}{|\mathcal{T}|} \xrightarrow{\text{a.s.}} \frac{b}{2} \text{ as } |\mathcal{T}| \to \infty.$$

Finally, for any set of ROIs $\mathcal{S}$ and any $s_0 \in \mathcal{S}$,

$$\widehat{\mathcal{P}}_S^0(s_0) \xrightarrow{\text{a.s.}} \frac{\frac{b}{2}}{\sum_{s=1}^{|\mathcal{S}|} \frac{b}{2}} = \frac{b}{|\mathcal{S}| b} = \frac{1}{|\mathcal{S}|}.$$

A symmetric argument proves $\widehat{\mathcal{P}}_T^0 \to Unif(\mathcal{T})$ in distribution as $|\mathcal{S}| \to \infty$, using Lemma A.3 instead of strong sparsity. ∎

Remark. We note that strong sparsity is assumed in Eq. (15) to prove that $\frac{\sum_{t=1}^{|\mathcal{T}|} X_{s,t}}{|\mathcal{T}|} \xrightarrow{\text{a.s.}} 0$ as $|\mathcal{T}| \to \infty$. Although we expect the oracle average count $\mu_S(s)$ to be very small for most $s \in \mathcal{S}$, due to the sparsity of aggregate location data, it is unlikely that $\mu_S = 0$ holds exactly for real data. Substituting $\mu_S(s)$ in place of $0$ in Eq. (15) will not yield the uniform probability $\widehat{\mathcal{P}}_S^0(s) \xrightarrow{\text{a.s.}} \frac{1}{|\mathcal{S}|}$, but the result will be a close approximation, provided that $\frac{\Delta}{\varepsilon} \gg \mu_S$ and that the number of epochs is large.

In practice, fixed dimensions for $\mathcal{S}$ and $\mathcal{T}$ will prevent the empirical marginals from completely converging to the uniform distribution. This is demonstrated for different noise scales on the Milan dataset (which has $|\mathcal{S}| = 100$ and $|\mathcal{T}| = 168$) in Figure 17.
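The behaviour described by Theorem A.4 and the finite-size caveat above can be probed numerically. The following is a minimal simulation sketch, not the experiment behind Figure 17: it assumes a hypothetical set of 20 ROIs, a noise scale `b` standing in for $\Delta/\varepsilon$, and toy raw counts that are strongly sparse because each ROI is visited in the first epoch only. All function names are illustrative.

```python
import random


def laplace(rng, b):
    """Sample Lap(b) as the difference of two i.i.d. exponentials with mean b."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)


def noisy_space_marginal(raw, b, rng):
    """Empirical space marginal of max(raw + Lap(b), 0), normalised over ROIs."""
    row_sums = [sum(max(a + laplace(rng, b), 0.0) for a in row) for row in raw]
    total = sum(row_sums)
    return [r / total for r in row_sums]


def max_gap_to_uniform(pmf):
    """Largest absolute deviation of the pmf from the uniform value 1/len(pmf)."""
    return max(abs(p - 1.0 / len(pmf)) for p in pmf)


rng = random.Random(1)
n_rois, b = 20, 10.0  # hypothetical sizes; b plays the role of Delta / epsilon
gaps = {}
for n_epochs in (168, 20000):
    # Strongly sparse toy counts: ROI s receives s visits in the first epoch only.
    raw = [[float(s)] + [0.0] * (n_epochs - 1) for s in range(n_rois)]
    gaps[n_epochs] = max_gap_to_uniform(noisy_space_marginal(raw, b, rng))
    print(n_epochs, round(gaps[n_epochs], 4))
```

With many epochs the largest deviation from $\frac{1}{|\mathcal{S}|}$ becomes small, whereas a short observation window such as 168 epochs leaves visible residual structure in the marginal.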

Lemma A.5.

Suppose that $Y = L \vee 0$, with $L \sim Lap(b)$. Then, $Y$ has mean

$$\mathbb{E}[Y] = \frac{b}{2}.$$
Proof.

Let $L \sim Lap(b)$. Then, its probability density function (pdf) $f_L$ is given by

$$f_L(x) = \frac{1}{2b} e^{-\frac{|x|}{b}} \quad \text{for } x \in \mathbb{R},$$

which is symmetric about $x = 0$. Hence, $P(L \leq 0) = \frac{1}{2}$. It follows that $Y = L \vee 0$ has the pdf $f_Y$

$$f_Y(x) = \begin{cases}
0, & \text{for } x < 0 \\
\frac{\delta(x)}{2}, & \text{for } x = 0 \\
\frac{1}{2b} e^{-\frac{x}{b}}, & \text{for } x > 0
\end{cases}$$

where $\delta(x)$ is the Dirac delta function representing the accumulated probability mass at zero. We then evaluate

$$\begin{aligned}
\mathbb{E}[Y] &= \frac{1}{2b} \int_0^{\infty} x e^{-\frac{x}{b}} \, dx \\
&= \frac{1}{2b} \left( -b x e^{-\frac{x}{b}} \bigg|_0^{\infty} + b \int_0^{\infty} e^{-\frac{x}{b}} \, dx \right) \\
&= \frac{1}{2b} \left( -b^2 e^{-\frac{x}{b}} \bigg|_0^{\infty} \right) \\
&= \frac{1}{2b} \left( b^2 \right) = \frac{b}{2}.
\end{aligned}$$ ∎
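The closed form $\mathbb{E}[Y] = \frac{b}{2}$ can be sanity-checked by Monte Carlo. This is a minimal sketch, assuming Laplace noise is sampled as the difference of two i.i.d. exponential variables with mean $b$; the function name, the value of `b`, and the sample size are illustrative choices.

```python
import random


def clipped_laplace_mean(b, n_samples, seed=0):
    """Monte Carlo estimate of E[max(L, 0)] for L ~ Lap(b)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Difference of two i.i.d. exponentials with mean b is Laplace(0, b).
        noise = rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
        total += max(noise, 0.0)
    return total / n_samples


b = 4.0
estimate = clipped_laplace_mean(b, 200_000)
print(estimate, b / 2)  # the estimate should be close to b / 2
```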

Appendix B Algorithms

In this section, we present the main algorithms required to generate synthetic traces from the released aggregate for our ZK MIA.

Algorithm 1 describes how we adapted the unicity model from Farzanehfar et al. (2021) to generate synthetic traces for ZK MIA. We note that the procedure for generating a synthetic trace can also be interpreted as running a Markov chain $\{X_i : i = 1, \dots, n_{visits}\}$ over the state space of spatiotemporal pairs $(s,t) \in C(s_0) \times \mathcal{T}$ with transition probabilities to $(s',t') \in C(s_0) \times \mathcal{T}$ proportional to the product of the pmfs $\mathcal{P}_S(s') \mathcal{P}_T(t')$.
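The Markov chain interpretation can be sketched in a few lines. This is not the authors' implementation: it assumes that visits are drawn independently with replacement, that the subgraph $C(s_0)$ is already given as a list of ROI indices, and that the marginals are plain lists. The name `sample_visits` and all values are hypothetical. Because the transition weights factorise, drawing the ROI and the epoch separately yields pairs with probability proportional to $\mathcal{P}_S(s')\mathcal{P}_T(t')$ on $C(s_0) \times \mathcal{T}$.

```python
import random


def sample_visits(space_pmf, time_pmf, subgraph, n_visits, rng):
    """Draw n_visits (ROI, epoch) pairs with probability proportional to
    space_pmf[s] * time_pmf[t], restricted to ROIs in `subgraph`."""
    rois = list(subgraph)
    roi_weights = [space_pmf[s] for s in rois]  # renormalised implicitly by choices()
    epochs = list(range(len(time_pmf)))
    visits = []
    for _ in range(n_visits):
        s = rng.choices(rois, weights=roi_weights, k=1)[0]
        t = rng.choices(epochs, weights=time_pmf, k=1)[0]
        visits.append((s, t))
    return visits


# Hypothetical marginals over 4 ROIs and 4 epochs, with subgraph C(s0) = [0, 2].
space_pmf = [0.5, 0.2, 0.2, 0.1]
time_pmf = [0.25, 0.25, 0.25, 0.25]
trace = sample_visits(space_pmf, time_pmf, [0, 2], n_visits=5, rng=random.Random(0))
print(trace)
```

Whether repeated $(s,t)$ pairs are kept or discarded is a design choice that this sketch leaves open; here duplicates are simply kept.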

Algorithm 2 estimates the three marginal probability distributions required to run Algorithm 1: the space marginal $\mathcal{P}_S$, the time marginal $\mathcal{P}_T$, and the activity marginal $\mathcal{P}_A$ from an aggregate release $\overline{A}^{\mathcal{U}}$. We estimate the marginals via our denoising and debiasing techniques (from Section 4.3.1), depending on the application of privacy measures on $\overline{A}^{\mathcal{U}}$.
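As a starting point for these estimates, the uncorrected empirical marginals can be read directly off the release. The sketch below follows the definition of $\widehat{\mathcal{P}}_S^0$ used in the proof of Theorem A.4 and assumes the symmetric definition for the time marginal. It deliberately omits the denoising and debiasing steps of Section 4.3.1, and the function name and toy release are hypothetical.

```python
def empirical_marginals(aggregate):
    """Return (space_pmf, time_pmf) from a release indexed as aggregate[s][t]."""
    total = sum(sum(row) for row in aggregate)
    space_pmf = [sum(row) / total for row in aggregate]       # sum over epochs
    time_pmf = [sum(col) / total for col in zip(*aggregate)]  # sum over ROIs
    return space_pmf, time_pmf


# Hypothetical release with 3 ROIs (rows) and 3 epochs (columns).
release = [
    [2, 0, 2],
    [1, 1, 0],
    [0, 2, 0],
]
space_pmf, time_pmf = empirical_marginals(release)
print(space_pmf)  # [0.5, 0.25, 0.25]
print(time_pmf)   # [0.375, 0.375, 0.25]
```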

Algorithm 3 describes our procedure for obtaining an estimate $\hat{\mu}$ of the mean number of visits per user in the dataset, given a privacy-aware aggregate release. Recall that $\mathcal{P}_A$ is set to $Exp(\hat{\mu})$. Algorithm 4 describes our procedure for selecting the degree $p$ of the power transformation used to correct the empirical marginal $\widehat{\mathcal{P}}_{\rule{3.5pt}{0.3pt}}^{0}$ obtained directly from an $\varepsilon$-DP aggregate release.

Algorithm 1 GenerateSyntheticTrace
Inputs:
    $\mathcal{P}_S$: Approximated space marginal over ROIs
    $\mathcal{P}_T$: Approximated time marginal over epochs
    $\mathcal{P}_A$: Approximated activity marginal over trace sizes
    $DT(\mathcal{S})$: Delaunay triangulation of ROIs
Output:
    $L^s$: A synthetic trace
// Sample an origin ROI.
$s_0 \leftarrow$ sample_from_distribution($\mathcal{P}_S$, $1$)
// Use $DT(\mathcal{S})$ to create a connected subgraph of ROIs including the origin ROI.
// The default value n_rois_subgraph $= 10$ is taken from Farzanehfar et al. (2021).
$C(s_0) \leftarrow$ generate_connected_subgraph($s_0$, $DT(\mathcal{S})$, n_rois_subgraph $= 10$)
// Normalize $\mathcal{P}_S$ restricted to $C(s_0)$.
$\mathcal{P}_{C(s_0)} \leftarrow$ normalize(restrict($\mathcal{P}_S$, $C(s_0)$))
// Sample the trace size (number of visits).
n_visits $\leftarrow$ round(sample_from_distribution($\mathcal{P}_A$, $1$))
// Randomly sample n_visits ROIs and epochs with replacement.
ROIs $\leftarrow$ sample_from_distribution($\mathcal{P}_{C(s_0)}$, n_visits)
epochs $\leftarrow$ sample_from_distribution($\mathcal{P}_T$, n_visits)
return $L^s \leftarrow$ [ (ROIs[i], epochs[i]) for i = 1 … n_visits ]
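The steps of Algorithm 1 can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than a reference implementation: the Delaunay triangulation is replaced by a precomputed adjacency dictionary `neighbors`, the connected subgraph is grown by breadth-first search (the growth rule is an assumption), and the activity marginal is taken to be an exponential distribution with mean `mean_visits`. All function and parameter names are hypothetical.

```python
import random
from collections import deque


def connected_subgraph(origin, neighbors, n_rois=10):
    """Grow a connected set of ROIs around `origin` by breadth-first search.

    Assumption: BFS is only one plausible way to build C(s_0).
    """
    selected = [origin]
    queue = deque([origin])
    while queue and len(selected) < n_rois:
        for nxt in neighbors[queue.popleft()]:
            if nxt not in selected and len(selected) < n_rois:
                selected.append(nxt)
                queue.append(nxt)
    return selected


def generate_synthetic_trace(p_space, p_time, mean_visits, neighbors, rng, n_rois=10):
    """Return a list of (ROI, epoch) visits following the steps of Algorithm 1."""
    rois, epochs = list(p_space), list(p_time)
    # Sample an origin ROI from the space marginal.
    origin = rng.choices(rois, weights=[p_space[r] for r in rois])[0]
    # Restrict the space marginal to a connected neighbourhood of the origin.
    subgraph = connected_subgraph(origin, neighbors, n_rois)
    sub_weights = [p_space[r] for r in subgraph]  # choices() normalizes weights
    # Sample the trace size from an exponential with mean `mean_visits`, then round.
    n_visits = round(rng.expovariate(1.0 / mean_visits))
    # Sample ROIs and epochs independently, with replacement.
    visited = rng.choices(subgraph, weights=sub_weights, k=n_visits)
    times = rng.choices(epochs, weights=[p_time[t] for t in epochs], k=n_visits)
    return list(zip(visited, times))
```

Because ROIs and epochs are drawn independently at every step, consecutive visits do not depend on each other, which matches the Markov chain interpretation with transition probabilities proportional to $\mathcal{P}_S(s')\mathcal{P}_T(t')$.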
Algorithm 2 Approximate Marginals From Aggregate
Inputs:
    $\overline{A}^{\mathcal{U}}$: Released aggregate
    $m$: Aggregate group size
    $p$: Specified probability distribution family
Output:
    $\mathcal{P}_S$: Approximated space marginal over ROIs
    $\mathcal{P}_T$: Approximated time marginal over epochs
    $\mathcal{P}_A$: Approximated activity marginal over trace sizes
// Compute direct estimates.
$\mathcal{P}_S, \mathcal{P}_T \leftarrow$ compute_empirical_marginals($\overline{A}^{\mathcal{U}}$)
$\mu_{visits}^{0} \leftarrow$ sum_entries($\overline{A}^{\mathcal{U}}$)$/m$
if $\overline{A}^{\mathcal{U}} = A^{\mathcal{U}}$ then
    // Return direct estimates if no privacy.
    $\mathcal{P}_A \leftarrow$ fit_dist($p$, $\mu_{visits}^{0}$)
    return $\mathcal{P}_S, \mathcal{P}_T, \mathcal{P}_A$
end if
if $\overline{A}^{\mathcal{U}} = A_{SSC}^{\mathcal{U}}(k)$ and $k > 0$ then
    // Apply log compression if SSC.
    $\mathcal{P}_S, \mathcal{P}_T \leftarrow$ log_compression($\mathcal{P}_S, \mathcal{P}_T$)
end if
if $\overline{A}^{\mathcal{U}} = A_{DP}^{\mathcal{U}}(\varepsilon)$ or $\overline{A}^{\mathcal{U}} = A_{DP,SSC}^{\mathcal{U}}(\varepsilon, k)$ then
    // Apply power transformation if DP.
    $\mathcal{P}_S, \mathcal{P}_T \leftarrow$ power_transform($\mathcal{P}_S, \mathcal{P}_T$)
end if
// Apply Algorithm 3 from Appendix.
$\mu_{visits} \leftarrow$ estimate_mean($\mu_{visits}^{0}$, $\overline{A}^{\mathcal{U}}$, $m$, $\mathcal{P}_S, \mathcal{P}_T$, $k$, $\varepsilon$)
$\mathcal{P}_A \leftarrow$ fit_dist($p$, $\mu_{visits}$)
return $\mathcal{P}_S, \mathcal{P}_T, \mathcal{P}_A$
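The direct estimates computed at the start of Algorithm 2 can be sketched as follows. This minimal example assumes the aggregate is stored as a nested list indexed as `aggregate[roi][epoch]` with non-negative counts; the name `empirical_marginals` is hypothetical, and the log compression and power transformation steps are deliberately left out because their definitions live in Section 4.3.1.

```python
def empirical_marginals(aggregate, m):
    """Direct estimates of the space marginal, time marginal, and mean visits.

    `aggregate[roi][epoch]` holds visit counts summed over the m users.
    """
    total = sum(sum(row) for row in aggregate)
    # Space marginal: share of all visits that falls in each ROI (row sums).
    p_space = [sum(row) / total for row in aggregate]
    # Time marginal: share of all visits that falls in each epoch (column sums).
    p_time = [sum(col) / total for col in zip(*aggregate)]
    # Initial estimate of the mean number of visits per user.
    mu0 = total / m
    return p_space, p_time, mu0
```

For a toy aggregate `[[2, 0, 2], [1, 1, 2]]` over `m = 4` users, this gives a space marginal of `[0.5, 0.5]`, a time marginal of `[0.375, 0.125, 0.5]`, and an initial mean of `2.0` visits per user.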
Algorithm 3 EstimateMean
Inputs:
    $\mu_0$: Initial guess for mean visits
    $A$: Released aggregate
    $DT(\mathcal{S})$: Delaunay triangulation of ROIs
    $\mathcal{P}_s$: Estimated space marginal
    $\mathcal{P}_t$: Estimated time marginal
    $k$: Suppression threshold
    $\varepsilon, \Delta$: DP parameters
    $m$: Aggregate group size
Additional parameters:
    tol: Tolerance for stopping
    max_iter: Max iterations
Output:
    $\mu$: Approximated mean visits per user
$\mu \leftarrow \mu_0$
for $i = 1$ to max_iter do
    $A_1 \leftarrow$ initialize_matrix()
    // Create a synthetic aggregate of size $m$.
    for $j = 1$ to $m$ do
        // Generate a synthetic trace with $\mu$ visits.
        $A_1 \leftarrow A_1 +$ generate_synthetic_trace($\mathcal{P}_s$, $\mathcal{P}_t$, $\mu$, $DT(\mathcal{S})$)
    end for
    $A_1 \leftarrow$ apply_privacy_measures($A_1$, $k$, $\varepsilon$, $\Delta$)
    // Increase or decrease the estimate $\mu$ accordingly.
    $\mu \leftarrow \mu_0 +$ (sum($A$) $-$ sum($A_1$))$/m$
    if $|\mu - \mu_0| < tol$ then return $\mu$
    end if
end for
return $\mu$
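The idea behind Algorithm 3, simulating a privacy-protected synthetic aggregate and correcting the mean until its total matches the released total, can be sketched in Python. The sketch makes several simplifying assumptions that are not taken from the paper: only suppression of small counts is simulated (entries at or below $k$ are set to zero), the synthetic aggregate is built by dropping `round(m * mu)` visits directly into the matrix rather than generating $m$ individual traces, and the correction is applied to the running estimate so that the loop stops once the synthetic and released totals agree. All names are hypothetical.

```python
import random


def simulate_aggregate(p_space, p_time, n_visits, rng):
    """Drop n_visits visits into a ROI x epoch count matrix."""
    counts = [[0] * len(p_time) for _ in p_space]
    rois = rng.choices(range(len(p_space)), weights=p_space, k=n_visits)
    epochs = rng.choices(range(len(p_time)), weights=p_time, k=n_visits)
    for s, t in zip(rois, epochs):
        counts[s][t] += 1
    return counts


def suppress(counts, k):
    """Assumed SSC rule: entries at or below the threshold k are set to zero."""
    return [[c if c > k else 0 for c in row] for row in counts]


def total(counts):
    return sum(sum(row) for row in counts)


def estimate_mean(released, m, p_space, p_time, k, rng, tol=0.05, max_iter=20):
    """Adjust mu until a suppressed synthetic aggregate matches the released total."""
    target = total(released)
    mu = target / m  # direct estimate, biased low when entries are suppressed
    for _ in range(max_iter):
        n_visits = max(0, round(m * mu))
        synthetic = suppress(simulate_aggregate(p_space, p_time, n_visits, rng), k)
        step = (target - total(synthetic)) / m
        mu += step  # raise mu if the synthetic aggregate is too small, else lower it
        if abs(step) < tol:
            break
    return mu
```

In this toy setting the direct estimate underestimates the mean because suppressed entries are missing from the released total, and the iteration compensates by simulating the same loss on synthetic data.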
Algorithm 4 pSelection
Inputs:
    $\sigma_0$: Reference variance
    $\mathcal{P}$: Space or time marginal to be modified
    $\epsilon_{tol}$: Error tolerance
Output:
    $p$: Degree for transformation $x^p$ that sets the variance of $\mathcal{P}$ to approximately match $\sigma_0$
// Compute the variance of the original marginal.
$\sigma \leftarrow$ compute_variance($\mathcal{P}$)
$p \leftarrow 1$
while $|\sigma_0 - \sigma| > \epsilon_{tol}$ do
    // Increment the power $p$ until the estimate is in range.
    $p \leftarrow p + 0.01$
    $\sigma \leftarrow$ compute_variance(pow($\mathcal{P}$, $p$))
end while
return $p$
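A minimal Python sketch of the search in Algorithm 4 follows. It assumes that the variance is taken over the probability values of the marginal and that the transformed marginal is renormalized after applying $x^p$; both choices, the safety cap `p_max`, and all names are assumptions made for illustration.

```python
def variance(values):
    """Population variance of a list of probability values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def power_transform(pmf, p):
    """Raise every probability to the power p and renormalize (assumed)."""
    powered = [v ** p for v in pmf]
    z = sum(powered)
    return [v / z for v in powered]


def select_power(pmf, ref_variance, tol, step=0.01, p_max=10.0):
    """Increase p in small steps until the variance is within tol of the reference."""
    p = 1.0
    sigma = variance(pmf)
    # p_max guards against an endless loop when tol is too small for the step size.
    while abs(ref_variance - sigma) > tol and p < p_max:
        p += step
        sigma = variance(power_transform(pmf, p))
    return p
```

As a sanity check, flattening a marginal with `power_transform(pmf, 0.5)` and then searching for the power that restores the original variance should return a value close to $2$.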

Appendix C Accuracy Results

In this section, we present the accuracy scores of ZK MIA and KK MIA for the experiments on suppression of small counts and $\varepsilon$-DP noise addition.

Table 3 presents the accuracy scores obtained by ZK and KK from the experiments on suppression of small counts from Section 6.1. Table 4 presents the accuracy scores obtained by ZK and KK from the experiments on event-level $\varepsilon$-DP from Section 6.2. Table 5 presents the accuracy scores obtained by ZK and KK from the experiments on user-day-level $\varepsilon$-DP.

We observe that the accuracy scores of KK and ZK are close in each experiment, as observed already with the AUC metric in the main text.
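For readers comparing the accuracy tables with the AUC results, the two metrics can be computed from attack scores as sketched below. This is a generic illustration, not the evaluation code used for the tables: the decision threshold of $0.5$ and the function names are assumptions, and the AUC is computed as the fraction of (member, non-member) pairs that the attack ranks correctly, with ties counted as one half.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of correct membership predictions at a fixed score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc(scores, labels):
    """Probability that a random member is scored above a random non-member."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Accuracy depends on the chosen threshold, whereas AUC only depends on how the scores rank members against non-members, so the two metrics can differ for the same attack.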

Table 3. Mean accuracy scores with standard error for KK and ZK on aggregates of size $1000$ from the CDR and Milan datasets across various suppression thresholds $k$.
| $k$ | CDR, KK | CDR, ZK | Milan, KK | Milan, ZK |
| --- | --- | --- | --- | --- |
| 0 | 0.980 ± 0.012 | 0.991 ± 0.008 | 0.990 ± 0.005 | 0.990 ± 0.003 |
| 1 | 0.907 ± 0.028 | 0.879 ± 0.025 | 0.767 ± 0.018 | 0.700 ± 0.017 |
| 2 | 0.807 ± 0.031 | 0.827 ± 0.026 | 0.683 ± 0.019 | 0.631 ± 0.013 |
| 3 | 0.685 ± 0.032 | 0.687 ± 0.031 | 0.600 ± 0.016 | 0.550 ± 0.009 |
| 4 | 0.597 ± 0.024 | 0.603 ± 0.027 | 0.566 ± 0.011 | 0.512 ± 0.003 |
| 5 | 0.543 ± 0.018 | 0.528 ± 0.019 | 0.536 ± 0.010 | 0.500 ± 0.000 |
Table 4. Mean accuracy scores with standard error for KK and ZK on aggregates of size $1000$ from the CDR and Milan datasets across various privacy budgets $\varepsilon$ for event-level DP.
| $\varepsilon$ | CDR, KK | CDR, ZK | Milan, KK | Milan, ZK |
| --- | --- | --- | --- | --- |
| 0.1 | 0.588 ± 0.019 | 0.555 ± 0.014 | 0.539 ± 0.006 | 0.549 ± 0.007 |
| 0.5 | 0.848 ± 0.028 | 0.791 ± 0.019 | 0.744 ± 0.010 | 0.634 ± 0.007 |
| 1.0 | 0.920 ± 0.026 | 0.907 ± 0.018 | 0.850 ± 0.016 | 0.594 ± 0.008 |
| 5.0 | 0.906 ± 0.035 | 0.934 ± 0.019 | 0.881 ± 0.022 | 0.660 ± 0.018 |
| 10.0 | 0.923 ± 0.029 | 0.934 ± 0.018 | 0.920 ± 0.021 | 0.671 ± 0.019 |
Table 5. Mean accuracy scores with standard error for KK and ZK on aggregates of size $1000$ from the CDR and Milan datasets across various privacy budgets $\varepsilon$ for user-day-level DP.
| $\varepsilon$ | CDR, KK | CDR, ZK | Milan, KK | Milan, ZK |
| --- | --- | --- | --- | --- |
| 0.1 | 0.502 ± 0.014 | 0.502 ± 0.010 | 0.497 ± 0.006 | 0.496 ± 0.006 |
| 0.5 | 0.508 ± 0.014 | 0.526 ± 0.016 | 0.517 ± 0.006 | 0.519 ± 0.006 |
| 1.0 | 0.533 ± 0.014 | 0.539 ± 0.014 | 0.534 ± 0.006 | 0.544 ± 0.007 |
| 5.0 | 0.723 ± 0.020 | 0.676 ± 0.018 | 0.746 ± 0.009 | 0.680 ± 0.008 |
| 10.0 | 0.874 ± 0.025 | 0.825 ± 0.019 | 0.870 ± 0.014 | 0.777 ± 0.018 |

Appendix D Additional Experiments

D.1. Varying the size of the aggregate

Since ZK MIA requires the estimation of statistics from the aggregate, there may be concerns about its performance when the aggregate size is small. However, like previous MIAs, ZK MIA performs more effectively on smaller aggregates than on larger ones. This is shown in Figure 11 for aggregate sizes $m = 100, 250, 500, 1000$ and different privacy budgets $\varepsilon$.

To further understand how MIA performance scales with the aggregate size $m$, we also consider $m > 1000$ in this experiment. To this end, we vary $m = 100, 500, 1000, 2000, 3000$ and compare the performance of KK MIA and ZK MIA on raw ($k = 0$) and suppressed ($k = 1$) aggregates. Results on the CDR dataset are reported in Tables 6 and 8, and results on the Milan dataset are reported in Tables 7 and 9. We did not run $m = 3000$ on the Milan dataset due to size limitations.

In these settings with mild privacy protection, the attacks always succeed regardless of the value of $m$. We also observe a few intuitive trends. First, when raw aggregates ($k = 0$) are attacked, increasing the size of the aggregates slowly decreases the performance of the attack. On the CDR dataset, KK and ZK attain AUCs of $0.999$ and $1.0$ for $m = 100$, which decrease to $0.919$ and $0.977$ for $m = 3000$. Second, when we apply suppression with $k = 1$, the attacks initially perform poorly when the aggregate size is small. We hypothesize that this is because a larger percentage of entries is suppressed when fewer traces are aggregated, leaving less information in the release. This effect gradually decreases as the aggregate size increases. It is then counterbalanced by the first effect, namely that increasing the size of the aggregates slowly decreases the performance of the attack. This is visible for $m \leq 1000$ in the CDR dataset. For the Milan dataset, however, the AUC still increases monotonically even beyond $m = 1000$, because the dataset is more sensitive to suppression: the average user has approximately $6$ times fewer visits, as shown in Figure 14(b).
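The hypothesis about suppression can be checked on any aggregate by measuring the share of nonzero entries that fall at or below the threshold. The sketch below is illustrative only: it assumes that suppression of small counts zeroes every entry with a count of at most $k$, which is consistent with $k = 0$ denoting a raw aggregate, and the function name is hypothetical.

```python
def suppressed_fraction(counts, k):
    """Share of nonzero entries that suppression with threshold k would remove.

    Assumption: an entry is suppressed when its count is at most k.
    """
    nonzero = [c for row in counts for c in row if c > 0]
    if not nonzero:
        return 0.0
    return sum(c <= k for c in nonzero) / len(nonzero)
```

On the toy matrix `[[0, 1, 2], [3, 1, 0]]` with `k = 1`, half of the nonzero entries are suppressed, whereas multiplying every count by five leaves no entry at or below the threshold. This mirrors how aggregating more traces lifts counts above $k$.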

Table 6. Mean AUCs of KK and ZK MIAs for $k = 0$ on the CDR dataset with varying $m$.
| $m$ | KK | ZK |
| --- | --- | --- |
| 100 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| 500 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| 1000 | 1.000 ± 0.000 | 0.999 ± 0.001 |
| 2000 | 0.997 ± 0.003 | 0.994 ± 0.006 |
| 3000 | 0.988 ± 0.011 | 0.977 ± 0.021 |
Table 7. Mean AUCs of KK and ZK MIAs for $k = 0$ on the Milan dataset with varying $m$.
| $m$ | KK | ZK |
| --- | --- | --- |
| 100 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| 500 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| 1000 | 0.995 ± 0.002 | 1.000 ± 0.000 |
| 2000 | 0.977 ± 0.011 | 1.000 ± 0.000 |
Table 8. Mean AUCs of KK and ZK MIAs for $k = 1$ on the CDR dataset with varying $m$.
| $m$ | KK | ZK |
| --- | --- | --- |
| 100 | 0.856 ± 0.019 | 0.779 ± 0.041 |
| 500 | 0.961 ± 0.008 | 0.987 ± 0.003 |
| 1000 | 0.981 ± 0.007 | 0.976 ± 0.010 |
| 2000 | 0.979 ± 0.010 | 0.965 ± 0.013 |
| 3000 | 0.973 ± 0.011 | 0.938 ± 0.016 |
Table 9. Mean AUCs of KK and ZK MIAs for $k = 1$ on the Milan dataset with varying $m$.
| $m$ | KK | ZK |
| --- | --- | --- |
| 100 | 0.756 ± 0.020 | 0.701 ± 0.046 |
| 500 | 0.889 ± 0.018 | 0.885 ± 0.025 |
| 1000 | 0.916 ± 0.015 | 0.919 ± 0.016 |
| 2000 | 0.981 ± 0.009 | 0.972 ± 0.007 |
Figure 11. Mean AUC scores with standard error for ZK on event-level $\varepsilon$-DP for varying privacy budgets $\varepsilon$ on aggregates of varying sizes from the Milan dataset.

D.2. Increasing the size of ZK synthetic reference

Figure 12 illustrates how increasing the number of synthetic traces available to the attacker improves the MIA’s performance, with diminishing returns.

Figure 12. AUC of the Zero Auxiliary Knowledge MIA for different numbers of synthetic traces generated in the setting $m = 1000$, $k = 1$, and event-level $\varepsilon$-DP with $\varepsilon = 1$.

D.3. No time information

In this experiment, we assume that the adversary only has access to some of the locations that the target has visited, without knowing the epochs during which the visits occurred. For example, the adversary may know the target’s home and work locations. To model this attack setting, we suppose that the adversary knows either the target’s top-$K$ most visited ROIs, for $K = 1, 2, 3$, or the full set of ROIs visited by the target during the observation period. In our first implementation, which we call "greedy", the adversary assumes that the target visits each known ROI during every epoch in the observation period. This ensures that the visits to these ROIs are reflected in the target trace, but it also introduces many incorrect visits. Results are presented in Table 11. In our second implementation, which we call "random sampling", the adversary distributes the target’s visits uniformly across the known ROIs. For example, if the adversary knows the top-$3$ ROIs, and their estimate for the mean number of visits per user is $\mu$, then they would sample $\frac{\mu}{3}$ visits for each of the top-$3$ ROIs. The corresponding epochs for each visit are sampled from the estimated time marginal. For simplicity, we assume that $\mu$ is the true mean number of visits and that the estimated time marginal is the true one. Results on raw aggregates of size $m = 1000$ are presented in Table 10.
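The two ways of building an approximate target trace can be sketched in Python as follows. The sketch is illustrative: the function names are hypothetical, the time marginal is passed as a dictionary from epoch to probability, and the number of visits per known ROI is rounded to the nearest integer, which is an assumption.

```python
import random


def greedy_trace(known_rois, epochs):
    """Greedy approach: assume a visit to every known ROI during every epoch."""
    return [(s, t) for s in known_rois for t in epochs]


def random_sampling_trace(known_rois, p_time, mean_visits, rng):
    """Random sampling: split mean_visits evenly across the known ROIs and
    draw the epoch of each visit from the time marginal."""
    per_roi = round(mean_visits / len(known_rois))
    epochs = list(p_time)
    weights = [p_time[t] for t in epochs]
    trace = []
    for s in known_rois:
        for t in rng.choices(epochs, weights=weights, k=per_roi):
            trace.append((s, t))
    return trace
```

With two known ROIs and $24$ epochs, the greedy trace contains $48$ visits, most of which are likely incorrect, whereas random sampling with $\mu = 30$ and three known ROIs yields $10$ visits per ROI at sampled epochs.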

Table 10 shows that both MIAs perform poorly ($AUC < 0.63$) when the adversary uses random sampling to approximate the target trace. This suggests that random sampling fails to approximate the target trace, due to the omission of true target visits and the inclusion of incorrect visits.

In contrast, Table 11 shows that KK was able to perform significantly better than random when the adversary knew more than $2$ of the target’s most visited ROIs and used the greedy implementation (e.g., $AUC = 0.86$ on Milan when knowing all visited ROIs). This suggests that, although the greedy implementation includes many incorrect visits, the guaranteed inclusion of some of the target’s actual visits enables membership inference to an extent. ZK, on the other hand, fails to attain an AUC of $0.6$. Since ZK already replaces real individual traces with synthetic traces, we hypothesize that membership inference becomes too difficult if the estimated target trace contains significantly incorrect information.

We note, however, that our current implementation for sampling the visits under this prior knowledge might be suboptimal and that better implementations might exist. For example, Zhang et al. (2020) use a synthetic target trace built from social network information and the traces of the target’s friends. We leave this exploration for future work.

| Dataset | KK, Top 1 | KK, Top 2 | KK, Top 3 | KK, All | ZK, Top 1 | ZK, Top 2 | ZK, Top 3 | ZK, All |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CDR | 0.542 ± 0.04 | 0.538 ± 0.03 | 0.531 ± 0.02 | 0.525 ± 0.02 | 0.524 ± 0.03 | 0.528 ± 0.02 | 0.515 ± 0.02 | 0.510 ± 0.02 |
| Milan | 0.576 ± 0.02 | 0.628 ± 0.03 | 0.614 ± 0.03 | 0.560 ± 0.03 | 0.518 ± 0.02 | 0.553 ± 0.02 | 0.568 ± 0.02 | 0.556 ± 0.04 |
Table 10. Mean AUC scores with standard error for Knock-Knock (KK) and Zero Auxiliary Knowledge (ZK) on raw aggregates of size $m = 1000$ when the adversary only knows some of the target’s visited ROIs and employs the random sampling approach of distributing random visits uniformly across each known ROI.
| Dataset | KK, Top 1 | KK, Top 2 | KK, Top 3 | KK, All | ZK, Top 1 | ZK, Top 2 | ZK, Top 3 | ZK, All |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CDR | 0.571 ± 0.05 | 0.611 ± 0.05 | 0.647 ± 0.06 | 0.825 ± 0.05 | 0.512 ± 0.02 | 0.527 ± 0.01 | 0.514 ± 0.02 | 0.516 ± 0.01 |
| Milan | 0.682 ± 0.03 | 0.764 ± 0.03 | 0.822 ± 0.03 | 0.860 ± 0.02 | 0.542 ± 0.02 | 0.524 ± 0.02 | 0.542 ± 0.02 | 0.545 ± 0.02 |
Table 11. Mean AUC scores with standard error for Knock-Knock (KK) and Zero Auxiliary Knowledge (ZK) on raw aggregates of size $m = 1000$ when the adversary only knows some of the target’s visited ROIs and employs the greedy approach of assuming that the target visits each known ROI during every epoch.

Appendix E Additional Plots

We present additional figures demonstrating statistics related to the location datasets.

Figure 13. Time marginal from a raw aggregate over $m = 1000$ users from the CDR dataset.
(a) CDR dataset
(b) Milan
Figure 14. Number of visits per target over the $50$ targets.
(a) CDR dataset
(b) Milan
Figure 15. Percentage of nonzero entries that are suppressed in an aggregate of size $m = 1000$ after undergoing SSC with threshold $k$.
(a) Space Marginal
(b) Time Marginal
(c) Activity Marginal
(d) Delaunay Triangulation
Figure 16. The four statistical parameters for the unicity model by Farzanehfar et al. (2021) include marginal distributions in Figures 16(a)-16(c) (the dataset’s true marginal distributions are shown in red) and the Delaunay triangulation of ROIs in Figure 16(d).
(a) Milan space marginals estimated from $\varepsilon$-DP aggregates
(b) Milan time marginals estimated from $\varepsilon$-DP aggregates
Figure 17. The space and time marginals directly obtained from $\varepsilon$-DP aggregates over $m = 1000$ users from the Milan dataset, plotted for different noise scales $\frac{\Delta}{\varepsilon}$. Interestingly, the distribution does not converge to a uniform distribution as the noise scale increases, due to the increasing variance of $Lap(\frac{\Delta}{\varepsilon})$.
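To make the role of the noise scale concrete, the sketch below adds independent Laplace noise with scale $b = \frac{\Delta}{\varepsilon}$ to each entry of an aggregate. It relies on two standard facts: the difference of two independent unit-rate exponential variables follows a standard Laplace distribution, and $Lap(b)$ has variance $2b^2$. The function names are hypothetical, and any post-processing of the noisy counts (for example clipping negative values) is intentionally left out, since it is not specified here.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Lap(scale) as a scaled difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_aggregate(counts, sensitivity, epsilon, rng):
    """Add independent Lap(sensitivity / epsilon) noise to every entry."""
    scale = sensitivity / epsilon
    return [[c + laplace_noise(scale, rng) for c in row] for row in counts]
```

Halving $\varepsilon$ doubles the scale $b$ and therefore quadruples the per-entry noise variance, which is why marginals estimated from aggregates with small $\varepsilon$ are dominated by noise.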