
Counting Distinct Elements in the Turnstile Model with Differential Privacy under Continual Observation

Authors: Palak Jain, Iden Kalemaj, Sofya Raskhodnikova, Satchit Sivakumar, Adam Smith

Abstract: Privacy is a central challenge for systems that learn from sensitive data sets, especially when a system's outputs must be continuously updated to reflect changing data. We consider the achievable error for differentially private continual release of a basic statistic - the number of distinct items - in a stream where items may be both inserted and deleted (the turnstile model). With only insertions, existing algorithms have additive error just polylogarithmic in the length of the stream $T$. We uncover a much richer landscape in the turnstile model, even without considering memory restrictions. We show that every differentially private mechanism that handles insertions and deletions has worst-case additive error at least $T^{1/4}$ even under a relatively weak, event-level privacy definition. Then, we identify a parameter of the input stream, its maximum flippancy, that is low for natural data streams and for which we give tight parameterized error guarantees. Specifically, the maximum flippancy is the largest number of times that the contribution of a single item to the distinct elements count changes over the course of the stream. We present an item-level differentially private mechanism that, for all turnstile streams with maximum flippancy $w$, continually outputs the number of distinct elements with an $O(\sqrt{w} \cdot poly\log T)$ additive error, without requiring prior knowledge of $w$. We prove that this is the best achievable error bound that depends only on $w$, for a large range of values of $w$. When $w$ is small, the error of our mechanism is similar to the polylogarithmic in $T$ error in the insertion-only setting, bypassing the hardness in the turnstile model.

Submitted 10 July, 2024; v1 submitted 11 June, 2023; originally announced June 2023.

Comments: NeurIPS 2023

arXiv:2306.06721

Differentially Private Conditional Independence Testing

Authors: Iden Kalemaj, Shiva Prasad Kasiviswanathan, Aaditya Ramdas

Abstract: Conditional independence (CI) tests are widely used in statistical data analysis, e.g., they are the building block of many algorithms for causal graph discovery. The goal of a CI test is to accept or reject the null hypothesis that $X \perp \!\!\! \perp Y \mid Z$, where $X \in \mathbb{R}, Y \in \mathbb{R}, Z \in \mathbb{R}^d$. In this work, we investigate conditional independence testing under the constraint of differential privacy. We design two private CI testing procedures: one based on the generalized covariance measure of Shah and Peters (2020) and another based on the conditional randomization test of Candès et al. (2016) (under the model-X assumption). We provide theoretical guarantees on the performance of our tests and validate them empirically. These are the first private CI tests with rigorous theoretical guarantees that work for the general case when $Z$ is continuous.

Submitted 22 March, 2024; v1 submitted 11 June, 2023; originally announced June 2023.

arXiv:2304.05890

DOI: 10.1145/3584372.3588671

Node-Differentially Private Estimation of the Number of Connected Components

Authors: Iden Kalemaj, Sofya Raskhodnikova, Adam Smith, Charalampos E. Tsourakakis

Abstract: We design the first node-differentially private algorithm for approximating the number of connected components in a graph. Given a database representing an $n$-vertex graph $G$ and a privacy parameter $\varepsilon$, our algorithm runs in polynomial time and, with probability $1-o(1)$, has additive error $\widetilde{O}(\frac{Δ^*\ln\ln n}{\varepsilon}),$ where $Δ^*$ is the smallest possible maximum degree of a spanning forest of $G.$ Node-differentially private algorithms are known only for a small number of database analysis tasks. A major obstacle for designing such an algorithm for the number of connected components is that this graph statistic is not robust to adding one node with arbitrary connections (a change that node-differential privacy is designed to hide): every graph is a neighbor of a connected graph. We overcome this by designing a family of efficiently computable Lipschitz extensions of the number of connected components or, equivalently, the size of a spanning forest. The construction of the extensions, which is at the core of our algorithm, is based on the forest polytope of $G.$ We prove several combinatorial facts about spanning forests, in particular, that a graph with no induced $Δ$-stars has a spanning forest of degree at most $Δ$. With this fact, we show that our Lipschitz extensions for the number of connected components equal the true value of the function for the largest possible monotone families of graphs. More generally, on all monotone sets of graphs, the $\ell_\infty$ error of our Lipschitz extensions is nearly optimal.

Submitted 13 April, 2023; v1 submitted 12 April, 2023; originally announced April 2023.

Journal ref: In Proceedings of the ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS) 2023

arXiv:2109.08745

Sublinear-Time Computation in the Presence of Online Erasures

Authors: Iden Kalemaj, Sofya Raskhodnikova, Nithin Varma

Abstract: We initiate the study of sublinear-time algorithms that access their input via an online adversarial erasure oracle. After answering each query to the input object, such an oracle can erase $t$ input values. Our goal is to understand the complexity of basic computational tasks in extremely adversarial situations, where the algorithm's access to data is blocked during the execution of the algorithm in response to its actions. Specifically, we focus on property testing in the model with online erasures. We show that two fundamental properties of functions, linearity and quadraticity, can be tested for constant $t$ with asymptotically the same complexity as in the standard property testing model. For linearity testing, we prove tight bounds in terms of $t$, showing that the query complexity is $Θ(\log t)$. In contrast to linearity and quadraticity, some other properties, including sortedness and the Lipschitz property of sequences, cannot be tested at all, even for $t=1$. Our investigation leads to a deeper understanding of the structure of violations of linearity and other widely studied properties. We also consider implications of our results for algorithms that are resilient to online adversarial corruptions instead of erasures.

Submitted 22 November, 2022; v1 submitted 17 September, 2021; originally announced September 2021.

Comments: 31 pages, 1 figure

Journal ref: Proceedings, Innovations in Theoretical Computer Science (ITCS), 2022

arXiv:2011.09441

Isoperimetric Inequalities for Real-Valued Functions with Applications to Monotonicity Testing

Authors: Hadley Black, Iden Kalemaj, Sofya Raskhodnikova

Abstract: We generalize the celebrated isoperimetric inequality of Khot, Minzer, and Safra (SICOMP 2018) for Boolean functions to the case of real-valued functions $f \colon \{0,1\}^d\to\mathbb{R}$. Our main tool in the proof of the generalized inequality is a new Boolean decomposition that represents every real-valued function $f$ over an arbitrary partially ordered domain as a collection of Boolean functions over the same domain, roughly capturing the distance of $f$ to monotonicity and the structure of violations of $f$ to monotonicity. We apply our generalized isoperimetric inequality to improve algorithms for testing monotonicity and approximating the distance to monotonicity for real-valued functions. Our tester for monotonicity has query complexity $\widetilde{O}(\min(r \sqrt{d},d))$, where $r$ is the size of the image of the input function. (The best previously known tester, by Chakrabarty and Seshadhri (STOC 2013), makes $O(d)$ queries.) Our tester is nonadaptive and has 1-sided error. We show a matching lower bound for nonadaptive, 1-sided error testers for monotonicity. We also show that the distance to monotonicity of real-valued functions that are $α$-far from monotone can be approximated nonadaptively within a factor of $O(\sqrt{d\log d})$ with query complexity polynomial in $1/α$ and the dimension $d$. This query complexity is known to be nearly optimal for nonadaptive algorithms even for the special case of Boolean functions. (The best previously known distance approximation algorithm for real-valued functions, by Fattal and Ron (TALG 2010), achieves $O(d\log r)$-approximation.)

Submitted 18 November, 2020; originally announced November 2020.

arXiv:2011.03885

Performative Prediction in a Stateful World

Authors: Gavin Brown, Shlomi Hod, Iden Kalemaj

Abstract: Deployed supervised machine learning models make predictions that interact with and influence the world. This phenomenon is called performative prediction by Perdomo et al. (ICML 2020). It is an ongoing challenge to understand the influence of such predictions as well as design tools so as to control that influence. We propose a theoretical framework where the response of a target population to the deployed classifier is modeled as a function of the classifier and the current state (distribution) of the population. We show necessary and sufficient conditions for convergence to an equilibrium of two retraining algorithms, repeated risk minimization and a lazier variant. Furthermore, convergence is near an optimal classifier. We thus generalize results of Perdomo et al., whose performativity framework does not assume any dependence on the state of the target population. A particular phenomenon captured by our model is that of distinct groups that acquire information and resources at different rates to be able to respond to the latest deployed classifier. We study this phenomenon theoretically and empirically.

Submitted 22 February, 2022; v1 submitted 7 November, 2020; originally announced November 2020.

Comments: Accepted paper to AISTATS 2022. An earlier version appeared at the Workshop on Consequential Decision Making in Dynamic Environments, NeurIPS 2020

Showing 1–6 of 6 results for author: Kalemaj, I