Search | arXiv e-print repository

Public-data Assisted Private Stochastic Optimization: Power and Limitations

Authors: Enayat Ullah, Michael Menart, Raef Bassily, Cristóbal Guzmán, Raman Arora

Abstract: We study the limits and capability of public-data assisted differentially private (PA-DP) algorithms. Specifically, we focus on the problem of stochastic convex optimization (SCO) with either labeled or unlabeled public data. For complete/labeled public data, we show that any $(ε,δ)$-PA-DP has excess risk… ▽ More We study the limits and capability of public-data assisted differentially private (PA-DP) algorithms. Specifically, we focus on the problem of stochastic convex optimization (SCO) with either labeled or unlabeled public data. For complete/labeled public data, we show that any $(ε,δ)$-PA-DP has excess risk $\tildeΩ\big(\min\big\{\frac{1}{\sqrt{n_{\text{pub}}}},\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{nε} \big\} \big)$, where $d$ is the dimension, ${n_{\text{pub}}}$ is the number of public samples, ${n_{\text{priv}}}$ is the number of private samples, and $n={n_{\text{pub}}}+{n_{\text{priv}}}$. These lower bounds are established via our new lower bounds for PA-DP mean estimation, which are of a similar form. Up to constant factors, these lower bounds show that the simple strategy of either treating all data as private or discarding the private data, is optimal. We also study PA-DP supervised learning with \textit{unlabeled} public samples. In contrast to our previous result, we here show novel methods for leveraging public data in private supervised learning. For generalized linear models (GLM) with unlabeled public data, we show an efficient algorithm which, given $\tilde{O}({n_{\text{priv}}}ε)$ unlabeled public samples, achieves the dimension independent rate $\tilde{O}\big(\frac{1}{\sqrt{n_{\text{priv}}}} + \frac{1}{\sqrt{n_{\text{priv}}ε}}\big)$. We develop new lower bounds for this setting which shows that this rate cannot be improved with more public samples, and any fewer public samples leads to a worse rate. Finally, we provide extensions of this result to general hypothesis classes with finite fat-shattering dimension with applications to neural networks and non-Euclidean geometries. △ Less

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2402.19437 [pdf, ps, other]

Differentially Private Worst-group Risk Minimization

Authors: Xinyu Zhou, Raef Bassily

Abstract: We initiate a systematic study of worst-group risk minimization under $(ε, δ)$-differential privacy (DP). The goal is to privately find a model that approximately minimizes the maximal risk across $p$ sub-populations (groups) with different distributions, where each group distribution is accessed via a sample oracle. We first present a new algorithm that achieves excess worst-group population risk… ▽ More We initiate a systematic study of worst-group risk minimization under $(ε, δ)$-differential privacy (DP). The goal is to privately find a model that approximately minimizes the maximal risk across $p$ sub-populations (groups) with different distributions, where each group distribution is accessed via a sample oracle. We first present a new algorithm that achieves excess worst-group population risk of $\tilde{O}(\frac{p\sqrt{d}}{Kε} + \sqrt{\frac{p}{K}})$, where $K$ is the total number of samples drawn from all groups and $d$ is the problem dimension. Our rate is nearly optimal when each distribution is observed via a fixed-size dataset of size $K/p$. Our result is based on a new stability-based analysis for the generalization error. In particular, we show that $Δ$-uniform argument stability implies $\tilde{O}(Δ+ \frac{1}{\sqrt{n}})$ generalization error w.r.t. the worst-group risk, where $n$ is the number of samples drawn from each sample oracle. Next, we propose an algorithmic framework for worst-group population risk minimization using any DP online convex optimization algorithm as a subroutine. Hence, we give another excess risk bound of $\tilde{O}\left( \sqrt{\frac{d^{1/2}}{εK}} +\sqrt{\frac{p}{Kε^2}} \right)$. Assuming the typical setting of $ε=Θ(1)$, this bound is more favorable than our first bound in a certain range of $p$ as a function of $K$ and $d$. Finally, we study differentially private worst-group empirical risk minimization in the offline setting, where each group distribution is observed by a fixed-size dataset. We present a new algorithm with nearly optimal excess risk of $\tilde{O}(\frac{p\sqrt{d}}{Kε})$. △ Less

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2311.13447 [pdf, ps, other]

Differentially Private Non-Convex Optimization under the KL Condition with Optimal Rates

Authors: Michael Menart, Enayat Ullah, Raman Arora, Raef Bassily, Cristóbal Guzmán

Abstract: We study private empirical risk minimization (ERM) problem for losses satisfying the $(γ,κ)$-Kurdyka-Łojasiewicz (KL) condition. The Polyak-Łojasiewicz (PL) condition is a special case of this condition when $κ=2$. Specifically, we study this problem under the constraint of $ρ$ zero-concentrated differential privacy (zCDP). When $κ\in[1,2]$ and the loss function is Lipschitz and smooth over a suff… ▽ More We study private empirical risk minimization (ERM) problem for losses satisfying the $(γ,κ)$-Kurdyka-Łojasiewicz (KL) condition. The Polyak-Łojasiewicz (PL) condition is a special case of this condition when $κ=2$. Specifically, we study this problem under the constraint of $ρ$ zero-concentrated differential privacy (zCDP). When $κ\in[1,2]$ and the loss function is Lipschitz and smooth over a sufficiently large region, we provide a new algorithm based on variance reduced gradient descent that achieves the rate $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrtρ}\big)^κ\big)$ on the excess empirical risk, where $n$ is the dataset size and $d$ is the dimension. We further show that this rate is nearly optimal. When $κ\geq 2$ and the loss is instead Lipschitz and weakly convex, we show it is possible to achieve the rate $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrtρ}\big)^κ\big)$ with a private implementation of the proximal point method. When the KL parameters are unknown, we provide a novel modification and analysis of the noisy gradient descent algorithm and show that this algorithm achieves a rate of $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrtρ}\big)^{\frac{2κ}{4-κ}}\big)$ adaptively, which is nearly optimal when $κ= 2$. We further show that, without assuming the KL condition, the same gradient descent algorithm can achieve fast convergence to a stationary point when the gradient stays sufficiently large during the run of the algorithm. Specifically, we show that this algorithm can approximate stationary points of Lipschitz, smooth (and possibly nonconvex) objectives with rate as fast as $\tilde{O}\big(\frac{\sqrt{d}}{n\sqrtρ}\big)$ and never worse than $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrtρ}\big)^{1/2}\big)$. The latter rate matches the best known rate for methods that do not rely on variance reduction. △ Less

Submitted 3 April, 2024; v1 submitted 22 November, 2023; originally announced November 2023.

arXiv:2306.08838 [pdf, other]

Differentially Private Domain Adaptation with Theoretical Guarantees

Authors: Raef Bassily, Corinna Cortes, Anqi Mao, Mehryar Mohri

Abstract: In many applications, the labeled data at the learner's disposal is subject to privacy constraints and is relatively limited. To derive a more accurate predictor for the target domain, it is often beneficial to leverage publicly available labeled data from an alternative domain, somewhat close to the target domain. This is the modern problem of supervised domain adaptation from a public source to… ▽ More In many applications, the labeled data at the learner's disposal is subject to privacy constraints and is relatively limited. To derive a more accurate predictor for the target domain, it is often beneficial to leverage publicly available labeled data from an alternative domain, somewhat close to the target domain. This is the modern problem of supervised domain adaptation from a public source to a private target domain. We present two $(ε, δ)$-differentially private adaptation algorithms for supervised adaptation, for which we make use of a general optimization problem, recently shown to benefit from favorable theoretical learning guarantees. Our first algorithm is designed for regression with linear predictors and shown to solve a convex optimization problem. Our second algorithm is a more general solution for loss functions that may be non-convex but Lipschitz and smooth. While our main objective is a theoretical analysis, we also report the results of several experiments first demonstrating that the non-private versions of our algorithms outperform adaptation baselines and next showing that, for larger values of the target sample size or $ε$, the performance of our private algorithms remains close to that of the non-private formulation. △ Less

Submitted 4 February, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

arXiv:2302.12909 [pdf, ps, other]

Differentially Private Algorithms for the Stochastic Saddle Point Problem with Optimal Rates for the Strong Gap

Authors: Raef Bassily, Cristóbal Guzmán, Michael Menart

Abstract: We show that convex-concave Lipschitz stochastic saddle point problems (also known as stochastic minimax optimization) can be solved under the constraint of $(ε,δ)$-differential privacy with \emph{strong (primal-dual) gap} rate of $\tilde O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{nε}\big)$, where $n$ is the dataset size and $d$ is the dimension of the problem. This rate is nearly optimal, based o… ▽ More We show that convex-concave Lipschitz stochastic saddle point problems (also known as stochastic minimax optimization) can be solved under the constraint of $(ε,δ)$-differential privacy with \emph{strong (primal-dual) gap} rate of $\tilde O\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{nε}\big)$, where $n$ is the dataset size and $d$ is the dimension of the problem. This rate is nearly optimal, based on existing lower bounds in differentially private stochastic optimization. Specifically, we prove a tight upper bound on the strong gap via novel implementation and analysis of the recursive regularization technique repurposed for saddle point problems. We show that this rate can be attained with $O\big(\min\big\{\frac{n^2ε^{1.5}}{\sqrt{d}}, n^{3/2}\big\}\big)$ gradient complexity, and $\tilde{O}(n)$ gradient complexity if the loss function is smooth. As a byproduct of our method, we develop a general algorithm that, given a black-box access to a subroutine satisfying a certain $α$ primal-dual accuracy guarantee with respect to the empirical objective, gives a solution to the stochastic saddle point problem with a strong gap of $\tilde{O}(α+\frac{1}{\sqrt{n}})$. We show that this $α$-accuracy condition is satisfied by standard algorithms for the empirical saddle point problem such as the proximal point method and the stochastic gradient descent ascent algorithm. Further, we show that even for simple problems it is possible for an algorithm to have zero weak gap and suffer from $Ω(1)$ strong gap. We also show that there exists a fundamental tradeoff between stability and accuracy. Specifically, we show that any $Δ$-stable algorithm has empirical gap $Ω\big(\frac{1}{Δn}\big)$, and that this bound is tight. This result also holds also more specifically for empirical risk minimization problems and may be of independent interest. △ Less

Submitted 29 June, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

arXiv:2208.06135 [pdf, other]

Private Domain Adaptation from a Public Source

Authors: Raef Bassily, Mehryar Mohri, Ananda Theertha Suresh

Abstract: A key problem in a variety of applications is that of domain adaptation from a public source domain, for which a relatively large amount of labeled data with no privacy constraints is at one's disposal, to a private target domain, for which a private sample is available with very few or no labeled data. In regression problems with no privacy constraints on the source or target data, a discrepancy… ▽ More A key problem in a variety of applications is that of domain adaptation from a public source domain, for which a relatively large amount of labeled data with no privacy constraints is at one's disposal, to a private target domain, for which a private sample is available with very few or no labeled data. In regression problems with no privacy constraints on the source or target data, a discrepancy minimization algorithm based on several theoretical guarantees was shown to outperform a number of other adaptation algorithm baselines. Building on that approach, we design differentially private discrepancy-based algorithms for adaptation from a source domain with public labeled data to a target domain with unlabeled private data. The design and analysis of our private algorithms critically hinge upon several key properties we prove for a smooth approximation of the weighted discrepancy, such as its smoothness with respect to the $\ell_1$-norm and the sensitivity of its gradient. Our solutions are based on private variants of Frank-Wolfe and Mirror-Descent algorithms. We show that our adaptation algorithms benefit from strong generalization and privacy guarantees and report the results of experiments demonstrating their effectiveness. △ Less

Submitted 12 August, 2022; originally announced August 2022.

arXiv:2206.00846 [pdf, ps, other]

Faster Rates of Convergence to Stationary Points in Differentially Private Optimization

Authors: Raman Arora, Raef Bassily, Tomás González, Cristóbal Guzmán, Michael Menart, Enayat Ullah

Abstract: We study the problem of approximating stationary points of Lipschitz and smooth functions under $(\varepsilon,δ)$-differential privacy (DP) in both the finite-sum and stochastic settings. A point $\widehat{w}$ is called an $α$-stationary point of a function $F:\mathbb{R}^d\rightarrow\mathbb{R}$ if $\|\nabla F(\widehat{w})\|\leq α$. We provide a new efficient algorithm that finds an… ▽ More We study the problem of approximating stationary points of Lipschitz and smooth functions under $(\varepsilon,δ)$-differential privacy (DP) in both the finite-sum and stochastic settings. A point $\widehat{w}$ is called an $α$-stationary point of a function $F:\mathbb{R}^d\rightarrow\mathbb{R}$ if $\|\nabla F(\widehat{w})\|\leq α$. We provide a new efficient algorithm that finds an $\tilde{O}\big(\big[\frac{\sqrt{d}}{n\varepsilon}\big]^{2/3}\big)$-stationary point in the finite-sum setting, where $n$ is the number of samples. This improves on the previous best rate of $\tilde{O}\big(\big[\frac{\sqrt{d}}{n\varepsilon}\big]^{1/2}\big)$. We also give a new construction that improves over the existing rates in the stochastic optimization setting, where the goal is to find approximate stationary points of the population risk. Our construction finds a $\tilde{O}\big(\frac{1}{n^{1/3}} + \big[\frac{\sqrt{d}}{n\varepsilon}\big]^{1/2}\big)$-stationary point of the population risk in time linear in $n$. Furthermore, under the additional assumption of convexity, we completely characterize the sample complexity of finding stationary points of the population risk (up to polylog factors) and show that the optimal rate on population stationarity is $\tilde Θ\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\varepsilon}\big)$. Finally, we show that our methods can be used to provide dimension-independent rates of $O\big(\frac{1}{\sqrt{n}}+\min\big(\big[\frac{\sqrt{rank}}{n\varepsilon}\big]^{2/3},\frac{1}{(n\varepsilon)^{2/5}}\big)\big)$ on population stationarity for Generalized Linear Models (GLM), where $rank$ is the rank of the design matrix, which improves upon the previous best known rate. △ Less

Submitted 30 May, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

arXiv:2205.03014 [pdf, ps, other]

Differentially Private Generalized Linear Models Revisited

Authors: Raman Arora, Raef Bassily, Cristóbal Guzmán, Michael Menart, Enayat Ullah

Abstract: We study the problem of $(ε,δ)$-differentially private learning of linear predictors with convex losses. We provide results for two subclasses of loss functions. The first case is when the loss is smooth and non-negative but not necessarily Lipschitz (such as the squared loss). For this case, we establish an upper bound on the excess population risk of… ▽ More We study the problem of $(ε,δ)$-differentially private learning of linear predictors with convex losses. We provide results for two subclasses of loss functions. The first case is when the loss is smooth and non-negative but not necessarily Lipschitz (such as the squared loss). For this case, we establish an upper bound on the excess population risk of $\tilde{O}\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^* \Vert^2}{(nε)^{2/3}},\frac{\sqrt{d}\Vert w^*\Vert^2}{nε}\right\}\right)$, where $n$ is the number of samples, $d$ is the dimension of the problem, and $w^*$ is the minimizer of the population risk. Apart from the dependence on $\Vert w^\ast\Vert$, our bound is essentially tight in all parameters. In particular, we show a lower bound of $\tildeΩ\left(\frac{1}{\sqrt{n}} + {\min\left\{\frac{\Vert w^*\Vert^{4/3}}{(nε)^{2/3}}, \frac{\sqrt{d}\Vert w^*\Vert}{nε}\right\}}\right)$. We also revisit the previously studied case of Lipschitz losses [SSTT20]. For this case, we close the gap in the existing work and show that the optimal rate is (up to log factors) $Θ\left(\frac{\Vert w^*\Vert}{\sqrt{n}} + \min\left\{\frac{\Vert w^*\Vert}{\sqrt{nε}},\frac{\sqrt{\text{rank}}\Vert w^*\Vert}{nε}\right\}\right)$, where $\text{rank}$ is the rank of the design matrix. This improves over existing work in the high privacy regime. Finally, our algorithms involve a private model selection approach that we develop to enable attaining the stated rates without a-priori knowledge of $\Vert w^*\Vert$. △ Less

Submitted 6 March, 2024; v1 submitted 6 May, 2022; originally announced May 2022.

arXiv:2204.10376 [pdf, other]

Differentially Private Learning with Margin Guarantees

Authors: Raef Bassily, Mehryar Mohri, Ananda Theertha Suresh

Abstract: We present a series of new differentially private (DP) algorithms with dimension-independent margin guarantees. For the family of linear hypotheses, we give a pure DP learning algorithm that benefits from relative deviation margin guarantees, as well as an efficient DP learning algorithm with margin guarantees. We also present a new efficient DP learning algorithm with margin guarantees for kernel… ▽ More We present a series of new differentially private (DP) algorithms with dimension-independent margin guarantees. For the family of linear hypotheses, we give a pure DP learning algorithm that benefits from relative deviation margin guarantees, as well as an efficient DP learning algorithm with margin guarantees. We also present a new efficient DP learning algorithm with margin guarantees for kernel-based hypotheses with shift-invariant kernels, such as Gaussian kernels, and point out how our results can be extended to other kernels using oblivious sketching techniques. We further give a pure DP learning algorithm for a family of feed-forward neural networks for which we prove margin guarantees that are independent of the input dimension. Additionally, we describe a general label DP learning algorithm, which benefits from relative deviation margin bounds and is applicable to a broad family of hypothesis sets, including that of neural networks. Finally, we show how our DP learning algorithms can be augmented in a general way to include model selection, to select the best confidence margin parameter. △ Less

Submitted 21 April, 2022; originally announced April 2022.

arXiv:2107.05585 [pdf, ps, other]

Differentially Private Stochastic Optimization: New Results in Convex and Non-Convex Settings

Authors: Raef Bassily, Cristóbal Guzmán, Michael Menart

Abstract: We study differentially private stochastic optimization in convex and non-convex settings. For the convex case, we focus on the family of non-smooth generalized linear losses (GLLs). Our algorithm for the $\ell_2$ setting achieves optimal excess population risk in near-linear time, while the best known differentially private algorithms for general convex losses run in super-linear time. Our algori… ▽ More We study differentially private stochastic optimization in convex and non-convex settings. For the convex case, we focus on the family of non-smooth generalized linear losses (GLLs). Our algorithm for the $\ell_2$ setting achieves optimal excess population risk in near-linear time, while the best known differentially private algorithms for general convex losses run in super-linear time. Our algorithm for the $\ell_1$ setting has nearly-optimal excess population risk $\tilde{O}\big(\sqrt{\frac{\log{d}}{n\varepsilon}}\big)$, and circumvents the dimension dependent lower bound of \cite{Asi:2021} for general non-smooth convex losses. In the differentially private non-convex setting, we provide several new algorithms for approximating stationary points of the population risk. For the $\ell_1$-case with smooth losses and polyhedral constraint, we provide the first nearly dimension independent rate, $\tilde O\big(\frac{\log^{2/3}{d}}{(n\varepsilon)^{1/3}}\big)$ in linear time. For the constrained $\ell_2$-case with smooth losses, we obtain a linear-time algorithm with rate $\tilde O\big(\frac{1}{n^{1/3}}+\frac{d^{1/5}}{(n\varepsilon)^{2/5}}\big)$. Finally, for the $\ell_2$-case we provide the first method for {\em non-smooth weakly convex} stochastic optimization with rate $\tilde O\big(\frac{1}{n^{1/4}}+\frac{d^{1/6}}{(n\varepsilon)^{1/3}}\big)$ which matches the best existing non-private algorithm when $d= O(\sqrt{n})$. We also extend all our results above for the non-convex $\ell_2$ setting to the $\ell_p$ setting, where $1 < p \leq 2$, with only polylogarithmic (in the dimension) overhead in the rates. △ Less

Submitted 10 November, 2021; v1 submitted 12 July, 2021; originally announced July 2021.

arXiv:2103.01278 [pdf, ps, other]

Non-Euclidean Differentially Private Stochastic Convex Optimization: Optimal Rates in Linear Time

Authors: Raef Bassily, Cristóbal Guzmán, Anupama Nandi

Abstract: Differentially private (DP) stochastic convex optimization (SCO) is a fundamental problem, where the goal is to approximately minimize the population risk with respect to a convex loss function, given a dataset of $n$ i.i.d. samples from a distribution, while satisfying differential privacy with respect to the dataset. Most of the existing works in the literature of private convex optimization foc… ▽ More Differentially private (DP) stochastic convex optimization (SCO) is a fundamental problem, where the goal is to approximately minimize the population risk with respect to a convex loss function, given a dataset of $n$ i.i.d. samples from a distribution, while satisfying differential privacy with respect to the dataset. Most of the existing works in the literature of private convex optimization focus on the Euclidean (i.e., $\ell_2$) setting, where the loss is assumed to be Lipschitz (and possibly smooth) w.r.t. the $\ell_2$ norm over a constraint set with bounded $\ell_2$ diameter. Algorithms based on noisy stochastic gradient descent (SGD) are known to attain the optimal excess risk in this setting. In this work, we conduct a systematic study of DP-SCO for $\ell_p$-setups under a standard smoothness assumption on the loss. For $1< p\leq 2$, under a standard smoothness assumption, we give a new, linear-time DP-SCO algorithm with optimal excess risk. Previously known constructions with optimal excess risk for $1< p <2$ run in super-linear time in $n$. For $p=1$, we give an algorithm with nearly optimal excess risk. Our result for the $\ell_1$-setup also extends to general polyhedral norms and feasible sets. Moreover, we show that the excess risk bounds resulting from our algorithms for $1\leq p \leq 2$ are attained with high probability. For $2 < p \leq \infty$, we show that existing linear-time constructions for the Euclidean setup attain a nearly optimal excess risk in the low-dimensional regime. As a consequence, we show that such constructions attain a nearly optimal excess risk for $p=\infty$. Our work draws upon concepts from the geometry of normed spaces, such as the notions of regularity, uniform convexity, and uniform smoothness. △ Less

Submitted 4 May, 2022; v1 submitted 1 March, 2021; originally announced March 2021.

Comments: This version contains several extensions to the conference paper that appeared at COLT 2021 (and to the earlier arXiv version: arXiv:2103.01278v1). This version contains new, linear-time constructions with optimal, high-probability risk guarantees

arXiv:2008.00331 [pdf, ps, other]

Learning from Mixtures of Private and Public Populations

Authors: Raef Bassily, Shay Moran, Anupama Nandi

Abstract: We initiate the study of a new model of supervised learning under privacy constraints. Imagine a medical study where a dataset is sampled from a population of both healthy and unhealthy individuals. Suppose healthy individuals have no privacy concerns (in such case, we call their data "public") while the unhealthy individuals desire stringent privacy protection for their data. In this example, the… ▽ More We initiate the study of a new model of supervised learning under privacy constraints. Imagine a medical study where a dataset is sampled from a population of both healthy and unhealthy individuals. Suppose healthy individuals have no privacy concerns (in such case, we call their data "public") while the unhealthy individuals desire stringent privacy protection for their data. In this example, the population (data distribution) is a mixture of private (unhealthy) and public (healthy) sub-populations that could be very different. Inspired by the above example, we consider a model in which the population $\mathcal{D}$ is a mixture of two sub-populations: a private sub-population $\mathcal{D}_{\sf priv}$ of private and sensitive data, and a public sub-population $\mathcal{D}_{\sf pub}$ of data with no privacy concerns. Each example drawn from $\mathcal{D}$ is assumed to contain a privacy-status bit that indicates whether the example is private or public. The goal is to design a learning algorithm that satisfies differential privacy only with respect to the private examples. Prior works in this context assumed a homogeneous population where private and public data arise from the same distribution, and in particular designed solutions which exploit this assumption. We demonstrate how to circumvent this assumption by considering, as a case study, the problem of learning linear classifiers in $\mathbb{R}^d$. We show that in the case where the privacy status is correlated with the target label (as in the above example), linear classifiers in $\mathbb{R}^d$ can be learned, in the agnostic as well as the realizable setting, with sample complexity which is comparable to that of the classical (non-private) PAC-learning. It is known that this task is impossible if all the data is considered private. △ Less

Submitted 1 August, 2020; originally announced August 2020.

arXiv:2006.06914 [pdf, ps, other]

Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses

Authors: Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, Kunal Talwar

Abstract: Uniform stability is a notion of algorithmic stability that bounds the worst case change in the model output by the algorithm when a single data point in the dataset is replaced. An influential work of Hardt et al. (2016) provides strong upper bounds on the uniform stability of the stochastic gradient descent (SGD) algorithm on sufficiently smooth convex losses. These results led to important prog… ▽ More Uniform stability is a notion of algorithmic stability that bounds the worst case change in the model output by the algorithm when a single data point in the dataset is replaced. An influential work of Hardt et al. (2016) provides strong upper bounds on the uniform stability of the stochastic gradient descent (SGD) algorithm on sufficiently smooth convex losses. These results led to important progress in understanding of the generalization properties of SGD and several applications to differentially private convex optimization for smooth losses. Our work is the first to address uniform stability of SGD on {\em nonsmooth} convex losses. Specifically, we provide sharp upper and lower bounds for several forms of SGD and full-batch GD on arbitrary Lipschitz nonsmooth convex losses. Our lower bounds show that, in the nonsmooth case, (S)GD can be inherently less stable than in the smooth case. On the other hand, our upper bounds show that (S)GD is sufficiently stable for deriving new and useful bounds on generalization error. Most notably, we obtain the first dimension-independent generalization bounds for multi-pass SGD in the nonsmooth case. In addition, our bounds allow us to derive a new algorithm for differentially private nonsmooth stochastic convex optimization with optimal excess population risk. Our algorithm is simpler and more efficient than the best known algorithm for the nonsmooth case Feldman et al. (2020). △ Less

Submitted 11 June, 2020; originally announced June 2020.

Comments: 32 pages

MSC Class: 90-08 ACM Class: F.2.1; G.1.6; G.3

arXiv:2004.10941 [pdf, other]

Private Query Release Assisted by Public Data

Authors: Raef Bassily, Albert Cheu, Shay Moran, Aleksandar Nikolov, Jonathan Ullman, Zhiwei Steven Wu

Abstract: We study the problem of differentially private query release assisted by access to public data. In this problem, the goal is to answer a large class $\mathcal{H}$ of statistical queries with error no more than $α$ using a combination of public and private samples. The algorithm is required to satisfy differential privacy only with respect to the private samples. We study the limits of this task in… ▽ More We study the problem of differentially private query release assisted by access to public data. In this problem, the goal is to answer a large class $\mathcal{H}$ of statistical queries with error no more than $α$ using a combination of public and private samples. The algorithm is required to satisfy differential privacy only with respect to the private samples. We study the limits of this task in terms of the private and public sample complexities. First, we show that we can solve the problem for any query class $\mathcal{H}$ of finite VC-dimension using only $d/α$ public samples and $\sqrt{p}d^{3/2}/α^2$ private samples, where $d$ and $p$ are the VC-dimension and dual VC-dimension of $\mathcal{H}$, respectively. In comparison, with only private samples, this problem cannot be solved even for simple query classes with VC-dimension one, and without any private samples, a larger public sample of size $d/α^2$ is needed. Next, we give sample complexity lower bounds that exhibit tight dependence on $p$ and $α$. For the class of decision stumps, we give a lower bound of $\sqrt{p}/α$ on the private sample complexity whenever the public sample size is less than $1/α^2$. Given our upper bounds, this shows that the dependence on $\sqrt{p}$ is necessary in the private sample complexity. We also give a lower bound of $1/α$ on the public sample complexity for a broad family of query classes, which by our upper bound, is tight in $α$. △ Less

Submitted 22 April, 2020; originally announced April 2020.

arXiv:1910.11519 [pdf, other]

Limits of Private Learning with Access to Public Data

Authors: Noga Alon, Raef Bassily, Shay Moran

Abstract: We consider learning problems where the training set consists of two types of examples: private and public. The goal is to design a learning algorithm that satisfies differential privacy only with respect to the private examples. This setting interpolates between private learning (where all examples are private) and classical learning (where all examples are public). We study the limits of learn… ▽ More We consider learning problems where the training set consists of two types of examples: private and public. The goal is to design a learning algorithm that satisfies differential privacy only with respect to the private examples. This setting interpolates between private learning (where all examples are private) and classical learning (where all examples are public). We study the limits of learning in this setting in terms of private and public sample complexities. We show that any hypothesis class of VC-dimension $d$ can be agnostically learned up to an excess error of $α$ using only (roughly) $d/α$ public examples and $d/α^2$ private labeled examples. This result holds even when the public examples are unlabeled. This gives a quadratic improvement over the standard $d/α^2$ upper bound on the public sample complexity (where private examples can be ignored altogether if the public examples are labeled). Furthermore, we give a nearly matching lower bound, which we prove via a generic reduction from this setting to the one of private learning without public data. △ Less

Submitted 25 October, 2019; originally announced October 2019.

Journal ref: NeurIPS 2019

arXiv:1908.09970 [pdf, other]

Private Stochastic Convex Optimization with Optimal Rates

Authors: Raef Bassily, Vitaly Feldman, Kunal Talwar, Abhradeep Thakurta

Abstract: We study differentially private (DP) algorithms for stochastic convex optimization (SCO). In this problem the goal is to approximately minimize the population loss given i.i.d. samples from a distribution over convex and Lipschitz loss functions. A long line of existing work on private convex optimization focuses on the empirical loss and derives asymptotically tight bounds on the excess empirical… ▽ More We study differentially private (DP) algorithms for stochastic convex optimization (SCO). In this problem the goal is to approximately minimize the population loss given i.i.d. samples from a distribution over convex and Lipschitz loss functions. A long line of existing work on private convex optimization focuses on the empirical loss and derives asymptotically tight bounds on the excess empirical loss. However a significant gap exists in the known bounds for the population loss. We show that, up to logarithmic factors, the optimal excess population loss for DP algorithms is equal to the larger of the optimal non-private excess population loss, and the optimal excess empirical loss of DP algorithms. This implies that, contrary to intuition based on private ERM, private SCO has asymptotically the same rate of $1/\sqrt{n}$ as non-private SCO in the parameter regime most common in practice. The best previous result in this setting gives rate of $1/n^{1/4}$. Our approach builds on existing differentially private algorithms and relies on the analysis of algorithmic stability to ensure generalization. △ Less

Submitted 26 August, 2019; originally announced August 2019.

arXiv:1907.13553 [pdf, ps, other]

Privately Answering Classification Queries in the Agnostic PAC Model

Authors: Anupama Nandi, Raef Bassily

Abstract: We revisit the problem of differentially private release of classification queries. In this problem, the goal is to design an algorithm that can accurately answer a sequence of classification queries based on a private training set while ensuring differential privacy. We formally study this problem in the agnostic PAC model and derive a new upper bound on the private sample complexity. Our results… ▽ More We revisit the problem of differentially private release of classification queries. In this problem, the goal is to design an algorithm that can accurately answer a sequence of classification queries based on a private training set while ensuring differential privacy. We formally study this problem in the agnostic PAC model and derive a new upper bound on the private sample complexity. Our results improve over those obtained in a recent work [BTT18] for the agnostic PAC setting. In particular, we give an improved construction that yields a tighter upper bound on the sample complexity. Moreover, unlike [BTT18], our accuracy guarantee does not involve any blow-up in the approximation error associated with the given hypothesis class. Given any hypothesis class with VC-dimension $d$, we show that our construction can privately answer up to $m$ classification queries with average excess error $α$ using a private sample of size $\approx \frac{d}{α^2}\,\max\left(1, \sqrt{m}\,α^{3/2}\right)$. Using recent results on private learning with auxiliary public data, we extend our construction to show that one can privately answer any number of classification queries with average excess error $α$ using a private sample of size $\approx \frac{d}{α^2}\,\max\left(1, \sqrt{d}\,α\right)$. When $α=O\left(\frac{1}{\sqrt{d}}\right)$, our private sample complexity bound is essentially optimal. △ Less

Submitted 3 December, 2019; v1 submitted 31 July, 2019; originally announced July 2019.

Comments: Made a a small tweak in the analysis to save a factor of $1/ε$

arXiv:1811.02564 [pdf, ps, other]

On exponential convergence of SGD in non-convex over-parametrized learning

Authors: Raef Bassily, Mikhail Belkin, Siyuan Ma

Abstract: Large over-parametrized models learned via stochastic gradient descent (SGD) methods have become a key element in modern machine learning. Although SGD methods are very effective in practice, most theoretical analyses of SGD suggest slower convergence than what is empirically observed. In our recent work [8] we analyzed how interpolation, common in modern over-parametrized learning, results in exp… ▽ More Large over-parametrized models learned via stochastic gradient descent (SGD) methods have become a key element in modern machine learning. Although SGD methods are very effective in practice, most theoretical analyses of SGD suggest slower convergence than what is empirically observed. In our recent work [8] we analyzed how interpolation, common in modern over-parametrized learning, results in exponential convergence of SGD with constant step size for convex loss functions. In this note, we extend those results to a much broader non-convex function class satisfying the Polyak-Lojasiewicz (PL) condition. A number of important non-convex problems in machine learning, including some classes of neural networks, have been recently shown to satisfy the PL condition. We argue that the PL condition provides a relevant and attractive setting for many machine learning problems, particularly in the over-parametrized regime. △ Less

Submitted 5 November, 2018; originally announced November 2018.

arXiv:1810.02810 [pdf, other]

Linear Queries Estimation with Local Differential Privacy

Authors: Raef Bassily

Abstract: We study the problem of estimating a set of $d$ linear queries with respect to some unknown distribution $\mathbf{p}$ over a domain $\mathcal{J}=[J]$ based on a sensitive data set of $n$ individuals under the constraint of local differential privacy. This problem subsumes a wide range of estimation tasks, e.g., distribution estimation and $d$-dimensional mean estimation. We provide new algorithms… ▽ More We study the problem of estimating a set of $d$ linear queries with respect to some unknown distribution $\mathbf{p}$ over a domain $\mathcal{J}=[J]$ based on a sensitive data set of $n$ individuals under the constraint of local differential privacy. This problem subsumes a wide range of estimation tasks, e.g., distribution estimation and $d$-dimensional mean estimation. We provide new algorithms for both the offline (non-adaptive) and adaptive versions of this problem. In the offline setting, the set of queries are fixed before the algorithm starts. In the regime where $n\lesssim d^2/\log(J)$, our algorithms attain $L_2$ estimation error that is independent of $d$, and is tight up to a factor of $\tilde{O}\left(\log^{1/4}(J)\right)$. For the special case of distribution estimation, we show that projecting the output estimate of an algorithm due to [Acharya et al. 2018] on the probability simplex yields an $L_2$ error that depends only sub-logarithmically on $J$ in the regime where $n\lesssim J^2/\log(J)$. These results show the possibility of accurate estimation of linear queries in the high-dimensional settings under the $L_2$ error criterion. In the adaptive setting, the queries are generated over $d$ rounds; one query at a time. In each round, a query can be chosen adaptively based on all the history of previous queries and answers. We give an algorithm for this problem with optimal $L_{\infty}$ estimation error (worst error in the estimated values for the queries w.r.t. the data distribution). Our bound matches a lower bound on the $L_{\infty}$ error for the offline version of this problem [Duchi et al. 2013]. △ Less

Submitted 5 October, 2018; originally announced October 2018.

arXiv:1803.05101 [pdf, ps, other]

Model-Agnostic Private Learning via Stability

Authors: Raef Bassily, Om Thakkar, Abhradeep Thakurta

Abstract: We design differentially private learning algorithms that are agnostic to the learning model. Our algorithms are interactive in nature, i.e., instead of outputting a model based on the training data, they provide predictions for a set of $m$ feature vectors that arrive online. We show that, for the feature vectors on which an ensemble of models (trained on random disjoint subsets of a dataset) mak… ▽ More We design differentially private learning algorithms that are agnostic to the learning model. Our algorithms are interactive in nature, i.e., instead of outputting a model based on the training data, they provide predictions for a set of $m$ feature vectors that arrive online. We show that, for the feature vectors on which an ensemble of models (trained on random disjoint subsets of a dataset) makes consistent predictions, there is almost no-cost of privacy in generating accurate predictions for those feature vectors. To that end, we provide a novel coupling of the distance to instability framework with the sparse vector technique. We provide algorithms with formal privacy and utility guarantees for both binary/multi-class classification, and soft-label classification. For binary classification in the standard (agnostic) PAC model, we show how to bootstrap from our privately generated predictions to construct a computationally efficient private learner that outputs a final accurate hypothesis. Our construction - to the best of our knowledge - is the first computationally efficient construction for a label-private learner. We prove sample complexity upper bounds for this setting. As in non-private sample complexity bounds, the only relevant property of the given concept class is its VC dimension. For soft-label classification, our techniques are based on exploiting the stability properties of traditional learning algorithms, like stochastic gradient descent (SGD). We provide a new technique to boost the average-case stability properties of learning algorithms to strong (worst-case) stability properties, and then exploit them to obtain private classification algorithms. In the process, we also show that a large class of SGD methods satisfy average-case stability properties, in contrast to a smaller class of SGD methods that are uniformly stable as shown in prior work. △ Less

Submitted 13 March, 2018; originally announced March 2018.

arXiv:1712.06559 [pdf, other]

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

Authors: Siyuan Ma, Raef Bassily, Mikhail Belkin

Abstract: In this paper we aim to formally explain the phenomenon of fast convergence of SGD observed in modern machine learning. The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss (classification and regression) close to zero. While it is still unclear why these interpolated solutions perform well on tes… ▽ More In this paper we aim to formally explain the phenomenon of fast convergence of SGD observed in modern machine learning. The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss (classification and regression) close to zero. While it is still unclear why these interpolated solutions perform well on test data, we show that these regimes allow for fast convergence of SGD, comparable in number of iterations to full gradient descent. For convex loss functions we obtain an exponential convergence bound for {\it mini-batch} SGD parallel to that for full gradient descent. We show that there is a critical batch size $m^*$ such that: (a) SGD iteration with mini-batch size $m\leq m^*$ is nearly equivalent to $m$ iterations of mini-batch size $1$ (\emph{linear scaling regime}). (b) SGD iteration with mini-batch $m> m^*$ is nearly equivalent to a full gradient descent iteration (\emph{saturation regime}). Moreover, for the quadratic loss, we derive explicit expressions for the optimal mini-batch and step size and explicitly characterize the two regimes above. The critical mini-batch size can be viewed as the limit for effective mini-batch parallelization. It is also nearly independent of the data size, implying $O(n)$ acceleration over GD per unit of computation. We give experimental evidence on real data which closely follows our theoretical analyses. Finally, we show how our results fit in the recent developments in training deep neural networks and discuss connections to adaptive rates for SGD and variance reduction. △ Less

Submitted 14 June, 2018; v1 submitted 18 December, 2017; originally announced December 2017.

arXiv:1710.05233 [pdf, ps, other]

Learners that Use Little Information

Authors: Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, Amir Yehudayoff

Abstract: We study learning algorithms that are restricted to using a small amount of information from their input sample. We introduce a category of learning algorithms we term $d$-bit information learners, which are algorithms whose output conveys at most $d$ bits of information of their input. A central theme in this work is that such algorithms generalize. We focus on the learning capacity of these al… ▽ More We study learning algorithms that are restricted to using a small amount of information from their input sample. We introduce a category of learning algorithms we term $d$-bit information learners, which are algorithms whose output conveys at most $d$ bits of information of their input. A central theme in this work is that such algorithms generalize. We focus on the learning capacity of these algorithms, and prove sample complexity bounds with tight dependencies on the confidence and error parameters. We also observe connections with well studied notions such as sample compression schemes, Occam's razor, PAC-Bayes and differential privacy. We discuss an approach that allows us to prove upper bounds on the amount of information that algorithms reveal about their inputs, and also provide a lower bound by showing a simple concept class for which every (possibly randomized) empirical risk minimizer must reveal a lot of information. On the other hand, we show that in the distribution-dependent setting every VC class has empirical risk minimizers that do not reveal a lot of information. △ Less

Submitted 27 February, 2018; v1 submitted 14 October, 2017; originally announced October 2017.

arXiv:1707.04982 [pdf, other]

Practical Locally Private Heavy Hitters

Authors: Raef Bassily, Kobbi Nissim, Uri Stemmer, Abhradeep Thakurta

Abstract: We present new practical local differentially private heavy hitters algorithms achieving optimal or near-optimal worst-case error and running time -- TreeHist and Bitstogram. In both algorithms, server running time is $\tilde O(n)$ and user running time is $\tilde O(1)$, hence improving on the prior state-of-the-art result of Bassily and Smith [STOC 2015] requiring $O(n^{5/2})$ server time and… ▽ More We present new practical local differentially private heavy hitters algorithms achieving optimal or near-optimal worst-case error and running time -- TreeHist and Bitstogram. In both algorithms, server running time is $\tilde O(n)$ and user running time is $\tilde O(1)$, hence improving on the prior state-of-the-art result of Bassily and Smith [STOC 2015] requiring $O(n^{5/2})$ server time and $O(n^{3/2})$ user time. With a typically large number of participants in local algorithms ($n$ in the millions), this reduction in time complexity, in particular at the user side, is crucial for making locally private heavy hitters algorithms usable in practice. We implemented Algorithm TreeHist to verify our theoretical analysis and compared its performance with the performance of Google's RAPPOR code. △ Less

Submitted 16 July, 2017; originally announced July 2017.

arXiv:1604.03336 [pdf, other]

Typical Stability

Authors: Raef Bassily, Yoav Freund

Abstract: In this paper, we introduce a notion of algorithmic stability called typical stability. When our goal is to release real-valued queries (statistics) computed over a dataset, this notion does not require the queries to be of bounded sensitivity -- a condition that is generally assumed under differential privacy [DMNS06, Dwork06] when used as a notion of algorithmic stability [DFHPRR15a, DFHPRR15b,… ▽ More In this paper, we introduce a notion of algorithmic stability called typical stability. When our goal is to release real-valued queries (statistics) computed over a dataset, this notion does not require the queries to be of bounded sensitivity -- a condition that is generally assumed under differential privacy [DMNS06, Dwork06] when used as a notion of algorithmic stability [DFHPRR15a, DFHPRR15b, BNSSSU16] -- nor does it require the samples in the dataset to be independent -- a condition that is usually assumed when generalization-error guarantees are sought. Instead, typical stability requires the output of the query, when computed on a dataset drawn from the underlying distribution, to be concentrated around its expected value with respect to that distribution. We discuss the implications of typical stability on the generalization error (i.e., the difference between the value of the query computed on the dataset and the expected value of the query with respect to the true data distribution). We show that typical stability can control generalization error in adaptive data analysis even when the samples in the dataset are not necessarily independent and when queries to be computed are not necessarily of bounded-sensitivity as long as the results of the queries over the dataset (i.e., the computed statistics) follow a distribution with a "light" tail. Examples of such queries include, but not limited to, subgaussian and subexponential queries. We also discuss the composition guarantees of typical stability and prove composition theorems that characterize the degradation of the parameters of typical stability under $k$-fold adaptive composition. We also give simple noise-addition algorithms that achieve this notion. These algorithms are similar to their differentially private counterparts, however, the added noise is calibrated differently. △ Less

Submitted 18 September, 2016; v1 submitted 12 April, 2016; originally announced April 2016.

Comments: New sections, extended discussions, and complete proofs

arXiv:1511.02513 [pdf, other]

Algorithmic Stability for Adaptive Data Analysis

Authors: Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, Jonathan Ullman

Abstract: Adaptivity is an important feature of data analysis---the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014) initiated the formal stu… ▽ More Adaptivity is an important feature of data analysis---the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014) initiated the formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis. Specifically, suppose there is an unknown distribution $\mathbf{P}$ and a set of $n$ independent samples $\mathbf{x}$ is drawn from $\mathbf{P}$. We seek an algorithm that, given $\mathbf{x}$ as input, accurately answers a sequence of adaptively chosen queries about the unknown distribution $\mathbf{P}$. How many samples $n$ must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy? In this work we make two new contributions: (i) We give upper bounds on the number of samples $n$ that are needed to answer statistical queries. The bounds improve and simplify the work of Dwork et al. (STOC, 2015), and have been applied in subsequent work by those authors (Science, 2015, NIPS, 2015). (ii) We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and an important class of optimization queries. As in Dwork et al., our algorithms are based on a connection with algorithmic stability in the form of differential privacy. We extend their work by giving a quantitatively optimal, more general, and simpler proof of their main theorem that stability implies low generalization error. We also study weaker stability guarantees such as bounded KL divergence and total variation distance. △ Less

Submitted 8 November, 2015; originally announced November 2015.

Comments: This work unifies and subsumes the two arXiv manuscripts arXiv:1503.04843 and arXiv:1504.05800

arXiv:1504.04686 [pdf, other]

doi 10.1145/2746539.2746632

Local, Private, Efficient Protocols for Succinct Histograms

Authors: Raef Bassily, Adam Smith

Abstract: We give efficient protocols and matching accuracy lower bounds for frequency estimation in the local model for differential privacy. In this model, individual users randomize their data themselves, sending differentially private reports to an untrusted server that aggregates them. We study protocols that produce a succinct histogram representation of the data. A succinct histogram is a list of t… ▽ More We give efficient protocols and matching accuracy lower bounds for frequency estimation in the local model for differential privacy. In this model, individual users randomize their data themselves, sending differentially private reports to an untrusted server that aggregates them. We study protocols that produce a succinct histogram representation of the data. A succinct histogram is a list of the most frequent items in the data (often called "heavy hitters") along with estimates of their frequencies; the frequency of all other items is implicitly estimated as 0. If there are $n$ users whose items come from a universe of size $d$, our protocols run in time polynomial in $n$ and $\log(d)$. With high probability, they estimate the accuracy of every item up to error $O\left(\sqrt{\log(d)/(ε^2n)}\right)$ where $ε$ is the privacy parameter. Moreover, we show that this much error is necessary, regardless of computational efficiency, and even for the simple setting where only one item appears with significant frequency in the data set. Previous protocols (Mishra and Sandler, 2006; Hsu, Khanna and Roth, 2012) for this task either ran in time $Ω(d)$ or had much worse error (about $\sqrt[6]{\log(d)/(ε^2n)}$), and the only known lower bound on error was $Ω(1/\sqrt{n})$. We also adapt a result of McGregor et al (2010) to the local setting. In a model with public coins, we show that each user need only send 1 bit to the server. For all known local protocols (including ours), the transformation preserves computational efficiency. △ Less

Submitted 18 April, 2015; originally announced April 2015.

ACM Class: F.2.0

arXiv:1503.04843

More General Queries and Less Generalization Error in Adaptive Data Analysis

Authors: Raef Bassily, Adam Smith, Thomas Steinke, Jonathan Ullman

Abstract: Adaptivity is an important feature of data analysis---typically the choice of questions asked about a dataset depends on previous interactions with the same dataset. However, generalization error is typically bounded in a non-adaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC '15) and Hardt and Ullman (FOCS '14) initiated the formal stu… ▽ More Adaptivity is an important feature of data analysis---typically the choice of questions asked about a dataset depends on previous interactions with the same dataset. However, generalization error is typically bounded in a non-adaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC '15) and Hardt and Ullman (FOCS '14) initiated the formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis. Specifically, suppose there is an unknown distribution $\mathcal{P}$ and a set of $n$ independent samples $x$ is drawn from $\mathcal{P}$. We seek an algorithm that, given $x$ as input, "accurately" answers a sequence of adaptively chosen "queries" about the unknown distribution $\mathcal{P}$. How many samples $n$ must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy? In this work we make two new contributions towards resolving this question: *We give upper bounds on the number of samples $n$ that are needed to answer statistical queries that improve over the bounds of Dwork et al. *We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and the important class of convex risk minimization queries. As in Dwork et al., our algorithms are based on a connection between differential privacy and generalization error, but we feel that our analysis is simpler and more modular, which may be useful for studying these questions in the future. △ Less

Submitted 9 November, 2015; v1 submitted 16 March, 2015; originally announced March 2015.

Comments: This paper was merged with another manuscript and is now subsumed by arXiv:1511.02513

arXiv:1409.3893 [pdf, ps, other]

Causal Erasure Channels

Authors: Raef Bassily, Adam Smith

Abstract: We consider the communication problem over binary causal adversarial erasure channels. Such a channel maps $n$ input bits to $n$ output symbols in $\{0,1,\wedge\}$, where $\wedge$ denotes erasure. The channel is causal if, for every $i$, the channel adversarially decides whether to erase the $i$th bit of its input based on inputs $1,...,i$, before it observes bits $i+1$ to $n$. Such a channel is… ▽ More We consider the communication problem over binary causal adversarial erasure channels. Such a channel maps $n$ input bits to $n$ output symbols in $\{0,1,\wedge\}$, where $\wedge$ denotes erasure. The channel is causal if, for every $i$, the channel adversarially decides whether to erase the $i$th bit of its input based on inputs $1,...,i$, before it observes bits $i+1$ to $n$. Such a channel is $p$-bounded if it can erase at most a $p$ fraction of the input bits over the whole transmission duration. Causal channels provide a natural model for channels that obey basic physical restrictions but are otherwise unpredictable or highly variable. For a given erasure rate $p$, our goal is to understand the optimal rate (the capacity) at which a randomized (stochastic) encoder/decoder can transmit reliably across all causal $p$-bounded erasure channels. In this paper, we introduce the causal erasure model and provide new upper bounds and lower bounds on the achievable rate. Our bounds separate the achievable rate in the causal erasures setting from the rates achievable in two related models: random erasure channels (strictly weaker) and fully adversarial erasure channels (strictly stronger). Specifically, we show: - A strict separation between random and causal erasures for all constant erasure rates $p\in(0,1)$. - A strict separation between causal and fully adversarial erasures for $p\in(0,φ)$ where $φ\approx 0.348$. - For $p\in[φ,1/2)$, we show codes for causal erasures that have higher rate than the best known constructions for fully adversarial channels. Our results contrast with existing results on correcting causal bit-flip errors (as opposed to erasures) [Dey et. al 2008, 2009], [Haviv-Langberg 2011]. For the separations we provide, the analogous separations for bit-flip models are either not known at all or much weaker. △ Less

Submitted 12 September, 2014; originally announced September 2014.

arXiv:1405.7085 [pdf, other]

Differentially Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds

Authors: Raef Bassily, Adam Smith, Abhradeep Thakurta

Abstract: In this paper, we initiate a systematic investigation of differentially private algorithms for convex empirical risk minimization. Various instantiations of this problem have been studied before. We provide new algorithms and matching lower bounds for private ERM assuming only that each data point's contribution to the loss function is Lipschitz bounded and that the domain of optimization is bound… ▽ More In this paper, we initiate a systematic investigation of differentially private algorithms for convex empirical risk minimization. Various instantiations of this problem have been studied before. We provide new algorithms and matching lower bounds for private ERM assuming only that each data point's contribution to the loss function is Lipschitz bounded and that the domain of optimization is bounded. We provide a separate set of algorithms and matching lower bounds for the setting in which the loss functions are known to also be strongly convex. Our algorithms run in polynomial time, and in some cases even match the optimal non-private running time (as measured by oracle complexity). We give separate algorithms (and lower bounds) for $(ε,0)$- and $(ε,δ)$-differential privacy; perhaps surprisingly, the techniques used for designing optimal algorithms in the two cases are completely different. Our lower bounds apply even to very simple, smooth function families, such as linear and quadratic functions. This implies that algorithms from previous work can be used to obtain optimal error rates, under the additional assumption that the contributions of each data point to the loss function is smooth. We show that simple approaches to smoothing arbitrary loss functions (in order to apply previous techniques) do not yield optimal error rates. In particular, optimal algorithms were not previously known for problems such as training support vector machines and the high-dimensional median. △ Less

Submitted 17 October, 2014; v1 submitted 27 May, 2014; originally announced May 2014.

arXiv:1010.6057 [pdf, ps, other]

Ergodic Secret Alignment

Authors: Raef Bassily, Sennur Ulukus

Abstract: In this paper, we introduce two new achievable schemes for the fading multiple access wiretap channel (MAC-WT). In the model that we consider, we assume that perfect knowledge of the state of all channels is available at all the nodes in a causal fashion. Our schemes use this knowledge together with the time varying nature of the channel model to align the interference from different users at the… ▽ More In this paper, we introduce two new achievable schemes for the fading multiple access wiretap channel (MAC-WT). In the model that we consider, we assume that perfect knowledge of the state of all channels is available at all the nodes in a causal fashion. Our schemes use this knowledge together with the time varying nature of the channel model to align the interference from different users at the eavesdropper perfectly in a one-dimensional space while creating a higher dimensionality space for the interfering signals at the legitimate receiver hence allowing for better chance of recovery. While we achieve this alignment through signal scaling at the transmitters in our first scheme (scaling based alignment (SBA)), we let nature provide this alignment through the ergodicity of the channel coefficients in the second scheme (ergodic secret alignment (ESA)). For each scheme, we obtain the resulting achievable secrecy rate region. We show that the secrecy rates achieved by both schemes scale with SNR as 1/2log(SNR). Hence, we show the sub-optimality of the i.i.d. Gaussian signaling based schemes with and without cooperative jamming by showing that the secrecy rates achieved using i.i.d. Gaussian signaling with cooperative jamming do not scale with SNR. In addition, we introduce an improved version of our ESA scheme where we incorporate cooperative jamming to achieve higher secrecy rates. Moreover, we derive the necessary optimality conditions for the power control policy that maximizes the secrecy sum rate achievable by our ESA scheme when used solely and with cooperative jamming. △ Less

Submitted 28 October, 2010; originally announced October 2010.

Comments: Submitted to IEEE Transactions on Information Theory, October 2010

Showing 1–30 of 30 results for author: Bassily, R