\subsection{Coarse Estimation}

We review the one-dimensional, sample-level private coarse estimator given in KSU'20, the Pure DP Range Estimator $\textrm{PDPRE}_{\eps,R,k,u}(X)$:

\begin{enumerate}
  \item If $u \geq 2R$, return any point in $[-R,R]$. Otherwise, set the parameter $r \leftarrow u/2$.
  \item Divide $[-R-2r,R+2r]$ into buckets: $[-R-2r,-R),\ldots,[-2r,0),[0,2r),\ldots,[R,R+2r]$.
  \item Run Pure DP Histogram for $X$ over the above buckets.
  \item Let $[a,b]$ be the bucket that has the maximum number of points.
  \item Return $\mu_{\texttt{coarse}}=\frac{a+b}{2}$.
\end{enumerate}
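The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation from KSU'20: the names \texttt{pdpre} and \texttt{laplace} are hypothetical, the Pure DP Histogram is assumed to add independent Laplace noise of scale $2/\eps$ to every bucket count, buckets are assumed to be aligned at $0$, and samples outside $[-R-2r,R+2r]$ are assumed to be clamped into the outermost buckets. The parameter $k$ does not appear in the steps, so the sketch omits it.

```python
import math
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def pdpre(xs, eps, R, u, rng=None):
    """Illustrative coarse estimator: midpoint of the bucket with the largest noisy count."""
    if rng is None:
        rng = random.Random()

    # Step 1: if u >= 2R, any point of [-R, R] is within u of the mean.
    if u >= 2 * R:
        return 0.0
    r = u / 2
    width = 2 * r

    # Step 2: buckets of width 2r covering [-R - 2r, R + 2r].
    # Assumption: bucket j is [j * width, (j + 1) * width).
    lo = math.floor((-R - 2 * r) / width)
    hi = math.ceil((R + 2 * r) / width) - 1
    counts = {j: 0 for j in range(lo, hi + 1)}
    for x in xs:
        # Assumption: out-of-range samples are clamped to the outermost buckets.
        j = min(max(math.floor(x / width), lo), hi)
        counts[j] += 1

    # Step 3: noisy histogram (assumed Laplace mechanism with scale 2 / eps).
    noisy = {j: c + laplace(2.0 / eps, rng) for j, c in counts.items()}

    # Step 4: bucket with the largest noisy count.
    best = max(noisy, key=noisy.get)
    a, b = best * width, (best + 1) * width

    # Step 5: return the midpoint of that bucket.
    return (a + b) / 2
```

For instance, calling \texttt{pdpre(xs, eps=1.0, R=10.0, u=2.0)} on a few thousand samples concentrated around some $\mu\in[-10,10]$ returns the midpoint of a width-$2$ bucket near $\mu$.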

\begin{theorem}[Sample-Level Coarse Estimation]\label{thm:sample-level-coarse-esitmation}
For all $\eps>0$, PDPRE is $\eps$-DP. Furthermore, suppose $P$ is a distribution over $\R$ with mean $\mu\in[-R,R]$ and $k$-th moment bounded by $1$. Then there exists a small constant $c$ {\color{blue}(that gets smaller as $k$ gets bigger\ldots)} such that, for all $u\geq c$, there exists
\[
n_{0}=O\Paren{\frac{\log(R/(u\beta))}{\eps}+\log(1/\beta)}
\]
such that, if $n\geq n_{0}$, then with probability at least $1-\beta$,
\[
|\mu-\mu_{\texttt{coarse}}|\leq u.
\]
\end{theorem}
\begin{proof}
The proof of privacy follows directly from {\color{blue}cite}. Thus, the rest of this proof is dedicated to accuracy. If $u\geq 2R$, then by step $1$ the coarse estimate is within $u$ of $\mu$. Otherwise, we show that $\max(|a-\mu|,|b-\mu|)\leq u$, which implies that $|\mu_{\texttt{coarse}}-\mu|\leq u$.

We first show that, with probability at least $1-\beta/2$, the heaviest (non-noisy) bucket in $[-R-2r,R+2r]$ (i.e., the bucket with the most samples) must intersect $[\mu-r,\mu+r]$. If the (noisy) bucket $[a,b]$ discovered in our algorithm is also the heaviest non-noisy bucket, then this would immediately imply $\max(|a-\mu|,|b-\mu|)\leq 2r=u$. To prove this, it suffices to show that at most $n/16$ samples are outside of $[\mu-r,\mu+r]$. This event suffices because any bucket that does not intersect $[\mu-r,\mu+r]$ then contains at most $n/16$ samples, whereas $[\mu-r,\mu+r]$ has width $2r$ and so meets at most three buckets, which means the heaviest bucket intersecting it contains at least $(15n/16)/3\geq n/4$ samples, more than $n/16$.

We begin by calculating the expected number of samples that fall outside of the interval $[\mu-r,\mu+r]$. If we set $Y_{i}=\mathbb{I}(X_{i}\notin[\mu-r,\mu+r])$, then this is equivalent to calculating $\E[\sum Y_{i}]$:
\[
\E\Brac{\sum_{i=1}^{n}Y_{i}}=n\Pr[|X_{i}-\mu|\geq r]\leq n/r^{k},
\]
where the last inequality comes from the bounded $k$-th moment assumption and Markov's inequality applied to $|X_{i}-\mu|^{k}$. Thus, we can show that, with probability at most $\beta/2$, more than $n/16$ samples fall outside of $\mu\pm r$:
\[
\Pr\Brac{\sum_{i=1}^{n}Y_{i}\geq\frac{n}{16}}\leq\Pr\Brac{\sum_{i=1}^{n}Y_{i}\geq\frac{r^{k}}{16}\cdot\frac{n}{r^{k}}}\leq\exp\Paren{-\Theta(n)}\leq\beta/2,
\]
by a Chernoff bound, so long as $2r=u\geq c$ for some constant $c$ and $n\geq\Theta(\ln(1/\beta))$.

Finally, we show that, with probability at least $1-\beta/2$, the heaviest non-noisy bucket is also the heaviest noisy bucket. By Lemma~({\color{blue}cite}), we know that, with probability at least $1-\beta/2$, the largest magnitude of the noise in any bucket will not exceed $n/16$, so long as $n\geq\Theta\Paren{\ln(R/(r\beta))/\eps}$. Thus, the heaviest non-noisy bucket will remain the heaviest bucket after noise is added to all of the buckets, completing the proof.
\end{proof}

\begin{corollary}[User-Level Coarse Estimation]
Let $P$ be a distribution over $\R$ with mean $\mu\in\brac{-R,R}$ and $k$-th moment bounded by $1$. Then for all $\eps>0$, there exists an $\eps$-DP user-level algorithm that takes $n\geq n_{0}$ many users, where
\[
n_{0}=\cO_{k}\Paren{\frac{\log(Rm/\beta)}{\eps}},
\]
and outputs $\mu_{\texttt{coarse}}$ such that
\[
\Abs{\mu_{\texttt{coarse}}-\mu}\leq\Theta\Paren{\sqrt{\frac{1}{m}}},
\]
where $m$ is the number of samples per user.
\end{corollary}

\begin{proof}
\Cref{thm:sample-level-coarse-esitmation}, together with \Cref{thm:user-level-to-sample-level-reduction} and setting the accuracy parameter to $\Theta\Paren{\sqrt{\frac{1}{m}}}$, readily proves this corollary.
\end{proof}

\begin{remark}
This is sufficient for the purposes of the fine estimation setting, as the two values the clipping radius takes are both larger than $\sqrt{\frac{1}{m}}$. Therefore, a coarse estimate with accuracy $\sqrt{\frac{1}{m}}$ is sufficient.
\end{remark}