-
Continual Counting with Gradual Privacy Expiration
Authors:
Joel Daniel Andersson,
Monika Henzinger,
Rasmus Pagh,
Teresa Anna Steiner,
Jalaj Upadhyay
Abstract:
Differential privacy with gradual expiration models the setting where data items arrive in a stream and at a given time $t$ the privacy loss guaranteed for a data item seen at time $(t-d)$ is $εg(d)$, where $g$ is a monotonically non-decreasing function. We study the fundamental $\textit{continual (binary) counting}$ problem where each data item consists of a bit, and the algorithm needs to output…
▽ More
Differential privacy with gradual expiration models the setting where data items arrive in a stream and at a given time $t$ the privacy loss guaranteed for a data item seen at time $(t-d)$ is $εg(d)$, where $g$ is a monotonically non-decreasing function. We study the fundamental $\textit{continual (binary) counting}$ problem where each data item consists of a bit, and the algorithm needs to output at each time step the sum of all the bits streamed so far. For a stream of length $T$ and privacy $\textit{without}$ expiration continual counting is possible with maximum (over all time steps) additive error $O(\log^2(T)/\varepsilon)$ and the best known lower bound is $Ω(\log(T)/\varepsilon)$; closing this gap is a challenging open problem.
We show that the situation is very different for privacy with gradual expiration by giving upper and lower bounds for a large set of expiration functions $g$. Specifically, our algorithm achieves an additive error of $ O(\log(T)/ε)$ for a large set of privacy expiration functions. We also give a lower bound that shows that if $C$ is the additive error of any $ε$-DP algorithm for this problem, then the product of $C$ and the privacy expiration function after $2C$ steps must be $Ω(\log(T)/ε)$. Our algorithm matches this lower bound as its additive error is $O(\log(T)/ε)$, even when $g(2C) = O(1)$.
Our empirical evaluation shows that we achieve a slowly growing privacy loss with significantly smaller empirical privacy loss for large values of $d$ than a natural baseline algorithm.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Private graph colouring with limited defectiveness
Authors:
Aleksander B. G. Christiansen,
Eva Rotenberg,
Teresa Anna Steiner,
Juliette Vlieghe
Abstract:
Differential privacy is the gold standard in the problem of privacy preserving data analysis, which is crucial in a wide range of disciplines. Vertex colouring is one of the most fundamental questions about a graph. In this paper, we study the vertex colouring problem in the differentially private setting.
To be edge-differentially private, a colouring algorithm needs to be defective: a colourin…
▽ More
Differential privacy is the gold standard in the problem of privacy preserving data analysis, which is crucial in a wide range of disciplines. Vertex colouring is one of the most fundamental questions about a graph. In this paper, we study the vertex colouring problem in the differentially private setting.
To be edge-differentially private, a colouring algorithm needs to be defective: a colouring is d-defective if a vertex can share a colour with at most d of its neighbours. Without defectiveness, the only differentially private colouring algorithm needs to assign n different colours to the n different vertices. We show the following lower bound for the defectiveness: a differentially private c-edge colouring algorithm of a graph of maximum degree Δ > 0 has defectiveness at least d = Ω (log n / (log c+log Δ)).
We also present an ε-differentially private algorithm to Θ ( Δ / log n + 1 / ε)-colour a graph with defectiveness at most Θ(log n).
△ Less
Submitted 29 April, 2024;
originally announced April 2024.
-
Differentially Private Approximate Pattern Matching
Authors:
Teresa Anna Steiner
Abstract:
In this paper, we consider the $k$-approximate pattern matching problem under differential privacy, where the goal is to report or count all substrings of a given string $S$ which have a Hamming distance at most $k$ to a pattern $P$, or decide whether such a substring exists. In our definition of privacy, individual positions of the string $S$ are protected. To be able to answer queries under diff…
▽ More
In this paper, we consider the $k$-approximate pattern matching problem under differential privacy, where the goal is to report or count all substrings of a given string $S$ which have a Hamming distance at most $k$ to a pattern $P$, or decide whether such a substring exists. In our definition of privacy, individual positions of the string $S$ are protected. To be able to answer queries under differential privacy, we allow some slack on $k$, i.e. we allow reporting or counting substrings of $S$ with a distance at most $(1+γ)k+α$ to $P$, for a multiplicative error $γ$ and an additive error $α$. We analyze which values of $α$ and $γ$ are necessary or sufficient to solve the $k$-approximate pattern matching problem while satisfying $ε$-differential privacy. Let $n$ denote the length of $S$. We give 1) an $ε$-differentially private algorithm with an additive error of $O(ε^{-1}\log n)$ and no multiplicative error for the existence variant; 2) an $ε$-differentially private algorithm with an additive error $O(ε^{-1}\max(k,\log n)\cdot\log n)$ for the counting variant; 3) an $ε$-differentially private algorithm with an additive error of $O(ε^{-1}\log n)$ and multiplicative error $O(1)$ for the reporting variant for a special class of patterns. The error bounds hold with high probability. All of these algorithms return a witness, that is, if there exists a substring of $S$ with distance at most $k$ to $P$, then the algorithm returns a substring of $S$ with distance at most $(1+γ)k+α$ to $P$. Further, we complement these results by a lower bound, showing that any algorithm for the existence variant which also returns a witness must have an additive error of $Ω(ε^{-1}\log n)$ with constant probability.
△ Less
Submitted 13 November, 2023;
originally announced November 2023.
-
Differentially Private Histogram, Predecessor, and Set Cardinality under Continual Observation
Authors:
Monika Henzinger,
A. R. Sricharan,
Teresa Anna Steiner
Abstract:
Differential privacy is the de-facto privacy standard in data analysis. The classic model of differential privacy considers the data to be static. The dynamic setting, called differential privacy under continual observation, captures many applications more realistically. In this work we consider several natural dynamic data structure problems under continual observation, where we want to maintain…
▽ More
Differential privacy is the de-facto privacy standard in data analysis. The classic model of differential privacy considers the data to be static. The dynamic setting, called differential privacy under continual observation, captures many applications more realistically. In this work we consider several natural dynamic data structure problems under continual observation, where we want to maintain information about a changing data set such that we can answer certain sets of queries at any given time while satisfying $ε$-differential privacy. The problems we consider include (a) maintaining a histogram and various extensions of histogram queries such as quantile queries, (b) maintaining a predecessor search data structure of a dynamically changing set in a given ordered universe, and (c) maintaining the cardinality of a dynamically changing set. For (a) we give new error bounds parameterized in the maximum output of any query $c_{\max}$: our algorithm gives an upper bound of $O(d\log^2dc_{\max}+\log T)$ for computing histogram, the maximum and minimum column sum, quantiles on the column sums, and related queries. The bound holds for unknown $c_{\max}$ and $T$. For (b), we give a general reduction to orthogonal range counting. Further, we give an improvement for the case where only insertions are allowed. We get a data structure which for a given query, returns an interval that contains the predecessor, and at most $O(\log^2 u \sqrt{\log T})$ more elements, where $u$ is the size of the universe. The bound holds for unknown $T$. Lastly, for (c), we give a parameterized upper bound of $O(\min(d,\sqrt{K\log T}))$, where $K$ is an upper bound on the number of updates. We show a matching lower bound. Finally, we show how to extend the bound for (c) for unknown $K$ and $T$.
△ Less
Submitted 17 June, 2023;
originally announced June 2023.
-
Compressed Indexing for Consecutive Occurrences
Authors:
Paweł Gawrychowski,
Garance Gourdel,
Tatiana Starikovskaya,
Teresa Anna Steiner
Abstract:
The fundamental question considered in algorithms on strings is that of indexing, that is, preprocessing a given string for specific queries. By now we have a number of efficient solutions for this problem when the queries ask for an exact occurrence of a given pattern $P$. However, practical applications motivate the necessity of considering more complex queries, for example concerning near occur…
▽ More
The fundamental question considered in algorithms on strings is that of indexing, that is, preprocessing a given string for specific queries. By now we have a number of efficient solutions for this problem when the queries ask for an exact occurrence of a given pattern $P$. However, practical applications motivate the necessity of considering more complex queries, for example concerning near occurrences of two patterns. Recently, Bille et al. [CPM 2021] introduced a variant of such queries, called gapped consecutive occurrences, in which a query consists of two patterns $P_{1}$ and $P_{2}$ and a range $[a,b]$, and one must find all consecutive occurrences $(q_1,q_2)$ of $P_{1}$ and $P_{2}$ such that $q_2-q_1 \in [a,b]$. By their results, we cannot hope for a very efficient indexing structure for such queries, even if $a=0$ is fixed (although at the same time they provided a non-trivial upper bound). Motivated by this, we focus on a text given as a straight-line program (SLP) and design an index taking space polynomial in the size of the grammar that answers such queries in time optimal up to polylog factors.
△ Less
Submitted 3 April, 2023;
originally announced April 2023.
-
Differentially Private Data Structures under Continual Observation for Histograms and Related Queries
Authors:
Monika Henzinger,
A. R. Sricharan,
Teresa Anna Steiner
Abstract:
Binary counting under continual observation is a well-studied fundamental problem in differential privacy. A natural extension is maintaining column sums, also known as histogram, over a stream of rows from $\{0,1\}^d$, and answering queries about those sums, e.g. the maximum column sum or the median, while satisfying differential privacy. Jain et al. (2021) showed that computing the maximum colum…
▽ More
Binary counting under continual observation is a well-studied fundamental problem in differential privacy. A natural extension is maintaining column sums, also known as histogram, over a stream of rows from $\{0,1\}^d$, and answering queries about those sums, e.g. the maximum column sum or the median, while satisfying differential privacy. Jain et al. (2021) showed that computing the maximum column sum under continual observation while satisfying event-level differential privacy requires an error either polynomial in the dimension $d$ or the stream length $T$. On the other hand, no $o(d\log^2 T)$ upper bound for $ε$-differential privacy or $o(\sqrt{d}\log^{3/2} T)$ upper bound for $(ε,δ)$-differential privacy are known. In this work, we give new parameterized upper bounds for maintaining histogram, maximum column sum, quantiles of the column sums, and any set of at most $d$ low-sensitivity, monotone, real valued queries on the column sums. Our solutions achieve an error of approximately $O(d\log^2 c_{\max}+\log T)$ for $ε$-differential privacy and approximately $O(\sqrt{d}\log^{3/2}c_{\max}+\log T)$ for $(ε,δ)$-differential privacy, where $c_{\max}$ is the maximum value that the queries we want to answer can assume on the given data set.
Furthermore, we show that such an improvement is not possible for a slightly expanded notion of neighboring streams by giving a lower bound of $Ω(d \log T)$. This explains why our improvement cannot be achieved with the existing mechanisms for differentially private histograms, as they remain differentially private even for this expanded notion of neighboring streams.
△ Less
Submitted 22 February, 2023;
originally announced February 2023.
-
Gapped String Indexing in Subquadratic Space and Sublinear Query Time
Authors:
Philip Bille,
Inge Li Gørtz,
Moshe Lewenstein,
Solon P. Pissis,
Eva Rotenberg,
Teresa Anna Steiner
Abstract:
In Gapped String Indexing, the goal is to compactly represent a string $S$ of length $n$ such that for any query consisting of two strings $P_1$ and $P_2$, called patterns, and an integer interval $[α, β]$, called gap range, we can quickly find occurrences of $P_1$ and $P_2$ in $S$ with distance in $[α, β]$. Gapped String Indexing is a central problem in computational biology and text mining and h…
▽ More
In Gapped String Indexing, the goal is to compactly represent a string $S$ of length $n$ such that for any query consisting of two strings $P_1$ and $P_2$, called patterns, and an integer interval $[α, β]$, called gap range, we can quickly find occurrences of $P_1$ and $P_2$ in $S$ with distance in $[α, β]$. Gapped String Indexing is a central problem in computational biology and text mining and has thus received significant research interest, including parameterized and heuristic approaches. Despite this interest, the best-known time-space trade-offs for Gapped String Indexing are the straightforward $O(n)$ space and $O(n+occ)$ query time or $Ω(n^2)$ space and $\tilde{O}(|P_1| + |P_2| + occ)$ query time.
We break through this barrier obtaining the first interesting trade-offs with polynomially subquadratic space and polynomially sublinear query time. In particular, we show that, for every $0\leq δ\leq 1$, there is a data structure for Gapped String Indexing with either $\tilde{O}(n^{2-δ/3})$ or $\tilde{O}(n^{3-2δ})$ space and $\tilde{O}(|P_1| + |P_2| + n^δ\cdot (occ+1))$ query time, where $occ$ is the number of reported occurrences.
As a new tool towards obtaining our main result, we introduce the Shifted Set Intersection problem. We show that this problem is equivalent to the indexing variant of 3SUM (3SUM Indexing). Via a series of reductions, we obtain a solution to the Gapped String Indexing problem. Furthermore, we enhance our data structure for deciding Shifted Set Intersection, so that we can support the reporting variant of the problem. Via the obtained equivalence to 3SUM Indexing, we thus give new improved data structures for the reporting variant of 3SUM Indexing, and we show how this improves upon the state-of-the-art solution for Jumbled Indexing for any alphabet of constant size $σ>5$.
△ Less
Submitted 5 March, 2024; v1 submitted 30 November, 2022;
originally announced November 2022.
-
The Fine-Grained Complexity of Episode Matching
Authors:
Philip Bille,
Inge Li Gørtz,
Shay Mozes,
Teresa Anna Steiner,
Oren Weimann
Abstract:
Given two strings $S$ and $P$, the Episode Matching problem is to find the shortest substring of $S$ that contains $P$ as a subsequence. The best known upper bound for this problem is $\tilde O(nm)$ by Das et al. (1997) , where $n,m$ are the lengths of $S$ and $P$, respectively. Although the problem is well studied and has many applications in data mining, this bound has never been improved. In th…
▽ More
Given two strings $S$ and $P$, the Episode Matching problem is to find the shortest substring of $S$ that contains $P$ as a subsequence. The best known upper bound for this problem is $\tilde O(nm)$ by Das et al. (1997) , where $n,m$ are the lengths of $S$ and $P$, respectively. Although the problem is well studied and has many applications in data mining, this bound has never been improved. In this paper we show why this is the case by proving that no $O((nm)^{1-ε})$ algorithm (even for binary strings) exists, unless the Strong Exponential Time Hypothesis (SETH) is false. We then consider the indexing version of the problem, where $S$ is preprocessed into a data structure for answering episode matching queries $P$. We show that for any $τ$, there is a data structure using $O(n+\left(\frac{n}τ\right)^k)$ space that answers episode matching queries for any $P$ of length $k$ in $O(k\cdot τ\cdot \log \log n )$ time. We complement this upper bound with an almost matching lower bound, showing that any data structure that answers episode matching queries for patterns of length $k$ in time $O(n^δ)$, must use $Ω(n^{k-kδ-o(1)})$ space, unless the Strong $k$-Set Disjointness Conjecture is false. Finally, for the special case of $k=2$, we present a faster construction of the data structure using fast min-plus multiplication of bounded integer matrices.
△ Less
Submitted 14 February, 2024; v1 submitted 19 August, 2021;
originally announced August 2021.
-
Gapped Indexing for Consecutive Occurrences
Authors:
Philip Bille,
Inge Li Gørtz,
Max Rishøj Pedersen,
Teresa Anna Steiner
Abstract:
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant…
▽ More
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns P1 and P2 and a gap range [α,β] we can quickly find the consecutive occurrences of P1 and P2 with distance in [α,β], i.e., pairs of occurrences immediately following each other and with distance within the range. We present data structures that use Õ(n) space and query time Õ(|P1|+|P2|+n^(2/3)) for existence and counting and Õ(|P1|+|P2|+n^(2/3)*occ^(1/3)) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using Õ(n) space must use \tildeΩ}(|P1|+|P2|+\sqrt{n}) query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem.
△ Less
Submitted 4 February, 2021;
originally announced February 2021.
-
String Indexing for Top-$k$ Close Consecutive Occurrences
Authors:
Philip Bille,
Inge Li Gørtz,
Max Rishøj Pedersen,
Eva Rotenberg,
Teresa Anna Steiner
Abstract:
The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here…
▽ More
The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair $(i,j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$, and their distance is defined as $j-i$. Given a pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ and $k$. We give three time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $ε\in(0,1]$. Our first result achieves $O(n\log n)$ space and optimal query time of $O(m+k)$. Our second and third results achieve linear space and query times either $O(m+k^{1+ε})$ or $O(m + k \log^{1+ε} n)$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.
△ Less
Submitted 14 February, 2024; v1 submitted 8 July, 2020;
originally announced July 2020.
-
String Indexing with Compressed Patterns
Authors:
Philip Bille,
Inge Li Gørtz,
Teresa Anna Steiner
Abstract:
Given a string $S$ of length $n$, the classic string indexing problem is to preprocess $S$ into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server…
▽ More
Given a string $S$ of length $n$, the classic string indexing problem is to preprocess $S$ into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.
△ Less
Submitted 14 February, 2024; v1 submitted 26 September, 2019;
originally announced September 2019.