-
Exploiting New Properties of String Net Frequency for Efficient Computation
Authors:
Peaker Guo,
Patrick Eades,
Anthony Wirth,
Justin Zobel
Abstract:
Knowing which strings in a massive text are significant -- that is, which strings are common and distinct from other strings -- is valuable for several applications, including text compression and tokenization. Frequency in itself is not helpful for significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings with positive net frequency are of maximal length. However, net frequency remains relatively unexplored, and there is no prior art showing how to compute it efficiently. We first introduce a characteristic of net frequency that simplifies the original definition. With this, we study strings with positive net frequency in Fibonacci words. We then use our characteristic to solve two key problems related to net frequency. First, \textsc{single-nf}, how to compute the net frequency of a given string of length $m$, in an input text of length $n$ over an alphabet of size $σ$. Second, \textsc{all-nf}, given a length-$n$ input text, how to report every string of positive net frequency. Our methods leverage suffix arrays, components of the Burrows-Wheeler transform, and a solution to the coloured range listing problem. We show that, for both problems, our data structure has $O(n)$ construction cost: with this structure, we solve \textsc{single-nf} in $O(m + σ)$ time and \textsc{all-nf} in $O(n)$ time. Experimentally, we find our method to be around 100 times faster than reasonable baselines for \textsc{single-nf}. For \textsc{all-nf}, our results show that, even with prior knowledge of the set of strings with positive net frequency, simply confirming that their net frequency is positive takes longer than with our purpose-designed method.
Submitted 23 April, 2024; v1 submitted 19 April, 2024;
originally announced April 2024.
-
Improved Algorithms for Maximum Coverage in Dynamic and Random Order Streams
Authors:
Amit Chakrabarti,
Andrew McGregor,
Anthony Wirth
Abstract:
The maximum coverage problem is to select $k$ sets from a collection of sets such that the cardinality of the union of the selected sets is maximized. We consider $(1-1/e-ε)$-approximation algorithms for this NP-hard problem in three standard data stream models.
1. {\em Dynamic Model.} The stream consists of a sequence of sets being inserted and deleted. Our multi-pass algorithm uses $ε^{-2} k \cdot \text{polylog}(n,m)$ space. The best previous result (Assadi and Khanna, SODA 2018) used $(n +ε^{-4} k) \text{polylog}(n,m)$ space. While both algorithms use $O(ε^{-1} \log n)$ passes, our analysis shows that when $ε$ is a constant, it is possible to reduce the number of passes by a $1/\log \log n$ factor without incurring additional space.
2. {\em Random Order Model.} In this model, there are no deletions and the sets forming the instance are uniformly randomly permuted to form the input stream. We show that a single pass and $k \text{polylog}(n,m)$ space suffices for arbitrarily small constant $ε$. The best previous result, by Warneke et al.~(ESA 2023), used $k^2 \text{polylog}(n,m)$ space.
3. {\em Insert-Only Model.} Lastly, our results, along with numerous previous results, use a sub-sampling technique introduced by McGregor and Vu (ICDT 2017) to sparsify the input instance. We explain how this technique and others used in the paper can be implemented such that the amortized update time of our algorithm is polylogarithmic. This also implies an improvement of the state-of-the-art insert-only algorithms in terms of the update time: $\text{polylog}(m,n)$ update time suffices whereas the best previous result by Jaud et al.~(SEA 2023) required update time that was linear in $k$.
Submitted 20 March, 2024;
originally announced March 2024.
-
Fast Parallel Algorithms for Submodular $p$-Superseparable Maximization
Authors:
Philip Cervenjak,
Junhao Gan,
Anthony Wirth
Abstract:
Maximizing a non-negative, monotone, submodular function $f$ over $n$ elements under a cardinality constraint $k$ (SMCC) is a well-studied NP-hard problem. It has important applications in, e.g., machine learning and influence maximization. Though the theoretical problem admits polynomial-time approximation algorithms, solving it in practice often involves frequently querying submodular functions that are expensive to compute. This has motivated significant research into designing parallel approximation algorithms in the adaptive complexity model; adaptive complexity (adaptivity) measures the number of sequential rounds of $\text{poly}(n)$ function queries an algorithm requires. The state-of-the-art algorithms can achieve $(1-\frac{1}{e}-\varepsilon)$-approximate solutions with $O(\frac{1}{\varepsilon^2}\log n)$ adaptivity, which approaches the known adaptivity lower-bounds. However, the $O(\frac{1}{\varepsilon^2} \log n)$ adaptivity only applies to maximizing worst-case functions that are unlikely to appear in practice. Thus, in this paper, we consider the special class of $p$-superseparable submodular functions, which places a reasonable constraint on $f$, based on the parameter $p$, and is more amenable to maximization, while also having real-world applicability. Our main contribution is the algorithm LS+GS, a finer-grained version of the existing LS+PGB algorithm, designed for instances of SMCC when $f$ is $p$-superseparable; it achieves an expected $(1-\frac{1}{e}-\varepsilon)$-approximate solution with $O(\frac{1}{\varepsilon^2}\log(p k))$ adaptivity independent of $n$. Additionally, unrelated to $p$-superseparability, our LS+GS algorithm uses only $O(\frac{n}{\varepsilon} + \frac{\log n}{\varepsilon^2})$ oracle queries, which has an improved dependence on $\varepsilon^{-1}$ over the state-of-the-art LS+PGB; this is achieved through the design of a novel thresholding subroutine.
Submitted 2 February, 2024; v1 submitted 21 November, 2023;
originally announced November 2023.
-
Sublinear-Space Streaming Algorithms for Estimating Graph Parameters on Sparse Graphs
Authors:
Xiuge Chen,
Rajesh Chitnis,
Patrick Eades,
Anthony Wirth
Abstract:
In this paper, we design sub-linear space streaming algorithms for estimating three fundamental parameters -- maximum independent set, minimum dominating set and maximum matching -- on sparse graph classes, i.e., graphs which satisfy $m=O(n)$, where $m$ and $n$ are the numbers of edges and vertices, respectively. Each of the three graph parameters we consider can have size $Ω(n)$ even on sparse graph classes, and hence for sublinear-space algorithms we are restricted to parameter estimation instead of attempting to find a solution.
Submitted 26 May, 2023;
originally announced May 2023.
-
Maximum Coverage in Sublinear Space, Faster
Authors:
Stephen Jaud,
Anthony Wirth,
Farhana Choudhury
Abstract:
Given a collection of $m$ sets from a universe $\mathcal{U}$, the Maximum Set Coverage problem consists of finding $k$ sets whose union has the largest cardinality. This problem is NP-Hard, but the solution can be approximated by a polynomial time algorithm up to a factor $1-1/e$. However, this algorithm does not scale well with the input size. In a streaming context, practical high-quality solutions are found, but with space complexity that scales linearly with respect to the size of the universe $|\mathcal{U}|$. However, one randomized streaming algorithm has been shown to produce a $1-1/e-\varepsilon$ approximation of the optimal solution with a space complexity that scales only poly-logarithmically with respect to $m$ and $|\mathcal{U}|$. In order to achieve such a low space complexity, the authors used a technique called subsampling, based on independent-wise hash functions, and $F_0$-sketching. This article focuses on this sublinear-space algorithm and introduces methods to reduce the time cost of subsampling. Firstly, we give some optimizations that do not alter the space complexity, number of passes and approximation quality of the original algorithm. In particular, we reanalyze the error bounds to show that the original independence factor of $Ω(\varepsilon^{-2} k \log m)$ can be fine-tuned to $Ω(k \log m)$. Secondly, we show that $F_0$-sketching can be replaced by a much simpler mechanism. Finally, our experimental results show that even a pairwise-independent hash-function sampler does not produce a worse solution than the original algorithm, while running faster by several orders of magnitude.
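For orientation, the polynomial-time $1-1/e$ approximation mentioned above is usually taken to be the classical offline greedy algorithm; the abstract does not name it, so that identification is an assumption here. The sketch below is that greedy baseline only, not the streaming or subsampling method of the paper, and the names `greedy_max_coverage`, `example`, `picked` and `coverage` are illustrative.

```python
from typing import Dict, Hashable, List, Set


def greedy_max_coverage(sets: Dict[Hashable, Set[int]], k: int) -> List[Hashable]:
    """Pick up to k set names, each time taking the largest marginal gain."""
    covered: Set[int] = set()
    chosen: List[Hashable] = []
    for _ in range(k):
        best_name, best_gain = None, 0
        for name, elems in sets.items():
            if name in chosen:
                continue
            gain = len(elems - covered)  # elements this set would newly cover
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # no remaining set adds a new element
            break
        chosen.append(best_name)
        covered |= sets[best_name]
    return chosen


# Toy instance (hypothetical data): four sets over the universe {1, ..., 7}.
example = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}, "D": {1, 7}}
picked = greedy_max_coverage(example, k=2)
coverage = set().union(*(example[name] for name in picked))
```

Each round rescans every set and needs the covered elements in memory, which is the scaling problem that motivates the sublinear-space approach described in the abstract.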
Submitted 12 December, 2023; v1 submitted 13 February, 2023;
originally announced February 2023.
-
Tight Data Access Bounds for Private Top-$k$ Selection
Authors:
Hao Wu,
Olga Ohrimenko,
Anthony Wirth
Abstract:
We study the top-$k$ selection problem under the differential privacy model: $m$ items are rated according to votes of a set of clients. We consider a setting in which algorithms can retrieve data via a sequence of accesses, each either a random access or a sorted access; the goal is to minimize the total number of data accesses. Our algorithm requires only $O(\sqrt{mk})$ expected accesses: to our knowledge, this is the first sublinear data-access upper bound for this problem. Our analysis also shows that the well-known exponential mechanism requires only $O(\sqrt{m})$ expected accesses. Accompanying this, we develop the first lower bounds for the problem, in three settings: only random accesses; only sorted accesses; a sequence of accesses of either kind. We show that, to avoid $Ω(m)$ access cost, supporting *both* kinds of access is necessary, and that in this case our algorithm's access cost is optimal.
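As background for the exponential mechanism mentioned above, here is a minimal textbook sketch that selects a single item with probability proportional to $\exp(ε \cdot \text{score} / (2Δ))$. It assumes a score sensitivity `sensitivity` (the $Δ$ here) supplied by the caller, and the function names are illustrative. This naive version reads every score, i.e. it makes $m$ accesses, so it does not reproduce the sublinear access pattern analysed in the paper.

```python
import math
import random
from typing import List, Optional, Sequence


def exponential_mechanism_probs(
    scores: Sequence[float], epsilon: float, sensitivity: float = 1.0
) -> List[float]:
    """Selection probabilities proportional to exp(epsilon * score / (2 * sensitivity))."""
    top = max(scores)  # shift by the maximum for numerical stability
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism_select(
    scores: Sequence[float],
    epsilon: float,
    sensitivity: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one item index according to the probabilities above."""
    rng = rng or random.Random()
    probs = exponential_mechanism_probs(scores, epsilon, sensitivity)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


# Hypothetical vote counts for three items.
probs = exponential_mechanism_probs([10.0, 7.0, 3.0], epsilon=1.0)
winner = exponential_mechanism_select([10.0, 7.0, 3.0], epsilon=1.0, rng=random.Random(0))
```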
Submitted 30 May, 2023; v1 submitted 30 January, 2023;
originally announced January 2023.
-
Single Round-trip Hierarchical ORAM via Succinct Indices
Authors:
William Holland,
Olga Ohrimenko,
Anthony Wirth
Abstract:
Access patterns to data stored remotely create a side channel that is known to leak information even if the content of the data is encrypted. To protect against access pattern leakage, Oblivious RAM is a cryptographic primitive that obscures the (actual) access trace at the expense of additional access and periodic shuffling of the server's contents. A class of ORAM solutions, known as Hierarchical ORAM, has achieved theoretically \emph{optimal} logarithmic bandwidth overhead. However, to date, Hierarchical ORAMs are seen as only theoretical artifacts. This is because they require a large number of communication round-trips to locate (shuffled) elements at the server and involve complex building blocks such as cuckoo hash tables.
To address the limitations of Hierarchical ORAM schemes in practice, we introduce Rank ORAM, the first Hierarchical ORAM that can retrieve data with a single round-trip of communication (as compared to a logarithmic number in previous work). To support non-interactive communication, we introduce a \emph{compressed} client-side data structure that stores, implicitly, the location of each element at the server. In addition, this location metadata enables a simple protocol design that dispenses with the need for complex cuckoo hash tables.
Rank ORAM requires asymptotically smaller memory than existing (non-Hierarchical) state-of-the-art practical ORAM schemes (e.g., Ring ORAM) while maintaining comparable bandwidth performance. Our experiments on real network file-system traces demonstrate a reduction in client memory, against existing approaches, of a factor of~$100$. For example, when outsourcing a database of $17.5$TB, required client-memory is only $290$MB vs. $40$GB for standard approaches.
Submitted 12 June, 2024; v1 submitted 15 August, 2022;
originally announced August 2022.
-
Walking to Hide: Privacy Amplification via Random Message Exchanges in Network
Authors:
Hao Wu,
Olga Ohrimenko,
Anthony Wirth
Abstract:
The *shuffle model* is a powerful tool to amplify the privacy guarantees of the *local model* of differential privacy. In contrast to the fully decentralized manner of guaranteeing privacy in the local model, the shuffle model requires a central, trusted shuffler. To avoid this central shuffler, recent work of Liew et al. (2022) proposes shuffling locally randomized data in a decentralized manner, via random walks on the communication network constituted by the clients. The privacy amplification bound it thus provides depends on the topology of the underlying communication network, even for infinitely long random walks. It does not match the state-of-the-art privacy amplification bound for the shuffle model (Feldman et al., 2021).
In this work, we prove that the output of~$n$ clients' data, each perturbed by an $ε_0$-local randomizer, and shuffled by random walks with a logarithmic number of steps, is $( {O} ( (1 - e^{-ε_0} ) \sqrt{ ( e^{ε_0} / n ) \ln (1 / δ) } ), O(δ) )$-differentially private. Importantly, this bound is independent of the topology of the communication network, and asymptotically closes the gap between the privacy amplification bounds for the network shuffle model (Liew et al., 2022) and the shuffle model (Feldman et al., 2021). Our proof is based on a reduction to the shuffle model, and an analysis of the distribution of random walks of finite length. Building on this, we further show that if each client is sampled independently with probability~$p$, the privacy guarantee of the network shuffle model can be further improved to $( {O} ( (1 - e^{-ε_0} ) \sqrt{p ( e^{ε_0} / n ) \ln (1 / δ) } ) , O(δ) )$. Importantly, the subsampling is also performed in a fully decentralized manner that does not require a trusted central entity; compared with related bounds in prior work, our bound is stronger.
Submitted 19 June, 2022;
originally announced June 2022.
-
Randomize the Future: Asymptotically Optimal Locally Private Frequency Estimation Protocol for Longitudinal Data
Authors:
Olga Ohrimenko,
Anthony Wirth,
Hao Wu
Abstract:
Longitudinal data tracking under Local Differential Privacy (LDP) is a challenging task. Baseline solutions that repeatedly invoke a protocol designed for one-time computation lead to linear decay in the privacy or utility guarantee with respect to the number of computations. To avoid this, the recent approach of Erlingsson et al. (2020) exploits the potential sparsity of user data that changes only infrequently. Their protocol targets the fundamental problem of frequency estimation for longitudinal binary data, with $\ell_\infty$ error of $O ( (1 / ε) \cdot (\log d)^{3 / 2} \cdot k \cdot \sqrt{ n \cdot \log ( d / β) } )$, where $ε$ is the privacy budget, $d$ is the number of time periods, $k$ is the maximum number of changes of user data, and $β$ is the failure probability. Notably, the error bound scales polylogarithmically with $d$, but linearly with $k$.
In this paper, we break through the linear dependence on $k$ in the estimation error. Our new protocol has error $O ( (1 / ε) \cdot (\log d) \cdot \sqrt{ k \cdot n \cdot \log ( d / β) } )$, matching the lower bound up to a logarithmic factor. The protocol is an online one, that outputs an estimate at each time period. The key breakthrough is a new randomizer for sequential data, FutureRand, with two key features. The first is a composition strategy that correlates the noise across the non-zero elements of the sequence. The second is a pre-computation technique which, by exploiting the symmetry of input space, enables the randomizer to output the results on the fly, without knowing future inputs. Our protocol closes the error gap between existing online and offline algorithms.
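To make the size of the improvement concrete, the two error bounds quoted in this abstract can be divided term by term. This is a worked comparison of the stated asymptotic expressions only; the hidden constants are ignored.

```latex
\frac{(1/ε) \cdot (\log d)^{3/2} \cdot k \cdot \sqrt{n \cdot \log(d/β)}}
     {(1/ε) \cdot (\log d) \cdot \sqrt{k \cdot n \cdot \log(d/β)}}
  = (\log d)^{1/2} \cdot \frac{k}{\sqrt{k}}
  = \sqrt{k \cdot \log d}
```

So, up to constants, the new bound is smaller than the earlier one by a factor of $\sqrt{k \cdot \log d}$: the dependence on $k$ drops from linear to $\sqrt{k}$, and one factor of $\sqrt{\log d}$ is saved.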
Submitted 11 April, 2022; v1 submitted 22 December, 2021;
originally announced December 2021.
-
Dynamic Structural Clustering on Graphs
Authors:
Boyu Ruan,
Junhao Gan,
Hao Wu,
Anthony Wirth
Abstract:
Structural Clustering ($StrClu$) is one of the most popular graph clustering paradigms. In this paper, we consider $StrClu$ under two commonly adopted similarities, namely Jaccard similarity and cosine similarity on a dynamic graph, $G = \langle V, E\rangle$, subject to edge insertions and deletions (updates). The goal is to maintain certain information under updates, so that the $StrClu$ clustering result on~$G$ can be retrieved in $O(|V| + |E|)$ time, upon request. The state-of-the-art worst-case cost is $O(|V|)$ per update; we improve this update-time bound significantly with the $ρ$-approximate notion. Specifically, for a specified failure probability, $δ^*$, and every sequence of $M$ updates (no need to know $M$'s value in advance), our algorithm, $DynELM$, achieves $O(\log^2 |V| + \log |V| \cdot \log \frac{M}{δ^*})$ amortized cost for each update, at all times in linear space. Moreover, $DynELM$ provides a provable "sandwich" guarantee on the clustering quality at all times after \emph{each update} with probability at least $1 - δ^*$. We further develop $DynELM$ into our ultimate algorithm, $DynStrClu$, which also supports cluster-group-by queries. Given $Q\subseteq V$, this puts the non-empty intersection of $Q$ and each $StrClu$ cluster into a distinct group. $DynStrClu$ not only achieves all the guarantees of $DynELM$, but also runs cluster-group-by queries in $O(|Q|\cdot \log |V|)$ time. We demonstrate the performance of our algorithms via extensive experiments, on 15 real datasets. Experimental results confirm that our algorithms are up to three orders of magnitude more efficient than state-of-the-art competitors, and still provide quality structural clustering results. Furthermore, we study the difference between the two similarities w.r.t. the quality of approximate clustering results.
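The two similarities named above can be illustrated on a small static graph. The sketch below assumes the common structural-clustering convention of measuring similarity between closed neighbourhoods (each vertex counted among its own neighbours); the abstract does not state which convention the paper uses, and this exact recomputation is unrelated to the approximate dynamic maintenance performed by $DynELM$. All names are illustrative.

```python
import math
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # adjacency sets of an undirected graph


def closed_neighbourhood(graph: Graph, v: Hashable) -> Set[Hashable]:
    """Neighbours of v together with v itself (assumed convention)."""
    return graph[v] | {v}


def jaccard_similarity(graph: Graph, u: Hashable, v: Hashable) -> float:
    nu, nv = closed_neighbourhood(graph, u), closed_neighbourhood(graph, v)
    return len(nu & nv) / len(nu | nv)


def cosine_similarity(graph: Graph, u: Hashable, v: Hashable) -> float:
    nu, nv = closed_neighbourhood(graph, u), closed_neighbourhood(graph, v)
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))


# Toy graph (hypothetical): a triangle 1-2-3 with a pendant vertex 4 attached to 3.
graph: Graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
jac_13 = jaccard_similarity(graph, 1, 3)  # 3 shared out of 4 in the union
cos_13 = cosine_similarity(graph, 1, 3)   # 3 / sqrt(3 * 4)
```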
Submitted 25 August, 2021;
originally announced August 2021.
-
Asymptotically Optimal Locally Private Heavy Hitters via Parameterized Sketches
Authors:
Hao Wu,
Anthony Wirth
Abstract:
We present two new local differentially private algorithms for frequency estimation. One solves the fundamental frequency oracle problem; the other solves the well-known heavy hitters identification problem. Consistent with prior art, these are randomized algorithms. As a function of failure probability~$β$, the former achieves optimal worst-case estimation error for every~$β$, while the latter is optimal when~$β$ is at least inverse polynomial in~$n$, the number of users. In both algorithms, server running time is~$\tilde{O}(n)$ while user running time is~$\tilde{O}(1)$. Our frequency-oracle algorithm achieves lower estimation error than the prior works of Bassily et al. (NeurIPS 2017). On the other hand, our heavy hitters identification method is as easily implementable as TreeHist (Bassily et al., 2017) and has superior worst-case error, by a factor of $Ω(\sqrt{\log n})$.
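For readers new to locally private frequency estimation, the simplest instance is binary randomized response with a debiased count on the server. The sketch below shows only that textbook baseline, assuming each user holds one bit and a privacy parameter `epsilon` greater than zero; it is not the sketch-based frequency oracle or heavy-hitters algorithm of the paper, and the function names are illustrative.

```python
import math
import random
from typing import List


def randomized_response(bit: int, epsilon: float, rng: random.Random) -> int:
    """User side: keep the true bit with probability e^eps / (1 + e^eps), else flip it."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < keep_prob else 1 - bit


def estimate_frequency(reports: List[int], epsilon: float) -> float:
    """Server side: unbiased estimate of how many users hold bit 1."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    n = len(reports)
    # E[sum(reports)] = c * p + (n - c) * (1 - p), solved here for the true count c.
    return (sum(reports) - n * (1.0 - p)) / (2.0 * p - 1.0)


# Hypothetical population: 300 users hold 1, 700 hold 0.
rng = random.Random(7)
true_bits = [1] * 300 + [0] * 700
reports = [randomized_response(b, epsilon=1.0, rng=rng) for b in true_bits]
estimate = estimate_frequency(reports, epsilon=1.0)
```

The estimate is noisy for any single run; only its expectation equals the true count.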
Submitted 16 February, 2022; v1 submitted 14 June, 2021;
originally announced June 2021.
-
Parameterized Correlation Clustering in Hypergraphs and Bipartite Graphs
Authors:
Nate Veldt,
Anthony Wirth,
David F. Gleich
Abstract:
Motivated by applications in community detection and dense subgraph discovery, we consider new clustering objectives in hypergraphs and bipartite graphs. These objectives are parameterized by one or more resolution parameters in order to enable diverse knowledge discovery in complex data.
For both hypergraph and bipartite objectives, we identify parameter regimes that are equivalent to existing objectives and share their (polynomial-time) approximation algorithms. We first show that our parameterized hypergraph correlation clustering objective is related to higher-order notions of normalized cut and modularity in hypergraphs. It is further amenable to approximation algorithms via hyperedge expansion techniques.
Our parameterized bipartite correlation clustering objective generalizes standard unweighted bipartite correlation clustering, as well as bicluster deletion. For a certain choice of parameters it is also related to our hypergraph objective. Although in general it is NP-hard, we highlight a parameter regime for the bipartite objective where the problem reduces to the bipartite matching problem and thus can be solved in polynomial time. For other parameter settings, we present approximation algorithms using linear program rounding techniques. These results allow us to introduce the first constant-factor approximation for bicluster deletion, the task of removing a minimum number of edges to partition a bipartite graph into disjoint bi-cliques.
In several experimental results, we highlight the flexibility of our framework and the diversity of results that can be obtained in different parameter settings. This includes clustering bipartite graphs across a range of parameters, detecting motif-rich clusters in an email network and a food web, and forming clusters of retail products in a product review hypergraph that are highly correlated with known product categories.
Submitted 19 June, 2020; v1 submitted 21 February, 2020;
originally announced February 2020.
-
Graph Clustering in All Parameter Regimes
Authors:
Junhao Gan,
David F. Gleich,
Nate Veldt,
Anthony Wirth,
Xin Zhang
Abstract:
Resolution parameters in graph clustering represent a size and quality trade-off. We address the task of efficiently solving a parameterized graph clustering objective for all values of a resolution parameter. Specifically, we consider an objective we call LambdaPrime, involving a parameter $λ\in (0,1)$. This objective is related to other parameterized clustering problems, such as parametric generalizations of modularity, and captures a number of specific clustering problems as special cases, including sparsest cut and cluster deletion. While previous work provides approximation results for a single resolution parameter, we seek a set of approximately optimal clusterings for all values of $λ$ in polynomial time. In particular, we ask the question, how small a family of clusterings suffices to optimize -- or to approximately optimize -- the LambdaPrime objective over the full possible spectrum of $λ$?
We obtain a family of logarithmically many clusterings by solving the parametric linear programming relaxation of LambdaPrime at a logarithmic number of parameter values, and rounding their solutions using existing approximation algorithms. We prove that this number is tight up to a constant factor. Specifically, for a certain class of ring graphs, a logarithmic number of feasible solutions is required to provide a constant-factor approximation for the LambdaPrime LP relaxation in all parameter regimes. We additionally show that for any graph with $n$ nodes and $m$ edges, there exists a set of $m$ or fewer clusterings such that for every $λ\in (0,1)$, the family contains an exact solution to the LambdaPrime objective. There also exists a set of $O(\log n)$ clusterings that provide a $(1+\varepsilon)$-approximate solution in all parameter regimes; we demonstrate simple graph classes for which these bounds are tight.
Submitted 14 October, 2019;
originally announced October 2019.
-
Learning Resolution Parameters for Graph Clustering
Authors:
Nate Veldt,
David F. Gleich,
Anthony Wirth
Abstract:
Finding clusters of well-connected nodes in a graph is an extensively studied problem in graph-based data analysis. Because of its many applications, a large number of distinct graph clustering objective functions and algorithms have already been proposed and analyzed. To aid practitioners in determining the best clustering approach to use in different applications, we present new techniques for automatically learning how to set clustering resolution parameters. These parameters control the size and structure of communities that are formed by optimizing a generalized objective function. We begin by formalizing the notion of a parameter fitness function, which measures how well a fixed input clustering approximately solves a generalized clustering objective for a specific resolution parameter value. Under reasonable assumptions, which suit two key graph clustering applications, such a parameter fitness function can be efficiently minimized using a bisection-like method, yielding a resolution parameter that fits well with the example clustering. We view our framework as a type of single-shot hyperparameter tuning, as we are able to learn a good resolution parameter with just a single example. Our general approach can be applied to learn resolution parameters for both local and global graph clustering objectives. We demonstrate its utility in several experiments on real-world data where it is helpful to learn resolution parameters from a given example clustering.
Submitted 12 March, 2019;
originally announced March 2019.
-
Correlation Clustering in Data Streams
Authors:
Kook Jin Ahn,
Graham Cormode,
Sudipto Guha,
Andrew McGregor,
Anthony Wirth
Abstract:
Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as $k$-center, $k$-median, and $k$-means. Such algorithms need to be both time and space efficient. In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on $n$ nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, $O(n\cdot \mbox{polylog}~n)$-space approximation algorithms for natural problems that arise.
We first develop data structures based on linear sketches that allow the "quality" of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. Unfortunately, the standard LP and SDP formulations are not obviously solvable in $O(n\cdot \mbox{polylog}~n)$-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling.
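The "quality" of a node-partition in correlation clustering is commonly measured by its weighted disagreements: positive-weight edges that are cut plus negative-weight edges kept inside a cluster. The sketch below computes that quantity exactly with the whole graph in memory. It assumes this disagreement-minimisation form of the objective and is not the linear-sketch estimator developed in the paper; the names `disagreement_cost`, `weights`, `together` and `singletons` are illustrative.

```python
from typing import Dict, Hashable, Tuple

Edge = Tuple[Hashable, Hashable]


def disagreement_cost(weights: Dict[Edge, float], cluster_of: Dict[Hashable, int]) -> float:
    """Total |weight| of edges whose sign conflicts with the partition."""
    cost = 0.0
    for (u, v), w in weights.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        if w > 0 and not same_cluster:
            cost += w    # similar pair separated
        elif w < 0 and same_cluster:
            cost += -w   # dissimilar pair grouped together
    return cost


# Toy signed graph (hypothetical weights).
weights = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): -1.0, ("c", "d"): 2.0}
together = {"a": 0, "b": 0, "c": 0, "d": 0}    # one big cluster
singletons = {"a": 0, "b": 1, "c": 2, "d": 3}  # every node alone
cost_together = disagreement_cost(weights, together)
cost_singletons = disagreement_cost(weights, singletons)
```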
Submitted 5 December, 2018;
originally announced December 2018.
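The objective described in the abstract above can be illustrated with a minimal sketch. It computes the disagreement cost of a fixed node-partition directly from a full edge-weight table, which is exactly what a streaming algorithm cannot afford to store; it is not the sketch-based data structure from the paper. The names `disagreement_cost`, `weights`, and `cluster_of` are hypothetical.

```python
def disagreement_cost(weights, cluster_of):
    """Sum the weight of every edge that disagrees with the partition.

    weights:    dict mapping a node pair (u, v) to a signed edge weight
                (hypothetical input format, chosen for illustration).
    cluster_of: dict mapping each node to its cluster label.
    """
    cost = 0.0
    for (u, v), w in weights.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        if w > 0 and not same_cluster:
            # Positive edge cut by the partition.
            cost += w
        elif w < 0 and same_cluster:
            # Negative edge kept inside one cluster.
            cost += -w
    return cost


weights = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): -1.0}
one_cluster = {"a": 0, "b": 0, "c": 0}
singletons = {"a": 0, "b": 1, "c": 2}
print(disagreement_cost(weights, one_cluster))
print(disagreement_cost(weights, singletons))
```

On this three-node example no partition has zero cost, because the two positive edges and the one negative edge cannot all be satisfied at once.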
-
Correlation Clustering Generalized
Authors:
David F. Gleich,
Nate Veldt,
Anthony Wirth
Abstract:
We present new results for LambdaCC and MotifCC, two recently introduced variants of the well-studied correlation clustering problem. Both variants are motivated by applications to network analysis and community detection, and have non-trivial approximation algorithms. We first show that the standard linear programming relaxation of LambdaCC has a $Θ(\log n)$ integrality gap for a certain choice of the parameter $λ$. This sheds light on previous challenges encountered in obtaining parameter-independent approximation results for LambdaCC. We generalize a previous constant-factor algorithm to provide the best results, from the LP-rounding approach, for an extended range of $λ$. MotifCC generalizes correlation clustering to the hypergraph setting. In the case of hyperedges of degree $3$ with weights satisfying probability constraints, we improve the best approximation factor from $9$ to $8$. We show that in general our algorithm gives a $4(k-1)$ approximation when hyperedges have maximum degree $k$ and probability weights. We additionally present approximation results for LambdaCC and MotifCC where we restrict to forming only two clusters.
Submitted 25 September, 2018;
originally announced September 2018.
-
A Projection Method for Metric-Constrained Optimization
Authors:
Nate Veldt,
David Gleich,
Anthony Wirth,
James Saunderson
Abstract:
We outline a new approach for solving optimization problems which enforce triangle inequalities on output variables. We refer to this as metric-constrained optimization, and give several examples where problems of this form arise in machine learning applications and theoretical approximation algorithms for graph clustering. Although these problems are interesting from a theoretical perspective, they are challenging to solve in practice due to the high memory requirement of black-box solvers. In order to address this challenge we first prove that the metric-constrained linear program relaxation of correlation clustering is equivalent to a special case of the metric nearness problem. We then develop a general solver for metric-constrained linear and quadratic programs by generalizing and improving a simple projection algorithm originally developed for metric nearness. We give several novel approximation guarantees for using our framework to find lower bounds for optimal solutions to several challenging graph clustering problems. We also demonstrate the power of our framework by solving optimization problems involving up to $10^{8}$ variables and $10^{11}$ constraints.
Submitted 5 June, 2018;
originally announced June 2018.
-
Unifying Sparsest Cut, Cluster Deletion, and Modularity Clustering Objectives with Correlation Clustering
Authors:
Nate Veldt,
David Gleich,
Anthony Wirth
Abstract:
Graph clustering, or community detection, is the task of identifying groups of closely related objects in a large network. In this paper we introduce a new community-detection framework called LambdaCC that is based on a specially weighted version of correlation clustering. A key component in our methodology is a clustering resolution parameter, $λ$, which implicitly controls the size and structure of clusters formed by our framework. We show that, by increasing this parameter, our objective effectively interpolates between two different strategies in graph clustering: finding a sparse cut and forming dense subgraphs. Our methodology unifies and generalizes a number of other important clustering quality functions including modularity, sparsest cut, and cluster deletion, and places them all within the context of an optimization problem that has been well studied from the perspective of approximation algorithms. Our approach is particularly relevant in the regime of finding dense clusters, as it leads to a 2-approximation for the cluster deletion problem. We use our approach to cluster several graphs, including large collaboration networks and social networks.
Submitted 13 July, 2018; v1 submitted 15 December, 2017;
originally announced December 2017.
-
Correlation Clustering with Low-Rank Matrices
Authors:
Nate Veldt,
Anthony Wirth,
David F. Gleich
Abstract:
Correlation clustering is a technique for aggregating data based on qualitative information about which pairs of objects are labeled 'similar' or 'dissimilar.' Because the optimization problem is NP-hard, much of the previous literature focuses on finding approximation algorithms. In this paper we explore how to solve the correlation clustering objective exactly when the data to be clustered can be represented by a low-rank matrix. We prove in particular that correlation clustering can be solved in polynomial time when the underlying matrix is positive semidefinite with small constant rank, but that the task remains NP-hard in the presence of even one negative eigenvalue. Based on our theoretical results, we develop an algorithm for efficiently "solving" low-rank positive semidefinite correlation clustering by employing a procedure for zonotope vertex enumeration. We demonstrate the effectiveness and speed of our algorithm by using it to solve several clustering problems on both synthetic and real-world data.
Submitted 17 March, 2017; v1 submitted 21 November, 2016;
originally announced November 2016.
-
Efficient Parallel Algorithms for k-Center Clustering
Authors:
Jessica McClintock,
Anthony Wirth
Abstract:
The k-center problem is one of several classic NP-hard clustering questions. For contemporary massive data sets, RAM-based algorithms become impractical. And although there exist good sequential algorithms for k-center, they are not easily parallelizable.
In this paper, we design and implement parallel approximation algorithms for this problem. We observe that Gonzalez's greedy algorithm can be efficiently parallelized in several MapReduce rounds; in practice, we find that two rounds are sufficient, leading to a 4-approximation. We contrast this with an existing parallel algorithm for k-center that runs in a constant number of rounds, and offers a 10-approximation. In-depth runtime analysis reveals that this scheme is often slow, and that its sampling procedure only runs if k is sufficiently small, relative to the input size. To trade off runtime for approximation guarantee, we parameterize this sampling algorithm, and find in our experiments that the algorithm is not only faster, but sometimes more effective. Yet the parallel version of Gonzalez is about 100 times faster than both its sequential version and the parallel sampling algorithm, barely compromising solution quality.
Submitted 11 April, 2016;
originally announced April 2016.
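For readers unfamiliar with Gonzalez's greedy algorithm named in the abstract above, here is a minimal sequential sketch (farthest-first traversal), which is known to give a 2-approximation for k-center. It is not the MapReduce parallelization studied in the paper. The function name `gonzalez_k_center`, the choice of the first point as the initial center, and the use of Euclidean distance are assumptions made for illustration.

```python
import math


def gonzalez_k_center(points, k):
    """Farthest-first traversal: returns (centers, covering radius).

    points: list of coordinate tuples (Euclidean distance is assumed).
    k:      number of centers to pick.
    """
    if not points or k <= 0:
        raise ValueError("need at least one point and k >= 1")
    # Start from an arbitrary point; here, simply the first one.
    centers = [points[0]]
    # dist[i] is the distance from points[i] to its nearest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # Pick the point currently farthest from all chosen centers.
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[far])
        # Update nearest-center distances with the new center.
        dist = [min(d, math.dist(p, points[far])) for d, p in zip(dist, points)]
    return centers, max(dist)


points = [(0, 0), (1, 0), (10, 0), (11, 0)]
centers, radius = gonzalez_k_center(points, 2)
print(centers, radius)
```

Each round needs one pass over the points to update `dist`, so the sequential cost is proportional to k times the number of points.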
-
Access Time Tradeoffs in Archive Compression
Authors:
Matthias Petri,
Alistair Moffat,
P. C. Nagesh,
Anthony Wirth
Abstract:
Web archives, query and proxy logs, and so on, can all be very large and highly repetitive; and are accessed only sporadically and partially, rather than continually and holistically. This type of data is ideal for compression-based archiving, provided that random access to small fragments of the original data can be achieved without needing to decompress everything. The recent RLZ (relative Lempel-Ziv) compression approach uses a semi-static model extracted from the text to be compressed, together with a greedy factorization of the whole text encoded using static integer codes. Here we demonstrate more precisely than before the scenarios in which RLZ excels. We contrast RLZ with alternatives based on block-based adaptive methods, including approaches that "prime" the encoding for each block, and measure a range of implementation options using both hard-disk (HDD) and solid-state disk (SSD) drives. For HDD, the dominant factor affecting access speed is the compression rate achieved, even when this involves larger dictionaries and larger blocks. When the data is on SSD the same effects are present, but not as markedly, and more complex trade-offs apply.
Submitted 29 February, 2016;
originally announced February 2016.
-
Incidence Geometries and the Pass Complexity of Semi-Streaming Set Cover
Authors:
Amit Chakrabarti,
Anthony Wirth
Abstract:
Set cover, over a universe of size $n$, may be modelled as a data-streaming problem, where the $m$ sets that comprise the instance are to be read one by one. A semi-streaming algorithm is allowed only $O(n\, \mathrm{poly}\{\log n, \log m\})$ space to process this stream. For each $p \ge 1$, we give a very simple deterministic algorithm that makes $p$ passes over the input stream and returns an appropriately certified $(p+1)n^{1/(p+1)}$-approximation to the optimum set cover. More importantly, we proceed to show that this approximation factor is essentially tight, by showing that a factor better than $0.99\,n^{1/(p+1)}/(p+1)^2$ is unachievable for a $p$-pass semi-streaming algorithm, even allowing randomisation. In particular, this implies that achieving a $Θ(\log n)$-approximation requires $Ω(\log n/\log\log n)$ passes, which is tight up to the $\log\log n$ factor. These results extend to a relaxation of the set cover problem where we are allowed to leave an $\varepsilon$ fraction of the universe uncovered: the tight bounds on the best approximation factor achievable in $p$ passes turn out to be $Θ_p(\min\{n^{1/(p+1)}, \varepsilon^{-1/p}\})$. Our lower bounds are based on a construction of a family of high-rank incidence geometries, which may be thought of as vast generalisations of affine planes. This construction, based on algebraic techniques, appears flexible enough to find other applications and is therefore interesting in its own right.
Submitted 16 July, 2015;
originally announced July 2015.
-
Large-Scale Pattern Search Using Reduced-Space On-Disk Suffix Arrays
Authors:
Simon Gog,
Alistair Moffat,
J. Shane Culpepper,
Andrew Turpin,
Anthony Wirth
Abstract:
The suffix array is an efficient data structure for in-memory pattern search. Suffix arrays can also be used for external-memory pattern search, via two-level structures that use an internal index to identify the correct block of suffix pointers. In this paper we describe a new two-level suffix array-based index structure that requires significantly less disk space than previous approaches. Key to the saving is the use of disk blocks that are based on prefixes rather than the more usual uniform-sampling approach, allowing reductions between blocks and subparts of other blocks. We also describe a new in-memory structure based on a condensed BWT string, and show that it allows common patterns to be resolved without access to the text. Experiments using 64 GB of English web text and a laptop computer with just 4 GB of main memory demonstrate the speed and versatility of the new approach. For this data the index is around one-third the size of previous two-level mechanisms; and the memory footprint of as little as 1% of the text size means that queries can be processed more quickly than is possible with a compact FM-INDEX.
Submitted 26 March, 2013;
originally announced March 2013.
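The in-memory pattern search that the abstract above takes as its starting point can be sketched as two binary searches over a suffix array. This is a plain textbook version with a naive construction, assumed here only for illustration; it does not implement the two-level on-disk structure or the condensed BWT string from the paper. The names `build_suffix_array` and `find_occurrences` are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by suffix content.

    Fine for small examples; practical indexes use linear-time builders.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text."""
    m = len(pattern)
    # First binary search: leftmost suffix whose m-prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Second binary search: leftmost suffix whose m-prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes in sa[start:lo] begin with the pattern.
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ana"))
```

Every comparison reads the text at a position given by the suffix array, which is the access pattern that becomes expensive once the text no longer fits in memory.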