-
Improved Space-Efficient Approximate Nearest Neighbor Search Using Function Inversion
Authors:
Samuel McCauley
Abstract:
Approximate nearest neighbor search (ANN) data structures have widespread applications in machine learning, computational biology, and text processing. The goal of ANN is to preprocess a set S so that, given a query q, we can find a point y whose distance from q approximates the smallest distance from q to any point in S. For most distance functions, the best-known ANN bounds for high-dimensional point sets are obtained using techniques based on locality-sensitive hashing (LSH).
Unfortunately, space efficiency is a major challenge for LSH-based data structures. Classic LSH techniques require a very large amount of space, oftentimes polynomial in |S|. A long line of work has developed intricate techniques to reduce this space usage, but these techniques suffer from downsides: they must be hand tailored to each specific LSH, are often complicated, and their space reduction comes at the cost of significantly increased query times.
In this paper we explore a new way to improve the space efficiency of LSH using function inversion techniques, originally developed in (Fiat and Naor 2000).
We begin by describing how function inversion can be used to improve LSH data structures. This gives a fairly simple, black box method to reduce LSH space usage.
Then, we give a data structure that leverages function inversion to improve the query time of the best known near-linear space data structure for approximate nearest neighbor search under Euclidean distance: the ALRW data structure of (Andoni, Laarhoven, Razenshteyn, and Waingarten 2017). ALRW was previously shown to be optimal among "list-of-points" data structures for both Euclidean and Manhattan ANN; thus, in addition to giving improved bounds, our results imply that list-of-points data structures are not optimal for Euclidean or Manhattan ANN.
Submitted 2 July, 2024;
originally announced July 2024.
-
SPIDER: Improved Succinct Rank and Select Performance
Authors:
Matthew D. Laws,
Jocelyn Bliven,
Kit Conklin,
Elyes Laalai,
Samuel McCauley,
Zach S. Sturdevant
Abstract:
Rank and select data structures seek to preprocess a bit vector to quickly answer two kinds of queries: rank(i) gives the number of 1 bits in slots 0 through i, and select(j) gives the first slot s with rank(s) = j. A succinct data structure can answer these queries while using space much smaller than the size of the original bit vector.
State-of-the-art succinct rank and select data structures use as little as 4% extra space while answering rank and select queries quickly. Rank queries can be answered using only a handful of array accesses. Select queries can be answered by starting with similar array accesses, followed by a linear scan.
Despite these strong results, a tradeoff remains: data structures that use under 4% space are significantly slower at answering rank and select queries than less-space-efficient data structures (using, say, > 20% extra space).
In this paper we make significant progress towards closing this gap. We give a new data structure, SPIDER, which uses 3.82% extra space. SPIDER gives the best rank query time for data sets of 8 billion or more bits, even compared to less space-efficient data structures. For select queries, SPIDER outperforms all data structures that use less than 4% space, and significantly closes the gap in select performance between data structures with less than 4% space, and those that use more (over 20%) space.
SPIDER makes two main technical contributions. For rank queries, it improves performance by interleaving the metadata with the bit vector to improve cache efficiency. For select queries, it uses predictions to almost eliminate the cost of the linear scan. These predictions are inspired by recent results on data structures with machine-learned predictions, adapted to the succinct data structure setting. Our results hold on both real and synthetic data, showing that these predictions are effective in practice.
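The rank and select queries defined in this abstract can be illustrated with a minimal sketch. This is not SPIDER and it is not succinct: the class name `SampledRankSelect`, the `block` parameter, and the per-block prefix-count layout are assumptions made for illustration. It only mirrors the general shape described above, where a few metadata lookups are followed by a short linear scan.

```python
from bisect import bisect_left
from typing import List


class SampledRankSelect:
    """Toy rank/select over a list of 0/1 ints (hypothetical, not SPIDER)."""

    def __init__(self, bits: List[int], block: int = 8) -> None:
        self.bits = bits
        self.block = block
        # prefix[k] = number of 1 bits strictly before block k.
        self.prefix = [0]
        for start in range(0, len(bits), block):
            self.prefix.append(self.prefix[-1] + sum(bits[start:start + block]))

    def rank(self, i: int) -> int:
        """Number of 1 bits in slots 0 through i (inclusive)."""
        k = i // self.block
        # One metadata lookup, then a scan inside a single block.
        return self.prefix[k] + sum(self.bits[k * self.block:i + 1])

    def select(self, j: int) -> int:
        """First slot s with rank(s) == j, for 1 <= j <= number of 1 bits."""
        if not 1 <= j <= self.prefix[-1]:
            raise ValueError("j out of range")
        # Find the block k with prefix[k] < j <= prefix[k + 1].
        k = bisect_left(self.prefix, j) - 1
        count = self.prefix[k]
        # Linear scan inside the block for the j-th 1 bit.
        for s in range(k * self.block, min(len(self.bits), (k + 1) * self.block)):
            count += self.bits[s]
            if count == j:
                return s
        raise AssertionError("unreachable if prefix counts are consistent")
```

The linear scan at the end of `select` is the step the abstract says SPIDER nearly eliminates using predictions; this sketch makes no attempt to do so.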
Submitted 8 May, 2024;
originally announced May 2024.
-
Root-to-Leaf Scheduling in Write-Optimized Trees
Authors:
Christopher Chung,
William Jannen,
Samuel McCauley,
Bertrand Simon
Abstract:
Write-optimized dictionaries are a class of cache-efficient data structures that buffer updates and apply them in batches to optimize the amortized cache misses per update. For example, a B^epsilon tree inserts updates as messages at the root. B^epsilon trees only move ("flush") messages when they have total size close to a cache line, optimizing the amount of work done per cache line written. Thus, recently-inserted messages reside at or near the root and are only flushed down the tree after a sufficient number of new messages arrive. Although this lazy approach works well for many operations, some types of updates do not complete until the update message reaches a leaf. For example, deferred queries and secure deletes must flush through all nodes along their root-to-leaf path before taking effect. What happens when we want to service a large number of (say) secure deletes as quickly as possible? Classic techniques leave us with an unsavory choice. On the one hand, we can group the delete messages using a write-optimized approach and move them down the tree lazily. But then many individual deletes may be left incomplete for an extended period of time, as their messages wait to be grouped with a sufficiently large number of related messages. On the other hand, we can ignore cache efficiency and perform a root-to-leaf flush for each delete. This begins work on individual deletes immediately, but harms system throughput. This paper investigates a new framework for efficiently flushing collections of messages from the root to their leaves in a write-optimized data structure. Our goal is to minimize the average time at which messages reach the leaves. We give an algorithm that O(1)-approximates the optimal average completion time in this model. Along the way, we give a new 4-approximation algorithm for scheduling parallel tasks for weighted completion time with tree precedence constraints.
Submitted 26 April, 2024;
originally announced April 2024.
-
Incremental Topological Ordering and Cycle Detection with Predictions
Authors:
Samuel McCauley,
Benjamin Moseley,
Aidin Niaparast,
Shikha Singh
Abstract:
This paper leverages the framework of algorithms-with-predictions to design data structures for two fundamental dynamic graph problems: incremental topological ordering and cycle detection. In these problems, the input is a directed graph on $n$ nodes, and the $m$ edges arrive one by one. The data structure must maintain a topological ordering of the vertices at all times and detect if the newly inserted edge creates a cycle. The theoretically best worst-case algorithms for these problems have high update cost (polynomial in $n$ and $m$). In practice, greedy heuristics (that recompute the solution from scratch each time) perform well but can have high update cost in the worst case.
In this paper, we bridge this gap by leveraging predictions to design a new learned data structure for the problems. Our data structure guarantees consistency, robustness, and smoothness with respect to predictions -- that is, it has the best possible running time under perfect predictions, never performs worse than the best-known worst-case methods, and its running time degrades smoothly with the prediction error. Moreover, we demonstrate empirically that predictions, learned from a very small training dataset, are sufficient to provide significant speed-ups on real datasets.
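For context, the recompute-from-scratch baseline mentioned in this abstract can be sketched as follows. This is not the paper's learned data structure. The function names `topological_order` and `insert_edges` are hypothetical, and Kahn's algorithm is swapped in as one standard way to recompute an ordering after each insertion.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def topological_order(n: int, adj: Dict[int, List[int]]) -> Optional[List[int]]:
    """Kahn's algorithm: return a topological order, or None if there is a cycle."""
    indeg = [0] * n
    for u in adj:
        for v in adj[u]:
            indeg[v] += 1
    queue = deque(v for v in range(n) if indeg[v] == 0)
    order: List[int] = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    # If some node was never freed, it lies on or behind a cycle.
    return order if len(order) == n else None


def insert_edges(
    n: int, edges: List[Tuple[int, int]]
) -> Tuple[List[int], Optional[int]]:
    """Insert edges one by one, recomputing the order from scratch each time.

    Returns the last valid order and the index of the first edge that
    creates a cycle (None if no edge does).
    """
    adj: Dict[int, List[int]] = {u: [] for u in range(n)}
    order = list(range(n))
    for idx, (u, v) in enumerate(edges):
        adj[u].append(v)
        new_order = topological_order(n, adj)
        if new_order is None:
            return order, idx
        order = new_order
    return order, None
```

Each insertion here costs time linear in the current graph size, which is the high worst-case update cost the abstract refers to.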
Submitted 16 February, 2024;
originally announced February 2024.
-
Online List Labeling with Predictions
Authors:
Samuel McCauley,
Benjamin Moseley,
Aidin Niaparast,
Shikha Singh
Abstract:
A growing line of work shows how learned predictions can be used to break through worst-case barriers to improve the running time of an algorithm. However, incorporating predictions into data structures with strong theoretical guarantees remains underdeveloped. This paper takes a step in this direction by showing that predictions can be leveraged in the fundamental online list labeling problem. In the problem, n items arrive over time and must be stored in sorted order in an array of size Theta(n). The array slot of an element is its label and the goal is to maintain sorted order while minimizing the total number of elements moved (i.e., relabeled). We design a new list labeling data structure and bound its performance in two models. In the worst-case learning-augmented model, we give guarantees in terms of the error in the predictions. Our data structure provides strong guarantees: it is optimal for any prediction error and guarantees the best-known worst-case bound even when the predictions are entirely erroneous. We also consider a stochastic error model and bound the performance in terms of the expectation and variance of the error. Finally, the theoretical results are demonstrated empirically. In particular, we show that our data structure has strong performance on real temporal data sets where predictions are constructed from elements that arrived in the past, as is typically done in a practical use case.
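The list labeling problem stated in this abstract can be made concrete with a naive baseline. This is not the paper's data structure and it uses no predictions. The class name `NaiveListLabeling`, the midpoint placement rule, and the rebuild-everything fallback are all assumptions chosen for brevity; the sketch only shows what a label is and how moves are counted.

```python
import bisect
from typing import List


class NaiveListLabeling:
    """Keeps items sorted in an array of `capacity` slots and counts relabelings."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: List[int] = []   # items in sorted order
        self.labels: List[int] = []  # labels[k] is the array slot of items[k]
        self.moves = 0               # total number of label assignments/changes

    def insert(self, x: int) -> None:
        if len(self.items) >= self.capacity:
            raise ValueError("array is full")
        k = bisect.bisect_left(self.items, x)
        lo = self.labels[k - 1] if k > 0 else -1
        hi = self.labels[k] if k < len(self.labels) else self.capacity
        self.items.insert(k, x)
        if hi - lo > 1:
            # A free slot exists between the neighbors: take the midpoint.
            self.labels.insert(k, (lo + hi) // 2)
            self.moves += 1
        else:
            # No free slot: spread all items evenly and count changed labels.
            n = len(self.items)
            old = self.labels[:k] + [None] + self.labels[k:]
            new = [(i * self.capacity) // n for i in range(n)]
            self.moves += sum(1 for a, b in zip(old, new) if a != b)
            self.labels = new
```

Inserting keys in ascending order quickly exhausts the gaps on one side and forces full rebuilds, which is the kind of cost a better list labeling structure is designed to avoid.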
Submitted 20 June, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
-
Telescoping Filter: A Practical Adaptive Filter
Authors:
David J. Lee,
Samuel McCauley,
Shikha Singh,
Max Stein
Abstract:
Filters are fast, small and approximate set membership data structures. They are often used to filter out expensive accesses to a remote set S for negative queries (that is, a query x not in S). Filters have one-sided errors: on a negative query, a filter may say "present" with a tunable false-positive probability of epsilon. Correctness is traded for space: filters only use log(1/epsilon) + O(1) bits per element.
The false-positive guarantees of most filters, however, hold only for a single query. In particular, if x is a false positive of a filter, a subsequent query to x is a false positive with probability 1, not epsilon. With this in mind, recent work has introduced the notion of an adaptive filter. A filter is adaptive if each query has false-positive probability epsilon, regardless of what queries were made in the past. This requires "fixing" false positives as they occur.
Adaptive filters not only provide strong false positive guarantees in adversarial environments but also improve performance on practical query workloads by eliminating repeated false positives.
Existing work on adaptive filters falls into two categories. First, there are practical filters based on cuckoo filters that attempt to fix false positives heuristically, without meeting the adaptivity guarantee. Meanwhile, the broom filter is a very complex adaptive filter that meets the optimal theoretical bounds.
In this paper, we bridge this gap by designing a practical, provably adaptive filter: the telescoping adaptive filter. We provide theoretical false-positive and space guarantees of our filter, along with empirical results where we compare its false positive performance against state-of-the-art filters. We also test the throughput of our filters, showing that they achieve comparable performance to similar non-adaptive filters.
Submitted 6 July, 2021;
originally announced July 2021.
-
Support Optimality and Adaptive Cuckoo Filters
Authors:
Tsvi Kopelowitz,
Samuel McCauley,
Ely Porat
Abstract:
Filters (such as Bloom Filters) are data structures that speed up network routing and measurement operations by storing a compressed representation of a set. Filters are space efficient, but can make bounded one-sided errors: with tunable probability epsilon, they may report that a query element is stored in the filter when it is not. This is called a false positive. Recent research has focused on designing methods for dynamically adapting filters to false positives, reducing the number of false positives when some elements are queried repeatedly.
Ideally, an adaptive filter would incur a false positive with bounded probability epsilon for each new query element, and would incur o(epsilon) total false positives over all repeated queries to that element. We call such a filter support optimal.
In this paper we design a new Adaptive Cuckoo Filter and show that it is support optimal (up to additive logarithmic terms) over any n queries when storing a set of size n. Our filter is simple: fixing previous false positives requires a simple cuckoo operation, and the filter does not need to store any additional metadata. This data structure is the first practical data structure that is support optimal, and the first filter that does not require additional space to fix false positives.
We complement these bounds with experiments showing that our data structure is effective at fixing false positives on network traces, outperforming previous Adaptive Cuckoo Filters.
Finally, we investigate adversarial adaptivity, a stronger notion of adaptivity in which an adaptive adversary repeatedly queries the filter, using the result of previous queries to drive the false positive rate as high as possible. We prove a lower bound showing that a broad family of filters, including all known Adaptive Cuckoo Filters, can be forced by such an adversary to incur a large number of false positives.
Submitted 21 May, 2021;
originally announced May 2021.
-
Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing
Authors:
Samuel McCauley
Abstract:
Edit distance similarity search, also called approximate pattern matching, is a fundamental problem with widespread database applications. The goal of the problem is to preprocess $n$ strings of length $d$, to quickly answer queries $q$ of the form: if there is a database string within edit distance $r$ of $q$, return a database string within edit distance $cr$ of $q$. Previous approaches to this problem either rely on very large (superconstant) approximation ratios $c$, or very small search radii $r$. Outside of a narrow parameter range, these solutions are not competitive with trivially searching through all $n$ strings.
In this work we give a simple and easy-to-implement hash function that can quickly answer queries for a wide range of parameters. Specifically, our strategy can answer queries in time $\tilde{O}(d3^rn^{1/c})$. The best known practical results require $c \gg r$ to achieve any correctness guarantee; meanwhile, the best known theoretical results are very involved and difficult to implement, and require query time at least $24^r$. Our results significantly broaden the range of parameters for which we can achieve nontrivial bounds, while retaining the practicality of a locality-sensitive hash function.
We also show how to apply our ideas to the closely-related Approximate Nearest Neighbor problem for edit distance, obtaining similar time bounds.
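The trivial baseline this abstract compares against, scanning all $n$ strings, can be sketched as follows. This is not the paper's hash function. The function names `edit_distance` and `linear_scan` are hypothetical, and the distance is computed with the standard dynamic program for Levenshtein distance.

```python
from typing import List, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def linear_scan(database: List[str], q: str, r: int) -> Optional[str]:
    """Return the first database string within edit distance r of q, if any."""
    for s in database:
        if edit_distance(s, q) <= r:
            return s
    return None
```

With $n$ strings of length $d$, this scan performs $n$ distance computations of cost $O(d^2)$ each, which is the cost that faster query algorithms aim to beat.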
Submitted 8 July, 2020; v1 submitted 2 July, 2019;
originally announced July 2019.
-
Efficient Rational Proofs with Strong Utility-Gap Guarantees
Authors:
Jing Chen,
Samuel McCauley,
Shikha Singh
Abstract:
As modern computing moves towards smaller devices and powerful cloud platforms, more and more computation is being delegated to powerful service providers. Interactive proofs are a widely-used model to design efficient protocols for verifiable computation delegation. Rational proofs are payment-based interactive proofs. The payments are designed to incentivize the provers to give correct answers. If the provers misreport the answer then they incur a payment loss of at least 1/u, where u is the utility gap of the protocol.
In this work, we tightly characterize the power of rational proofs that are super efficient, that is, require only logarithmic time and communication for verification. We also characterize the power of single-round rational protocols that require only logarithmic space and randomness for verification. Our protocols have strong (that is, polynomial, logarithmic, and even constant) utility gap. Finally, we show when and how rational protocols can be converted to give the completeness and soundness guarantees of classical interactive proofs.
Submitted 12 September, 2018; v1 submitted 3 July, 2018;
originally announced July 2018.
-
Adaptive MapReduce Similarity Joins
Authors:
Samuel McCauley,
Francesco Silvestri
Abstract:
Similarity joins are a fundamental database operation. Given data sets S and R, the goal of a similarity join is to find all points x in S and y in R with distance at most r. Recent research has investigated how locality-sensitive hashing (LSH) can be used for similarity join, and in particular two recent lines of work have made exciting progress on LSH-based join performance. Hu, Tao, and Yi (PODS 17) investigated joins in a massively parallel setting, showing strong results that adapt to the size of the output. Meanwhile, Ahle, Aumüller, and Pagh (SODA 17) showed a sequential algorithm that adapts to the structure of the data, matching classic bounds in the worst case but improving them significantly on more structured data. We show that this adaptive strategy can be adapted to the parallel setting, combining the advantages of these approaches. In particular, we show that a simple modification to Hu et al.'s algorithm achieves bounds that depend on the density of points in the dataset as well as the total size of the output. Our algorithm uses no extra parameters over other LSH approaches (in particular, its execution does not depend on the structure of the dataset), and is likely to be efficient in practice.
Submitted 16 April, 2018;
originally announced April 2018.
-
Set Similarity Search for Skewed Data
Authors:
Samuel McCauley,
Jesper W. Mikkelsen,
Rasmus Pagh
Abstract:
Set similarity join, as well as the corresponding indexing problem set similarity search, are fundamental primitives for managing noisy or uncertain data. For example, these primitives can be used in data cleaning to identify different representations of the same object. In many cases one can represent an object as a sparse 0-1 vector, or equivalently as the set of nonzero entries in such a vector. A set similarity join can then be used to identify those pairs that have an exceptionally large dot product (or intersection, when viewed as sets). We choose to focus on identifying vectors with large Pearson correlation, but results extend to other similarity measures. In particular, we consider the indexing problem of identifying correlated vectors in a set S of vectors sampled from {0,1}^d. Given a query vector y and a parameter alpha in (0,1), we need to search for an alpha-correlated vector x in a data structure representing the vectors of S. This kind of similarity search has been intensely studied in worst-case (non-random data) settings.
Existing theoretically well-founded methods for set similarity search are often inferior to heuristics that take advantage of skew in the data distribution, i.e., widely differing frequencies of 1s across the d dimensions. The main contribution of this paper is to analyze the set similarity problem under a random data model that reflects the kind of skewed data distributions seen in practice, allowing theoretical results much stronger than what is possible in worst-case settings. Our indexing data structure is a recursive, data-dependent partitioning of vectors inspired by recent advances in set similarity search. Previous data-dependent methods do not seem to allow us to exploit skew in item frequencies, so we believe that our work sheds further light on the power of data dependence.
Submitted 9 April, 2018;
originally announced April 2018.
-
Bloom Filters, Adaptivity, and the Dictionary Problem
Authors:
Michael A. Bender,
Martin Farach-Colton,
Mayank Goswami,
Rob Johnson,
Samuel McCauley,
Shikha Singh
Abstract:
The Bloom filter---or, more generally, an approximate membership query data structure (AMQ)---maintains a compact, probabilistic representation of a set S of keys from a universe U. An AMQ supports lookups, inserts, and (for some AMQs) deletes. A query for an x in S is guaranteed to return "present." A query for x not in S returns "absent" with probability at least 1-epsilon, where epsilon is a tunable false positive probability. If a query returns "present," but x is not in S, then x is a false positive of the AMQ. Because AMQs have a nonzero probability of false-positives, they require far less space than explicit set representations.
AMQs are widely used to speed up dictionaries that are stored remotely (e.g., on disk/across a network). Most AMQs offer weak guarantees on the number of false positives they will return on a sequence of queries. The false-positive probability of epsilon holds only for a single query. It is easy for an adversary to drive an AMQ's false-positive rate towards 1 by simply repeating false positives.
This paper shows what it takes to get strong guarantees on the number of false positives. We say that an AMQ is adaptive if it guarantees a false-positive probability of epsilon for every query, regardless of answers to previous queries. First, we prove that it is impossible to build a small adaptive AMQ, even when the AMQ is immediately told whenever it returns a false positive. We then show how to build an adaptive AMQ that partitions its state into a small local component and a larger remote component. In addition to being adaptive, the local component of our AMQ dominates existing AMQs in all regards. It uses optimal space up to lower-order terms and supports queries and updates in worst-case constant time, with high probability. Thus, we show that adaptivity has no cost.
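The non-adaptive behavior this abstract describes can be seen in a minimal Bloom filter sketch. This is not the adaptive AMQ from the paper. The class name `BloomFilter`, its parameters, and the use of salted SHA-256 digests as hash functions are assumptions made for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Plain (non-adaptive) Bloom filter over string keys."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p] = 1

    def query(self, key: str) -> bool:
        """True means "present" (possibly a false positive); False means absent."""
        return all(self.bits[p] for p in self._positions(key))
```

Because `query` never changes the filter's state, a key that is a false positive once stays a false positive on every repeat. That is the weakness an adversary exploits by repeating queries, and what an adaptive AMQ must repair.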
Submitted 26 August, 2018; v1 submitted 5 November, 2017;
originally announced November 2017.
-
Non-Cooperative Rational Interactive Proofs
Authors:
Jing Chen,
Samuel McCauley,
Shikha Singh
Abstract:
Interactive-proof games model the scenario where an honest party interacts with powerful but strategic provers, to elicit from them the correct answer to a computational question. Interactive proofs are increasingly used as a framework to design protocols for computation outsourcing.
Existing interactive-proof games largely fall into two categories: either as games of cooperation such as multi-prover interactive proofs and cooperative rational proofs, where the provers work together as a team; or as games of conflict such as refereed games, where the provers directly compete with each other in a zero-sum game. Neither of these extremes truly captures the strategic nature of service providers in outsourcing applications. How to design and analyze non-cooperative interactive proofs is an important open problem.
In this paper, we introduce a mechanism-design approach to define a multi-prover interactive-proof model in which the provers are rational and non-cooperative---they act to maximize their expected utility given others' strategies. We define a strong notion of backwards induction as our solution concept to analyze the resulting extensive-form game with imperfect information.
Our protocols provide utility gap guarantees, which are analogous to soundness gap in classic interactive proofs. At a high level, a utility gap of u means that the protocol is robust against provers that may not care about a utility loss of 1/u.
We fully characterize the complexity of our proof system under different utility gap guarantees. For example, we show that with a polynomial utility gap, the power of non-cooperative rational interactive proofs is exactly P^NEXP.
Submitted 11 August, 2021; v1 submitted 1 August, 2017;
originally announced August 2017.
-
Rational Proofs with Multiple Provers
Authors:
Jing Chen,
Samuel McCauley,
Shikha Singh
Abstract:
Interactive proofs (IP) model a world where a verifier delegates computation to an untrustworthy prover, verifying the prover's claims before accepting them. IP protocols have applications in areas such as verifiable computation outsourcing, computation delegation, and cloud computing. In these applications, the verifier may pay the prover based on the quality of his work. Rational interactive proofs (RIP), introduced by Azar and Micali (2012), are an interactive-proof system with payments, in which the prover is rational rather than untrustworthy---he may lie, but only to increase his payment. Rational proofs leverage the provers' rationality to obtain simple and efficient protocols. Azar and Micali show that RIP=IP(=PSPACE). They leave the question of whether multiple provers are more powerful than a single prover for rational and classical proofs as an open problem.
In this paper, we introduce multi-prover rational interactive proofs (MRIP). Here, a verifier cross-checks the provers' answers with each other and pays them according to the messages exchanged. The provers are cooperative and maximize their total expected payment if and only if the verifier learns the correct answer to the problem. We further refine the model of MRIP to incorporate utility gap, which is the loss in payment suffered by provers who mislead the verifier to the wrong answer.
We define the class of MRIP protocols with constant, noticeable and negligible utility gaps. We give tight characterizations for all three MRIP classes. We show that under standard complexity-theoretic assumptions, MRIP is more powerful than both RIP and MIP; and this is true even when the utility gap is required to be constant. Furthermore, the full power of each MRIP class can be achieved using only two provers and three rounds. (A preliminary version of this paper appeared at ITCS 2016. This is the full version that contains new results.)
Submitted 11 November, 2017; v1 submitted 30 April, 2015;
originally announced April 2015.
-
Run Generation Revisited: What Goes Up May or May Not Come Down
Authors:
Michael A. Bender,
Samuel McCauley,
Andrew McGregor,
Shikha Singh,
Hoa T. Vu
Abstract:
In this paper, we revisit the classic problem of run generation. Run generation is the first phase of external-memory sorting, where the objective is to scan through the data, reorder elements using a small buffer of size M, and output runs (contiguously sorted chunks of elements) that are as long as possible.
We develop algorithms for minimizing the total number of runs (or equivalently, maximizing the average run length) when the runs are allowed to be sorted or reverse sorted. We study the problem in the online setting, both with and without resource augmentation, and in the offline setting.
(1) We analyze alternating-up-down replacement selection (runs alternate between sorted and reverse sorted), which was studied by Knuth as far back as 1963. We show that this simple policy is asymptotically optimal. Specifically, we show that alternating-up-down replacement selection is 2-competitive and no deterministic online algorithm can perform better.
(2) We give online algorithms having smaller competitive ratios with resource augmentation. Specifically, we exhibit a deterministic algorithm that, when given a buffer of size 4M, is able to match or beat any optimal algorithm having a buffer of size M. Furthermore, we present a randomized online algorithm which is 7/4-competitive when given a buffer twice that of the optimal.
(3) We demonstrate that performance can also be improved with a small amount of foresight. We give an algorithm, which is 3/2-competitive, with foreknowledge of the next 3M elements of the input stream. For the extreme case where all future elements are known, we design a PTAS for computing the optimal strategy a run generation algorithm must follow.
(4) Finally, we present algorithms tailored for nearly sorted inputs which are guaranteed to have optimal solutions with sufficiently long runs.
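Run generation with a buffer of size M can be illustrated with classic replacement selection, sketched below. This is the textbook up-only variant, not the alternating-up-down policy analyzed in the paper. The function name `replacement_selection` and the use of a binary heap as the buffer are assumptions for illustration.

```python
import heapq
from typing import Iterable, List

_END = object()  # sentinel marking the end of the input stream


def replacement_selection(stream: Iterable[int], M: int) -> List[List[int]]:
    """Split `stream` into sorted runs using a buffer of at most M elements."""
    if M < 1:
        raise ValueError("buffer size M must be at least 1")
    it = iter(stream)
    heap: List[int] = []
    for x in it:  # fill the buffer with the first M elements
        heap.append(x)
        if len(heap) == M:
            break
    heapq.heapify(heap)

    runs: List[List[int]] = []
    current: List[int] = []
    pending: List[int] = []  # elements too small for the current run
    while heap:
        smallest = heapq.heappop(heap)
        current.append(smallest)
        nxt = next(it, _END)
        if nxt is not _END:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)  # can still extend the current run
            else:
                pending.append(nxt)        # must wait for the next run
        if not heap:
            # Current run is finished; pending elements seed the next one.
            runs.append(current)
            current = []
            heap, pending = pending, []
            heapq.heapify(heap)
    return runs
```

The heap and the pending list together never hold more than M elements, so the sketch respects the buffer bound. On already-sorted input it emits a single run, while reverse-sorted input yields runs of length M.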
Submitted 24 April, 2015;
originally announced April 2015.