Search | arXiv e-print repository

Adaptive Quotient Filters

Authors: Richard Wen, Hunter McCoy, David Tench, Guido Tagliavini, Michael A. Bender, Alex Conway, Martin Farach-Colton, Rob Johnson, Prashant Pandey

Abstract: Adaptive filters, such as telesco** and adaptive cuckoo filters, update their representation upon detecting a false positive to avoid repeating the same error in the future. Adaptive filters require an auxiliary structure, typically much larger than the main filter and often residing on slow storage, to facilitate adaptation. However, existing adaptive filters are not practical and have seen no… ▽ More Adaptive filters, such as telesco** and adaptive cuckoo filters, update their representation upon detecting a false positive to avoid repeating the same error in the future. Adaptive filters require an auxiliary structure, typically much larger than the main filter and often residing on slow storage, to facilitate adaptation. However, existing adaptive filters are not practical and have seen no adoption in real-world systems due to two main reasons. Firstly, they offer weak adaptivity guarantees, meaning that fixing a new false positive can cause a previously fixed false positive to come back. Secondly, the sub-optimal design of the auxiliary structure results in adaptivity overheads so substantial that they can actually diminish the overall system performance compared to a traditional filter. In this paper, we design and implement AdaptiveQF, the first practical adaptive filter with minimal adaptivity overhead and strong adaptivity guarantees, which means that the performance and false-positive guarantees continue to hold even for adversarial workloads. The AdaptiveQF is based on the state-of-the-art quotient filter design and preserves all the critical features of the quotient filter such as cache efficiency and mergeability. Furthermore, we employ a new auxiliary structure design which results in considerably low adaptivity overhead and makes the AdaptiveQF practical in real systems. △ Less

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.00807 [pdf, ps, other]

Nearly Optimal List Labeling

Authors: Michael A. Bender, Alex Conway, Martín Farach-Colton, Hanna Komlós, Michal Koucký, William Kuszmaul, Michael Saks

Abstract: The list-labeling problem captures the basic task of storing a dynamically changing set of up to $n$ elements in sorted order in an array of size $m = (1 + Θ(1))n$. The goal is to support insertions and deletions while moving around elements within the array as little as possible. Until recently, the best known upper bound stood at $O(\log^2 n)$ amortized cost. This bound, which was first establ… ▽ More The list-labeling problem captures the basic task of storing a dynamically changing set of up to $n$ elements in sorted order in an array of size $m = (1 + Θ(1))n$. The goal is to support insertions and deletions while moving around elements within the array as little as possible. Until recently, the best known upper bound stood at $O(\log^2 n)$ amortized cost. This bound, which was first established in 1981, was finally improved two years ago, when a randomized $O(\log^{3/2} n)$ expected-cost algorithm was discovered. The best randomized lower bound for this problem remains $Ω(\log n)$, and closing this gap is considered to be a major open problem in data structures. In this paper, we present the See-Saw Algorithm, a randomized list-labeling solution that achieves a nearly optimal bound of $O(\log n \operatorname{polyloglog} n)$ amortized expected cost. This bound is achieved despite at least three lower bounds showing that this type of result is impossible for large classes of solutions. △ Less

Submitted 1 May, 2024; originally announced May 2024.

Comments: 39 pages

arXiv:2404.16623 [pdf, other]

Layered List Labeling

Authors: Michael A. Bender, Alex Conway, Martin Farach-Colton, Hanna Komlos, William Kuszmaul

Abstract: The list-labeling problem is one of the most basic and well-studied algorithmic primitives in data structures, with an extensive literature spanning upper bounds, lower bounds, and data management applications. The classical algorithm for this problem, dating back to 1981, has amortized cost $O(\log^2 n)$. Subsequent work has led to improvements in three directions: \emph{low-latency} (worst-case)… ▽ More The list-labeling problem is one of the most basic and well-studied algorithmic primitives in data structures, with an extensive literature spanning upper bounds, lower bounds, and data management applications. The classical algorithm for this problem, dating back to 1981, has amortized cost $O(\log^2 n)$. Subsequent work has led to improvements in three directions: \emph{low-latency} (worst-case) bounds; \emph{high-throughput} (expected) bounds; and (adaptive) bounds for \emph{important workloads}. Perhaps surprisingly, these three directions of research have remained almost entirely disjoint -- this is because, so far, the techniques that allow for progress in one direction have forced worsening bounds in the others. Thus there would appear to be a tension between worst-case, adaptive, and expected bounds. List labeling has been proposed for use in databases at least as early as PODS'99, but a database needs good throughput, response time, and needs to adapt to common workloads (e.g., bulk loads), and no current list-labeling algorithm achieve good bounds for all three. We show that this tension is not fundamental. In fact, with the help of new data-structural techniques, one can actually \emph{combine} any three list-labeling solutions in order to cherry-pick the best worst-case, adaptive, and expected bounds from each of them. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: PODS 2024, 19 pages, 4 figures

arXiv:2401.08858 [pdf, ps, other]

File System Aging

Authors: Alex Conway, Ainesh Bakshi, Arghya Bhattacharya, Rory Bennett, Yizheng Jiao, Eric Knorr, Yang Zhan, Michael A. Bender, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, Martin Farach-Colton

Abstract: File systems must allocate space for files without knowing what will be added or removed in the future. Over the life of a file system, this may cause suboptimal file placement decisions that eventually lead to slower performance, or aging. Conventional wisdom suggests that file system aging is a solved problem in the common case; heuristics to avoid aging, such as colocating related files and dat… ▽ More File systems must allocate space for files without knowing what will be added or removed in the future. Over the life of a file system, this may cause suboptimal file placement decisions that eventually lead to slower performance, or aging. Conventional wisdom suggests that file system aging is a solved problem in the common case; heuristics to avoid aging, such as colocating related files and data blocks, are effective until a storage device fills up, at which point space pressure exacerbates fragmentation-based aging. However, this article describes both realistic and synthetic workloads that can cause these heuristics to fail, inducing large performance declines due to aging, even when the storage device is nearly empty. We argue that these slowdowns are caused by poor layout. We demonstrate a correlation between the read performance of a directory scan and the locality within a file system's access patterns, using a dynamic layout score. We complement these results with microbenchmarks that show that space pressure can cause a substantial amount of inter-file and intra-file fragmentation. However, our results suggest that the effect of free-space fragmentation on read performance is best described as accelerating the file system aging process. The effect on write performance is non-existent in some cases, and, in most cases, an order of magnitude smaller than the read degradation from fragmentation caused by normal usage. In short, many file systems are exquisitely prone to read aging after a variety of write patterns. We show, however, that aging is not inevitable. BetrFS, a file system based on write-optimized dictionaries, exhibits almost no aging in our experiments. We present a framework for understanding and predicting aging, and identify the key features of BetrFS that avoid aging. △ Less

Submitted 16 January, 2024; originally announced January 2024.

Comments: 36 pages, 12 figures. Article is an extension of Conway et al. FAST 17. (see https://www.usenix.org/conference/fast17/technical-sessions/presentation/conway) and Conway et al. HotStorage 19. (see https://www.usenix.org/conference/hotstorage19/presentation/conway)

ACM Class: H.3.2; D.4.3; D.4.2; D.4.8; E.1; E.5; H.3.4

arXiv:2306.12682 [pdf, other]

Counting occurrences of patterns in permutations

Authors: Andrew R Conway, Anthony J Guttmann

Abstract: We develop a new, powerful method for counting elements in a multiset. As a first application, we use this algorithm to study the number of occurrences of patterns in a permutation. For patterns of length 3 there are two Wilf classes, and the general behaviour of these is reasonably well-known. We slightly extend some of the known results in that case, and exhaustively study the case of patterns o… ▽ More We develop a new, powerful method for counting elements in a multiset. As a first application, we use this algorithm to study the number of occurrences of patterns in a permutation. For patterns of length 3 there are two Wilf classes, and the general behaviour of these is reasonably well-known. We slightly extend some of the known results in that case, and exhaustively study the case of patterns of length 4, about which there is little previous knowledge. For such patterns, there are seven Wilf classes, and based on extensive enumerations and careful series analysis, we have conjectured the asymptotic behaviour for all classes. △ Less

Submitted 4 March, 2024; v1 submitted 22 June, 2023; originally announced June 2023.

Comments: 32 pages. Updated references from previous version. Removal on earlier discussion of Stieltjes sequences, which was incomplete and confusing

MSC Class: 05A15 (Primary) 05A05; 06-04 (Secondary)

arXiv:2306.10062 [pdf, other]

Revealing the structure of language model capabilities

Authors: Ryan Burnell, Han Hao, Andrew R. A. Conway, Jose Hernandez Orallo

Abstract: Building a theoretical understanding of the capabilities of large language models (LLMs) is vital for our ability to predict and explain the behavior of these systems. Here, we investigate the structure of LLM capabilities by extracting latent capabilities from patterns of individual differences across a varied population of LLMs. Using a combination of Bayesian and frequentist factor analysis, we… ▽ More Building a theoretical understanding of the capabilities of large language models (LLMs) is vital for our ability to predict and explain the behavior of these systems. Here, we investigate the structure of LLM capabilities by extracting latent capabilities from patterns of individual differences across a varied population of LLMs. Using a combination of Bayesian and frequentist factor analysis, we analyzed data from 29 different LLMs across 27 cognitive tasks. We found evidence that LLM capabilities are not monolithic. Instead, they are better explained by three well-delineated factors that represent reasoning, comprehension and core language modeling. Moreover, we found that these three factors can explain a high proportion of the variance in model performance. These results reveal a consistent structure in the capabilities of different LLMs and demonstrate the multifaceted nature of these capabilities. We also found that the three abilities show different relationships to model properties such as model size and instruction tuning. These patterns help refine our understanding of scaling laws and indicate that changes to a model that improve one ability might simultaneously impair others. Based on these findings, we suggest that benchmarks could be streamlined by focusing on tasks that tap into each broad model ability. △ Less

Submitted 14 June, 2023; originally announced June 2023.

Comments: 10 pages, 3 figures + references and appendices, for data and analysis code see https://github.com/RyanBurnell/revealing-LLM-capabilities

arXiv:2211.06694 [pdf, other]

Pain Detection in Masked Faces during Procedural Sedation

Authors: Y. Zarghami, S. Mafeld, A. Conway, B. Taati

Abstract: Pain monitoring is essential to the quality of care for patients undergoing a medical procedure with sedation. An automated mechanism for detecting pain could improve sedation dose titration. Previous studies on facial pain detection have shown the viability of computer vision methods in detecting pain in unoccluded faces. However, the faces of patients undergoing procedures are often partially oc… ▽ More Pain monitoring is essential to the quality of care for patients undergoing a medical procedure with sedation. An automated mechanism for detecting pain could improve sedation dose titration. Previous studies on facial pain detection have shown the viability of computer vision methods in detecting pain in unoccluded faces. However, the faces of patients undergoing procedures are often partially occluded by medical devices and face masks. A previous preliminary study on pain detection on artificially occluded faces has shown a feasible approach to detect pain from a narrow band around the eyes. This study has collected video data from masked faces of 14 patients undergoing procedures in an interventional radiology department and has trained a deep learning model using this dataset. The model was able to detect expressions of pain accurately and, after causal temporal smoothing, achieved an average precision (AP) of 0.72 and an area under the receiver operating characteristic curve (AUC) of 0.82. These results outperform baseline models and show viability of computer vision approaches for pain detection of masked faces during procedural sedation. Cross-dataset performance is also examined when a model is trained on a publicly available dataset and tested on the sedation videos. The ways in which pain expressions differ in the two datasets are qualitatively examined. △ Less

Submitted 12 November, 2022; originally announced November 2022.

Comments: Accepted for presentation at the FG2023 workshop on Artificial Intelligence for Automated Human Health-care and Monitoring: 6 pages, 10 figures

arXiv:2210.04068 [pdf, other]

IcebergHT: High Performance PMEM Hash Tables Through Stability and Low Associativity

Authors: Prashant Pandey, Michael A. Bender, Alex Conway, Martín Farach-Colton, William Kuszmaul, Guido Tagliavini, Rob Johnson

Abstract: Modern hash table designs strive to minimize space while maximizing speed. The most important factor in speed is the number of cache lines accessed during updates and queries. This is especially important on PMEM, which is slower than DRAM and in which writes are more expensive than reads. This paper proposes two stronger design objectives: stability and low-associativity. A stable hash table do… ▽ More Modern hash table designs strive to minimize space while maximizing speed. The most important factor in speed is the number of cache lines accessed during updates and queries. This is especially important on PMEM, which is slower than DRAM and in which writes are more expensive than reads. This paper proposes two stronger design objectives: stability and low-associativity. A stable hash table doesn't move items around, and a hash table has low associativity if there are only a few locations where an item can be stored. Low associativity ensures that queries need to examine only a few memory locations, and stability ensures that insertions write to very few cache lines. Stability also simplifies scaling and crash safety. We present IcebergHT, a fast, crash-safe, concurrent, and space-efficient hash table for PMEM based on the design principles of stability and low associativity. IcebergHT combines in-memory metadata with a new hashing technique, iceberg hashing, that is (1) space efficient, (2) stable, and (3) supports low associativity. In contrast, existing hash-tables either modify numerous cache lines during insertions (e.g. cuckoo hashing), access numerous cache lines during queries (e.g. linear probing), or waste space (e.g. chaining). Moreover, the combination of (1)-(3) yields several emergent benefits: IcebergHT scales better than other hash tables, supports crash-safety, and has excellent performance on PMEM (where writes are particularly expensive). △ Less

Submitted 11 October, 2022; v1 submitted 8 October, 2022; originally announced October 2022.

arXiv:2203.02763 [pdf, ps, other]

Online List Labeling: Breaking the $\log^2n$ Barrier

Authors: Michael A. Bender, Alex Conway, Martín Farach-Colton, Hanna Komlós, William Kuszmaul, Nicole Wein

Abstract: The online list labeling problem is an algorithmic primitive with a large literature of upper bounds, lower bounds, and applications. The goal is to store a dynamically-changing set of $n$ items in an array of $m$ slots, while maintaining the invariant that the items appear in sorted order, and while minimizing the relabeling cost, defined to be the number of items that are moved per insertion/del… ▽ More The online list labeling problem is an algorithmic primitive with a large literature of upper bounds, lower bounds, and applications. The goal is to store a dynamically-changing set of $n$ items in an array of $m$ slots, while maintaining the invariant that the items appear in sorted order, and while minimizing the relabeling cost, defined to be the number of items that are moved per insertion/deletion. For the linear regime, where $m = (1 + Θ(1)) n$, an upper bound of $O(\log^2 n)$ on the relabeling cost has been known since 1981. A lower bound of $Ω(\log^2 n)$ is known for deterministic algorithms and for so-called smooth algorithms, but the best general lower bound remains $Ω(\log n)$. The central open question in the field is whether $O(\log^2 n)$ is optimal for all algorithms. In this paper, we give a randomized data structure that achieves an expected relabeling cost of $O(\log^{3/2} n)$ per operation. More generally, if $m = (1 + \varepsilon) n$ for $\varepsilon = O(1)$, the expected relabeling cost becomes $O(\varepsilon^{-1} \log^{3/2} n)$. Our solution is history independent, meaning that the state of the data structure is independent of the order in which items are inserted/deleted. For history-independent data structures, we also prove a matching lower bound: for all $ε$ between $1 / n^{1/3}$ and some sufficiently small positive constant, the optimal expected cost for history-independent list-labeling solutions is $Θ(\varepsilon^{-1}\log^{3/2} n)$. △ Less

Submitted 12 September, 2022; v1 submitted 5 March, 2022; originally announced March 2022.

Comments: Full version for FOCS 2022 camera ready

arXiv:2111.12800 [pdf, other]

Tiny Pointers

Authors: Michael A. Bender, Alex Conway, Martín Farach-Colton, William Kuszmaul, Guido Tagliavini

Abstract: This paper introduces a new data-structural object that we call the tiny pointer. In many applications, traditional $\log n $-bit pointers can be replaced with $o (\log n )$-bit tiny pointers at the cost of only a constant-factor time overhead. We develop a comprehensive theory of tiny pointers, and give optimal constructions for both fixed-size tiny pointers (i.e., settings in which all of the ti… ▽ More This paper introduces a new data-structural object that we call the tiny pointer. In many applications, traditional $\log n $-bit pointers can be replaced with $o (\log n )$-bit tiny pointers at the cost of only a constant-factor time overhead. We develop a comprehensive theory of tiny pointers, and give optimal constructions for both fixed-size tiny pointers (i.e., settings in which all of the tiny pointers must be the same size) and variable-size tiny pointers (i.e., settings in which the average tiny-pointer size must be small, but some tiny pointers can be larger). If a tiny pointer references an element in an array filled to load factor $1 - 1 / k$, then the optimal tiny-pointer size is $Θ(\log \log \log n + \log k) $ bits in the fixed-size case, and $ Θ(\log k) $ expected bits in the variable-size case. Our tiny-pointer constructions also require us to revisit several classic problems having to do with balls and bins; these results may be of independent interest. Using tiny pointers, we revisit five classic data-structure problems: the data-retrieval problem, succinct dynamic binary search trees, space-efficient stable dictionaries, space-efficient dictionaries with variable-size keys, and the internal-memory stash problem. These are all well-studied problems, and in each case tiny pointers allow for us to take a natural space-inefficient solution that uses pointers and make it space-efficient for free. △ Less

Submitted 24 November, 2021; originally announced November 2021.

arXiv:2109.04548 [pdf, other]

Iceberg Hashing: Optimizing Many Hash-Table Criteria at Once

Authors: Michael A. Bender, Alex Conway, Martín Farach-Colton, William Kuszmaul, Guido Tagliavini

Abstract: Despite being one of the oldest data structures in computer science, hash tables continue to be the focus of a great deal of both theoretical and empirical research. A central reason for this is that many of the fundamental properties that one desires from a hash table are difficult to achieve simultaneously; thus many variants offering different trade-offs have been proposed. This paper introdu… ▽ More Despite being one of the oldest data structures in computer science, hash tables continue to be the focus of a great deal of both theoretical and empirical research. A central reason for this is that many of the fundamental properties that one desires from a hash table are difficult to achieve simultaneously; thus many variants offering different trade-offs have been proposed. This paper introduces Iceberg hashing, a hash table that simultaneously offers the strongest known guarantees on a large number of core properties. Iceberg hashing supports constant-time operations while improving on the state of the art for space efficiency, cache efficiency, and low failure probability. Iceberg hashing is also the first hash table to support a load factor of up to $1 - o(1)$ while being stable, meaning that the position where an element is stored only ever changes when resizes occur. In fact, in the setting where keys are $Θ(\log n)$ bits, the space guarantees that Iceberg hashing offers, namely that it uses at most $\log \binom{|U|}{n} + O(n \log \log n)$ bits to store $n$ items from a universe $U$, matches a lower bound by Demaine et al. that applies to any stable hash table. Iceberg hashing introduces new general-purpose techniques for some of the most basic aspects of hash-table design. Notably, our indirection-free technique for dynamic resizing, which we call waterfall addressing, and our techniques for achieving stability and very-high probability guarantees, can be applied to any hash table that makes use of the front-yard/backyard paradigm for hash table design. △ Less

Submitted 22 October, 2023; v1 submitted 9 September, 2021; originally announced September 2021.

arXiv:2007.00854 [pdf, other]

doi 10.1007/978-3-030-60347-2_2

Random errors are not necessarily politically neutral

Authors: Michelle Blom, Andrew Conway, Peter J. Stuckey, Vanessa Teague, Damjan Vukcevic

Abstract: Errors are inevitable in the implementation of any complex process. Here we examine the effect of random errors on Single Transferable Vote (STV) elections, a common approach to deciding multi-seat elections. It is usually expected that random errors should have nearly equal effects on all candidates, and thus be fair. We find to the contrary that random errors can introduce systematic bias into e… ▽ More Errors are inevitable in the implementation of any complex process. Here we examine the effect of random errors on Single Transferable Vote (STV) elections, a common approach to deciding multi-seat elections. It is usually expected that random errors should have nearly equal effects on all candidates, and thus be fair. We find to the contrary that random errors can introduce systematic bias into election results. This is because, even if the errors are random, votes for different candidates occur in different patterns that are affected differently by random errors. In the STV context, the most important effect of random errors is to invalidate the ballot. This removes far more votes for those candidates whose supporters tend to list a lot of preferences, because their ballots are much more likely to be invalidated by random error. Different validity rules for different voting styles mean that errors are much more likely to penalise some types of votes than others. For close elections this systematic bias can change the result of the election. △ Less

Submitted 28 September, 2020; v1 submitted 1 July, 2020; originally announced July 2020.

Journal ref: Electronic Voting, E-Vote-ID 2020, Lecture Notes in Computer Science 12455 (2020) 19-35

arXiv:2004.00235 [pdf, other]

You can do RLAs for IRV

Authors: Michelle Blom, Andrew Conway, Dan King, Laurent Sandrolini, Philip B. Stark, Peter J. Stuckey, Vanessa Teague

Abstract: The City and County of San Francisco, CA, has used Instant Runoff Voting (IRV) for some elections since 2004. This report describes the first ever process pilot of Risk Limiting Audits for IRV, for the San Francisco District Attorney's race in November, 2019. We found that the vote-by-mail outcome could be efficiently audited to well under the 0.05 risk limit given a sample of only 200 ballots. Al… ▽ More The City and County of San Francisco, CA, has used Instant Runoff Voting (IRV) for some elections since 2004. This report describes the first ever process pilot of Risk Limiting Audits for IRV, for the San Francisco District Attorney's race in November, 2019. We found that the vote-by-mail outcome could be efficiently audited to well under the 0.05 risk limit given a sample of only 200 ballots. All the software we developed for the pilot is open source. △ Less

Submitted 1 April, 2020; originally announced April 2020.

arXiv:1901.03108 [pdf, ps, other]

Auditing Indian Elections

Authors: Vishal Mohanty, Nicholas Akinyokun, Andrew Conway, Chris Culnane, Philip B. Stark, Vanessa Teague

Abstract: Indian Electronic Voting Machines (EVMs) will be fitted with printers that produce Voter-Verifiable Paper Audit Trails (VVPATs) in time for the 2019 general election. VVPATs provide evidence that each vote was recorded as the voter intended, without having to trust the perfection or security of the EVMs. However, confidence in election results requires more: VVPATs must be preserved inviolate an… ▽ More Indian Electronic Voting Machines (EVMs) will be fitted with printers that produce Voter-Verifiable Paper Audit Trails (VVPATs) in time for the 2019 general election. VVPATs provide evidence that each vote was recorded as the voter intended, without having to trust the perfection or security of the EVMs. However, confidence in election results requires more: VVPATs must be preserved inviolate and then actually used to check the reported election result in a trustworthy way that the public can verify. A full manual tally from the VVPATs could be prohibitively expensive and time-consuming; moreover, it is difficult for the public to determine whether a full hand count was conducted accurately. We show how Risk-Limiting Audits (RLAs) could provide high confidence in Indian election results. Compared to full hand recounts, RLAs typically require manually inspecting far fewer VVPATs when the outcome is correct, and are much easier for the electorate to observe in adequate detail to determine whether the result is trustworthy. △ Less

Submitted 25 January, 2019; v1 submitted 10 January, 2019; originally announced January 2019.

arXiv:1807.01804 [pdf, other]

Optimal Ball Recycling

Authors: Michael A. Bender, Jake Christensen, Alex Conway, Martín Farach-Colton, Rob Johnson, Meng-Tsung Tsai

Abstract: Balls-and-bins games have been a wildly successful tool for modeling load balancing problems. In this paper, we study a new scenario, which we call the ball recycling game, defined as follows: Throw m balls into n bins i.i.d. according to a given probability distribution p. Then, at each time step, pick a non-empty bin and recycle its balls: take the balls from the selected bin and re-throw them… ▽ More Balls-and-bins games have been a wildly successful tool for modeling load balancing problems. In this paper, we study a new scenario, which we call the ball recycling game, defined as follows: Throw m balls into n bins i.i.d. according to a given probability distribution p. Then, at each time step, pick a non-empty bin and recycle its balls: take the balls from the selected bin and re-throw them according to p. This balls-and-bins game closely models memory-access heuristics in databases. The goal is to have a bin-picking method that maximizes the recycling rate, defined to be the expected number of balls recycled per step in the stationary distribution. We study two natural strategies for ball recycling: Fullest Bin, which greedily picks the bin with the maximum number of balls, and Random Ball, which picks a ball at random and recycles its bin. We show that for general p, Random Ball is constant-optimal, whereas Fullest Bin can be pessimal. However, when p = u, the uniform distribution, Fullest Bin is optimal to within an additive constant. △ Less

Submitted 2 November, 2018; v1 submitted 4 July, 2018; originally announced July 2018.

arXiv:1805.09423 [pdf, ps, other]

Optimal Hashing in External Memory

Authors: Alex Conway, Martin Farach-Colton, Philip Shilane

Abstract: Hash tables are a ubiquitous class of dictionary data structures. However, standard hash table implementations do not translate well into the external memory model, because they do not incorporate locality for insertions. Iacono and Patracsu established an update/query tradeoff curve for external hash tables: a hash table that performs insertions in $O(λ/B)$ amortized IOs requires $Ω(\log_λN)$ e… ▽ More Hash tables are a ubiquitous class of dictionary data structures. However, standard hash table implementations do not translate well into the external memory model, because they do not incorporate locality for insertions. Iacono and Patracsu established an update/query tradeoff curve for external hash tables: a hash table that performs insertions in $O(λ/B)$ amortized IOs requires $Ω(\log_λN)$ expected IOs for queries, where $N$ is the number of items that can be stored in the data structure, $B$ is the size of a memory transfer, $M$ is the size of memory, and $λ$ is a tuning parameter. They provide a hashing data structure that meets this curve for $λ$ that is $Ω(\log\log M + \log_M N)$. Their data structure, which we call an \defn{IP hash table}, is complicated and, to the best of our knowledge, has not been implemented. In this paper, we present a new and much simpler optimal external memory hash table, the \defn{Bundle of Arrays Hash Table} (BOA). BOAs are based on size-tiered LSMs, a well-studied data structure, and are almost as easy to implement. The BOA is optimal for a narrower range of $λ$. However, the simplicity of BOAs allows them to be readily modified to achieve the following results: \begin{itemize} \item A new external memory data structure, the \defn{Bundle of Trees Hash Table} (BOT), that matches the performance of the IP hash table, while retaining some of the simplicity of the BOAs. \item The \defn{cache-oblivious Bundle of Trees Hash Table} (COBOT), the first cache-oblivious hash table. This data structure matches the optimality of BOTs and IP hash tables over the same range of $λ$. \end{itemize} △ Less

Submitted 23 May, 2018; originally announced May 2018.

arXiv:1611.02015 [pdf, other]

doi 10.1145/3014812.3014837

An analysis of New South Wales electronic vote counting

Authors: Andrew Conway, Michelle Blom, Lee Naish, Vanessa Teague

Abstract: We re-examine the 2012 local government elections in New South Wales, Australia. The count was conducted electronically using a randomised form of the Single Transferable Vote (STV). It was already well known that randomness does make a difference to outcomes in some seats. We describe how the process could be amended to include a demonstration that the randomness was chosen fairly. Second, and… ▽ More We re-examine the 2012 local government elections in New South Wales, Australia. The count was conducted electronically using a randomised form of the Single Transferable Vote (STV). It was already well known that randomness does make a difference to outcomes in some seats. We describe how the process could be amended to include a demonstration that the randomness was chosen fairly. Second, and more significantly, we found an error in the official counting software, which caused a mistake in the count in the council of Griffith, where candidate Rina Mercuri narrowly missed out on a seat. We believe the software error incorrectly decreased Mercuri's winning probability to about 10%---according to our count she should have won with 91% probability. The NSW Electoral Commission (NSWEC) corrected their code when we pointed out the error, and made their own announcement. We have since investigated the 2016 local government election (held after correcting the error above) and found two new errors. We notified the NSWEC about these errors a few days after they posted the results. △ Less

Submitted 7 November, 2016; originally announced November 2016.

arXiv:1610.09806 [pdf, other]

The design of efficient algorithms for enumeration

Authors: Andrew R. Conway

Abstract: Many algorithms have been developed for enumerating various combinatorial objects in time exponentially less than the number of objects. Two common classes of algorithms are dynamic programming and the transfer matrix method. This paper covers the design and implementation of such algorithms. A host of general techniques for improving efficiency are described. Three quite different example probl… ▽ More Many algorithms have been developed for enumerating various combinatorial objects in time exponentially less than the number of objects. Two common classes of algorithms are dynamic programming and the transfer matrix method. This paper covers the design and implementation of such algorithms. A host of general techniques for improving efficiency are described. Three quite different example problems are used for examples: 1324 pattern avoiding permutations, three-dimensional polycubes, and two-dimensional directed animals. For those new to the field, this paper is designed to be an introduction to many of the tricks for producing efficient enumeration algorithms. For those more experienced, it will hopefully help them understand the interrelationship and implications of a variety of techniques, many or most of which will be familiar. The author certainly found his understanding improved as a result of writing this paper. △ Less

Submitted 15 May, 2017; v1 submitted 31 October, 2016; originally announced October 2016.

arXiv:1610.00127 [pdf, ps, other]

Auditing Australian Senate Ballots

Authors: Berj Chilingirian, Zara Perumal, Ronald L. Rivest, Grahame Bowland, Andrew Conway, Philip B. Stark, Michelle Blom, Chris Culnane, Vanessa Teague

Abstract: We explain why the Australian Electoral Commission should perform an audit of the paper Senate ballots against the published preference data files. We suggest four different post-election audit methods appropriate for Australian Senate elections. We have developed prototype code for all of them and tested it on preference data from the 2016 election. We explain why the Australian Electoral Commission should perform an audit of the paper Senate ballots against the published preference data files. We suggest four different post-election audit methods appropriate for Australian Senate elections. We have developed prototype code for all of them and tested it on preference data from the 2016 election. △ Less

Submitted 1 October, 2016; originally announced October 2016.

arXiv:1405.6802 [pdf, other]

On the growth rate of 1324-avoiding permutations

Authors: Andrew R Conway, Anthony J Guttmann

Abstract: We give an improved algorithm for counting the number of $1324$-avoiding permutations, resulting in 5 further terms of the generating function. We analyse the known coefficients and find compelling evidence that unlike other classical length-4 pattern-avoiding permutations, the generating function in this case does not have an algebraic singularity. Rather, the number of 1324-avoiding permutations… ▽ More We give an improved algorithm for counting the number of $1324$-avoiding permutations, resulting in 5 further terms of the generating function. We analyse the known coefficients and find compelling evidence that unlike other classical length-4 pattern-avoiding permutations, the generating function in this case does not have an algebraic singularity. Rather, the number of 1324-avoiding permutations of length $n$ behaves as $$B\cdot μ^n \cdot μ_1^{n^σ} \cdot n^g.$$ We estimate $μ=11.60 \pm 0.01,$ $σ=1/2,$ $μ_1 = 0.0398 \pm 0.0010,$ $g = -1.1 \pm 0.2$ and $B =9.5 \pm 1.0.$ △ Less

Submitted 27 May, 2014; originally announced May 2014.

Comments: 20 pages, 10 figures

MSC Class: 05A05; 05A15; 05A16

Showing 1–20 of 20 results for author: Conway, A