-
Online sorting and online TSP: randomized, stochastic, and high-dimensional
Authors:
Mikkel Abrahamsen,
Ioana O. Bercea,
Lorenzo Beretta,
Jonas Klausen,
László Kozma
Abstract:
In the online sorting problem, $n$ items are revealed one by one and have to be placed (immediately and irrevocably) into empty cells of a size-$n$ array. The goal is to minimize the sum of absolute differences between items in consecutive cells. This natural problem was recently introduced by Aamand, Abrahamsen, Beretta, and Kleist (SODA 2023) as a tool in their study of online geometric packing problems. They showed that when the items are reals from the interval $[0,1]$ a competitive ratio of $O(\sqrt{n})$ is achievable, and no deterministic algorithm can improve this ratio asymptotically.
In this paper, we extend and generalize the study of online sorting in three directions:
- randomized: we settle the open question of Aamand et al. by showing that the $O(\sqrt{n})$ competitive ratio for the online sorting of reals cannot be improved even with the use of randomness;
- stochastic: we consider inputs consisting of $n$ samples drawn uniformly at random from an interval, and give an algorithm with an improved competitive ratio of $\widetilde{O}(n^{1/4})$. The result reveals connections between online sorting and the design of efficient hash tables;
- high-dimensional: we show that $\widetilde{O}(\sqrt{n})$-competitive online sorting is possible even for items from $\mathbb{R}^d$, for arbitrary fixed $d$, in an adversarial model. This can be viewed as an online variant of the classical TSP problem where tasks (cities to visit) are revealed one by one and the salesperson assigns each task (immediately and irrevocably) to its timeslot. Along the way, we also show a tight $O(\log{n})$-competitiveness result for uniform metrics, i.e., where items are of different types and the goal is to order them so as to minimize the number of switches between consecutive items of different types.
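The objective described above can be made concrete with a small sketch. The function names `sorting_cost` and `place_nearest_free` are hypothetical, and the placement rule shown is a naive illustration for reals in $[0,1]$, not the algorithm analyzed in the paper.

```python
from typing import List, Optional


def sorting_cost(cells: List[float]) -> float:
    """Sum of absolute differences between items in consecutive cells."""
    return sum(abs(a - b) for a, b in zip(cells, cells[1:]))


def place_nearest_free(items: List[float]) -> List[float]:
    """Naive online rule (an assumption for illustration only):
    put each item in the empty cell closest to index floor(x * n)."""
    n = len(items)
    cells: List[Optional[float]] = [None] * n
    for x in items:
        target = min(int(x * n), n - 1)
        free = [i for i, c in enumerate(cells) if c is None]
        # The placement is immediate and irrevocable: cells are never moved.
        idx = min(free, key=lambda i: abs(i - target))
        cells[idx] = x
    return [c for c in cells if c is not None]


if __name__ == "__main__":
    stream = [0.9, 0.1, 0.5, 0.45, 0.8]
    online = place_nearest_free(stream)
    offline = sorted(stream)
    print("online cost: ", sorting_cost(online))
    print("offline cost:", sorting_cost(offline))
```

The offline cost of a sorted array is simply the difference between the largest and smallest item, which is the baseline the competitive ratio is measured against.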
Submitted 27 June, 2024;
originally announced June 2024.
-
Heterogeneous Image-based Classification Using Distributional Data Analysis
Authors:
Alec Reinhardt,
Newsha Nikzad,
Raven J. Hollis,
Galia Jacobson,
Millicent A. Roach,
Mohamed Badawy,
Peter Chul Park,
Laura Beretta,
Prasun K Jalal,
David T. Fuentes,
Eugene J. Koay,
Suprateek Kundu
Abstract:
Diagnostic imaging has gained prominence as potential biomarkers for early detection and diagnosis in a diverse array of disorders including cancer. However, existing methods routinely face challenges arising from various factors such as image heterogeneity. We develop a novel imaging-based distributional data analysis (DDA) approach that incorporates the probability (quantile) distribution of the pixel-level features as covariates. The proposed approach uses a smoothed quantile distribution (via a suitable basis representation) as functional predictors in a scalar-on-functional quantile regression model. Some distinctive features of the proposed approach include the ability to: (i) account for heterogeneity within the image; (ii) incorporate granular information spanning the entire distribution; and (iii) tackle variability in image sizes for unregistered images in cancer applications. Our primary goal is risk prediction in Hepatocellular carcinoma that is achieved via predicting the change in tumor grades at post-diagnostic visits using pre-diagnostic enhancement pattern mapping (EPM) images of the liver. Along the way, the proposed DDA approach is also used for case versus control diagnosis and risk stratification objectives. Our analysis reveals that when coupled with global structural radiomics features derived from the corresponding T1-MRI scans, the proposed smoothed quantile distributions derived from EPM images showed considerable improvements in sensitivity and comparable specificity in contrast to classification based on routinely used summary measures that do not account for image heterogeneity. Given that there are limited predictive modeling approaches based on heterogeneous images in cancer, the proposed method is expected to provide considerable advantages in image-based early detection and risk prediction.
Submitted 11 March, 2024;
originally announced March 2024.
-
Approximate Earth Mover's Distance in Truly-Subquadratic Time
Authors:
Lorenzo Beretta,
Aviad Rubinstein
Abstract:
We design an additive approximation scheme for estimating the cost of the min-weight bipartite matching problem: given a bipartite graph with non-negative edge costs and $\varepsilon > 0$, our algorithm estimates the cost of matching all but $O(\varepsilon)$-fraction of the vertices in truly subquadratic time $O(n^{2-δ(\varepsilon)})$.
Our algorithm has a natural interpretation for computing the Earth Mover's Distance (EMD), up to an $\varepsilon$-additive approximation. Notably, we make no assumptions about the underlying metric (more generally, the costs do not have to satisfy the triangle inequality). Note that compared to the size of the instance (an arbitrary $n \times n$ cost matrix), our algorithm runs in {\em sublinear} time.
Our algorithm can approximate a slightly more general problem: max-cardinality bipartite matching with a knapsack constraint, where the goal is to maximize the number of vertices that can be matched up to a total cost $B$.
Submitted 10 November, 2023; v1 submitted 30 October, 2023;
originally announced October 2023.
-
Multi-Swap $k$-Means++
Authors:
Lorenzo Beretta,
Vincent Cohen-Addad,
Silvio Lattanzi,
Nikos Parotsidis
Abstract:
The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' algorithm of choice for optimizing the popular $k$-means clustering objective and is known to give an $O(\log k)$-approximation in expectation. To obtain higher quality solutions, Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps obtained through the $k$-means++ sampling distribution to yield a $c$-approximation to the $k$-means clustering problem, where $c$ is a large absolute constant. Here we generalize and extend their local search algorithm by considering larger and more sophisticated local search neighborhoods, which allows swapping multiple centers at the same time. Our algorithm achieves a $9 + \varepsilon$ approximation ratio, which is the best possible for local search. Importantly, we show that our approach yields substantial practical improvements: it achieves significant quality improvements over the approach of Lattanzi and Sohler (ICML 2019) on several datasets.
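As a rough illustration of the building blocks mentioned in this abstract, the sketch below shows standard $k$-means++ seeding followed by one single-swap local search step driven by the same sampling distribution. It is a simplified sketch of the single-swap baseline, not the multi-swap algorithm of the paper; all function names are hypothetical, and it assumes the input contains at least $k$ distinct points.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def point_costs(points: Sequence[Point], centers: Sequence[Point]) -> List[float]:
    """Squared distance from each point to its nearest center."""
    return [min(sq_dist(p, c) for c in centers) for p in points]


def kmeanspp_seed(points: Sequence[Point], k: int, rng: random.Random) -> List[Point]:
    """Pick the first center uniformly, then each next center with
    probability proportional to its current squared distance."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = point_costs(points, centers)
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def single_swap_step(points: Sequence[Point], centers: List[Point],
                     rng: random.Random) -> List[Point]:
    """Sample one candidate from the same distribution and keep the best
    swap with an existing center, but only if it lowers the total cost."""
    weights = point_costs(points, centers)
    best_cost = sum(weights)
    if best_cost == 0:
        return centers
    candidate = rng.choices(points, weights=weights, k=1)[0]
    best = centers
    for i in range(len(centers)):
        trial = centers[:i] + [candidate] + centers[i + 1:]
        trial_cost = sum(point_costs(points, trial))
        if trial_cost < best_cost:
            best, best_cost = trial, trial_cost
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(float(x), float(y)) for x in range(6) for y in range(6)]
    centers = kmeanspp_seed(data, 3, rng)
    for _ in range(10):
        centers = single_swap_step(data, centers, rng)
    print(sum(point_costs(data, centers)))
```

Because a swap is accepted only when it strictly reduces the cost, the objective never increases across steps.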
Submitted 28 September, 2023;
originally announced September 2023.
-
Locally Uniform Hashing
Authors:
Ioana O. Bercea,
Lorenzo Beretta,
Jonas Klausen,
Jakob Bæk Tejs Houen,
Mikkel Thorup
Abstract:
Hashing is a common technique used in data processing, with a strong impact on the time and resources spent on computation. Hashing also affects the applicability of theoretical results that often assume access to (unrealistic) uniform/fully-random hash functions. In this paper, we are concerned with designing hash functions that are practical and come with strong theoretical guarantees on their performance.
To this end, we present tornado tabulation hashing, which is simple, fast, and exhibits a certain full, local randomness property that provably makes diverse algorithms perform almost as if (abstract) fully-random hashing was used. For example, this includes classic linear probing, the widely used HyperLogLog algorithm of Flajolet, Fusy, Gandouet, Meunier [AOFA 97] for counting distinct elements, and the one-permutation hashing of Li, Owen, and Zhang [NIPS 12] for large-scale machine learning. We also provide a very efficient solution for the classical problem of obtaining fully-random hashing on a fixed (but unknown to the hash function) set of $n$ keys using $O(n)$ space. As a consequence, we get more efficient implementations of the splitting trick of Dietzfelbinger and Rink [ICALP'09] and the succinct space uniform hashing of Pagh and Pagh [SICOMP'08].
Tornado tabulation hashing is based on a simple method to systematically break dependencies in tabulation-based hashing techniques.
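For readers unfamiliar with tabulation-based hashing, the sketch below shows classic simple tabulation, the basic scheme that such techniques build on. It is not tornado tabulation itself, whose construction is not given in this abstract; the function name and parameter defaults are assumptions chosen for illustration.

```python
import random
from typing import Callable


def make_simple_tabulation(num_chars: int = 4, char_bits: int = 8,
                           out_bits: int = 32, seed: int = 0) -> Callable[[int], int]:
    """Simple tabulation: view the key as num_chars characters of char_bits
    bits each, look each character up in its own random table, and XOR
    the results together."""
    rng = random.Random(seed)
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(num_chars)]
    mask = (1 << char_bits) - 1

    def h(key: int) -> int:
        value = 0
        for i in range(num_chars):
            char = (key >> (i * char_bits)) & mask  # i-th character of the key
            value ^= tables[i][char]
        return value

    return h


if __name__ == "__main__":
    h = make_simple_tabulation()
    print(h(12345), h(12346))
```

With the defaults above, keys are treated as 32-bit integers; higher bits of larger keys are ignored.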
Submitted 28 September, 2023; v1 submitted 27 August, 2023;
originally announced August 2023.
-
Online Sorting and Translational Packing of Convex Polygons
Authors:
Anders Aamand,
Mikkel Abrahamsen,
Lorenzo Beretta,
Linda Kleist
Abstract:
We investigate several online packing problems in which convex polygons arrive one by one and have to be placed irrevocably into a container, while the aim is to minimize the used space. Among other variants, we consider strip packing and bin packing, where the container is the infinite horizontal strip $[0,\infty)\times [0,1]$ or a collection of $1 \times 1$ bins, respectively.
We draw interesting connections to the following online sorting problem OnlineSorting$[γ,n]$: We receive a stream of real numbers $s_1,\ldots,s_n$, $s_i\in[0,1]$, one by one. Each real must be placed in an array $A$ with $γn$ initially empty cells without knowing the subsequent reals. The goal is to minimize the sum of differences of consecutive reals in $A$. The offline optimum is to place the reals in sorted order so the cost is at most $1$. We show that for any $Δ$-competitive online algorithm of OnlineSorting$[γ,n]$, it holds that $γΔ\inΩ(\log n/\log \log n)$.
We use this lower bound to prove the non-existence of competitive algorithms for various online translational packing problems of convex polygons, among them strip packing, bin packing and perimeter packing. This also implies that there exists no online algorithm that can pack all streams of pieces of diameter and total area at most $δ$ into the unit square. These results are in contrast to the case when the pieces are restricted to rectangles, for which competitive algorithms are known. Likewise, the offline versions of packing convex polygons have constant factor approximation algorithms.
As a complement, we also include algorithms for both online sorting and translation-only online strip packing with non-trivial competitive ratios. Our algorithm for strip packing relies on a new technique for recursively subdividing the strip into parallelograms of varying height, thickness and slope.
Submitted 8 April, 2024; v1 submitted 7 December, 2021;
originally announced December 2021.
-
An Optimal Algorithm for Finding Champions in Tournament Graphs
Authors:
Lorenzo Beretta,
Franco Maria Nardini,
Roberto Trani,
Rossano Venturini
Abstract:
A tournament graph is a complete directed graph, which can be used to model a round-robin tournament between $n$ players. In this paper, we address the problem of finding a champion of the tournament, also known as a Copeland winner, which is a player that wins the highest number of matches. In detail, we aim to investigate algorithms that find the champion by playing a low number of matches. Solving this problem allows us to speed up several Information Retrieval and Recommender System applications, including question answering, conversational search, etc. Indeed, these applications often search for the champion by inducing a round-robin tournament among the players and employing a machine learning model to estimate who wins each pairwise comparison. Our contribution, thus, allows finding the champion by performing a low number of model inferences. We prove that any deterministic or randomized algorithm finding a champion with constant success probability requires $Ω(\ell n)$ comparisons, where $\ell$ is the number of matches lost by the champion. We then present an asymptotically-optimal deterministic algorithm matching this lower bound without knowing $\ell$, and we extend our analysis to three variants of the problem. Lastly, we conduct a comprehensive experimental assessment of the proposed algorithms on a question answering task on public data. Results show that our proposed algorithms speed up the retrieval of the champion up to $13\times$ with respect to the state-of-the-art algorithm that performs the full tournament.
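To make the definition of a champion concrete, the following sketch plays the full round-robin tournament, i.e. all $n(n-1)/2$ matches, and returns a Copeland winner together with its number of losses $\ell$. This is only the brute-force baseline, not the optimal algorithm of the paper; the function name `copeland_winner` and the `beats` callback are hypothetical.

```python
from typing import Callable, Tuple


def copeland_winner(n: int, beats: Callable[[int, int], bool]) -> Tuple[int, int]:
    """Play every match once; beats(i, j) is True when player i defeats
    player j. Returns (champion, losses of the champion)."""
    wins = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if beats(i, j):
                wins[i] += 1
            else:
                wins[j] += 1
    # Ties are broken in favor of the lowest player index.
    champion = max(range(n), key=lambda p: wins[p])
    losses = n - 1 - wins[champion]
    return champion, losses


if __name__ == "__main__":
    # Toy outcome rule: the lower-numbered player always wins.
    print(copeland_winner(5, lambda i, j: i < j))
```

In the applications mentioned above, each call to `beats` would correspond to one model inference, which is why reducing the number of matches matters.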
Submitted 18 April, 2023; v1 submitted 26 November, 2021;
originally announced November 2021.
-
Better Sum Estimation via Weighted Sampling
Authors:
Lorenzo Beretta,
Jakub Tětek
Abstract:
Given a large set $U$ where each item $a\in U$ has weight $w(a)$, we want to estimate the total weight $W=\sum_{a\in U} w(a)$ to within a factor of $1\pm\varepsilon$ with some constant probability $>1/2$. Since $n=|U|$ is large, we want to do this without looking at the entire set $U$. In the traditional setting in which we are allowed to sample elements from $U$ uniformly, sampling $Ω(n)$ items is necessary to provide any non-trivial guarantee on the estimate. Therefore, we investigate this problem in different settings: in the \emph{proportional} setting we can sample items with probabilities proportional to their weights, and in the \emph{hybrid} setting we can sample both proportionally and uniformly. These settings have applications, for example, in sublinear-time algorithms and distribution testing.
Sum estimation in the proportional and hybrid setting has been considered before by Motwani, Panigrahy, and Xu [ICALP, 2007]. In their paper, they give both upper and lower bounds in terms of $n$. Their bounds are near-matching in terms of $n$, but not in terms of $\varepsilon$. In this paper, we improve both their upper and lower bounds. Our bounds are matching up to constant factors in both settings, in terms of both $n$ and $\varepsilon$. No lower bounds with dependency on $\varepsilon$ were known previously. In the proportional setting, we improve their $\tilde{O}(\sqrt{n}/\varepsilon^{7/2})$ algorithm to $O(\sqrt{n}/\varepsilon)$. In the hybrid setting, we improve $\tilde{O}(\sqrt[3]{n}/ \varepsilon^{9/2})$ to $O(\sqrt[3]{n}/\varepsilon^{4/3})$. Our algorithms are also significantly simpler and do not have large constant factors. We also investigate the previously unexplored setting where $n$ is unknown to the algorithm. Finally, we show how our techniques apply to the problem of edge counting in graphs.
Submitted 28 October, 2021;
originally announced October 2021.
-
Online Packing to Minimize Area or Perimeter
Authors:
Mikkel Abrahamsen,
Lorenzo Beretta
Abstract:
We consider online packing problems where we get a stream of axis-parallel rectangles. The rectangles have to be placed in the plane without overlapping, and each rectangle must be placed without knowing the subsequent rectangles. The goal is to minimize the perimeter or the area of the axis-parallel bounding box of the rectangles. We either allow rotations by 90 degrees or translations only.
For the perimeter version we give algorithms with an absolute competitive ratio slightly less than 4 when only translations are allowed and when rotations are also allowed.
We then turn our attention to minimizing the area and show that the asymptotic competitive ratio of any algorithm is at least $Ω(\sqrt{n})$, where $n$ is the number of rectangles in the stream, and this holds with and without rotations. We then present algorithms that match this bound in both cases. We also show that the competitive ratio cannot be bounded as a function of OPT. We then consider two special cases.
The first is when all the given rectangles have aspect ratios bounded by some constant. The particular variant where all the rectangles are squares and we want to minimize the area of the bounding square has been studied before and an algorithm with a competitive ratio of 8 has been given [Fekete and Hoffmann, Algorithmica, 2017]. We improve the analysis of the algorithm and show that the ratio is at most 6, which is tight.
The second special case is when all edges have length at least 1. Here, the $Ω(\sqrt n)$ lower bound still holds, and we turn our attention to lower bounds depending on OPT. We show that any algorithm has an asymptotic competitive ratio of at least $Ω(\sqrt{OPT})$ for the translational case and $Ω(\sqrt[4]{OPT})$ when rotations are allowed. For both versions, we give algorithms that match the respective lower bounds.
Submitted 25 January, 2021; v1 submitted 22 January, 2021;
originally announced January 2021.