-
Massively-Parallel Heat Map Sorting and Applications To Explainable Clustering
Authors:
Sepideh Aghamolaei,
Mohammad Ghodsi
Abstract:
Given a set of points labeled with $k$ labels, we introduce the heat map sorting problem as reordering and merging the points and dimensions while preserving the clusters (labels). A cluster is preserved if it remains connected, i.e., if it is not split into several clusters and no two clusters are merged.
We prove the problem is NP-hard and we give a fixed-parameter algorithm with a constant number of rounds in the massively parallel computation model, where each machine has sublinear memory and the total memory of the machines is linear. We give an approximation algorithm for an NP-hard special case of the problem. We empirically compare our algorithm with k-means and density-based clustering (DBSCAN) using a dimensionality reduction via locality-sensitive hashing on several directed and undirected graphs of email and computer networks.
Submitted 14 September, 2023;
originally announced September 2023.
-
A 2-Approximation Algorithm for Data-Distributed Metric k-Center
Authors:
Sepideh Aghamolaei,
Mohammad Ghodsi
Abstract:
In a metric space, a set of point sets of roughly the same size and an integer $k\geq 1$ are given as input. The goal of data-distributed $k$-center is to find a subset of size $k$ of the input points as the set of centers that minimizes the maximum distance from the input points to their closest centers. Metric $k$-center is known to be NP-hard, which carries over to the data-distributed setting.
We give a $2$-approximation algorithm for $k$-center with sublinear $k$ in the data-distributed setting, which is tight. This algorithm works in several models, including the massively parallel computation (MPC) model.
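For orientation, the classic sequential farthest-first traversal also achieves a factor of $2$ for metric $k$-center. The sketch below is that textbook greedy, not the data-distributed algorithm of the paper; the function name `farthest_first_centers` and the Euclidean metric are assumptions made for illustration.

```python
import math

def farthest_first_centers(points, k):
    """Greedy farthest-first traversal (sequential 2-approximation sketch).

    Assumes points are coordinate tuples under the Euclidean metric.
    """
    centers = [points[0]]  # arbitrary first center
    # dist[i] = distance from points[i] to its closest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # The next center is the point farthest from all current centers.
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers, max(dist)

points = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 8)]
centers, radius = farthest_first_centers(points, 3)
print(centers, radius)
```

The returned radius is the covering radius of the chosen centers, which is at most twice the optimum for any metric.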
Submitted 8 September, 2023;
originally announced September 2023.
-
A Massively Parallel Dynamic Programming for Approximate Rectangle Escape Problem
Authors:
Sepideh Aghamolaei,
Mohammad Ghodsi
Abstract:
The massively parallel computation (MPC) model requires sublinear time complexity. We break dynamic programs into a set of sparse dynamic programs that can be divided, solved, and merged in sublinear time.
The rectangle escape problem (REP) is defined as follows: For $n$ axis-aligned rectangles inside an axis-aligned bounding box $B$, extend each rectangle in only one of the four directions: up, down, left, or right until it reaches $B$ and the density $k$ is minimized, where $k$ is the maximum number of extensions of rectangles to the boundary that pass through a point inside bounding box $B$. REP is NP-hard for $k>1$. If the rectangles are points of a grid (or unit squares of a grid), the problem is called the square escape problem (SEP) and it is still NP-hard.
We give a $2$-approximation algorithm for SEP with $k\geq2$ with time complexity $O(n^{3/2}k^2)$. This improves the time complexity of existing algorithms, which are at least quadratic. Also, the approximation ratio of our algorithm for $k\geq 3$ is $3/2$, which is tight. We also give an $8$-approximation algorithm for REP with time complexity $O(n\log n+nk)$ and give an MPC version of this algorithm for $k=O(1)$, which is the first parallel algorithm for this problem.
Submitted 1 September, 2023;
originally announced September 2023.
-
An Efficient Construction of Yao-Graph in Data-Distributed Settings
Authors:
Sepideh Aghamolaei,
Mohammad Ghodsi
Abstract:
A sparse graph that preserves an approximation of the shortest paths between all pairs of points in a plane is called a geometric spanner. Using range trees of sublinear size, we design an algorithm in the massively parallel computation (MPC) model for constructing a geometric spanner known as the Yao-graph. This improves the total time and the total memory of existing algorithms for geometric spanners from subquadratic to near-linear.
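As a point of reference, a Yao-graph can be built by brute force: around each point the plane is split into equal-angle cones, and the point is linked to its nearest neighbor in each cone. The sketch below is this quadratic-time sequential construction, not the range-tree MPC algorithm of the paper; the function name `yao_graph` and the `cones` parameter are assumptions made for illustration.

```python
import math

def yao_graph(points, cones=6):
    """Brute-force O(n^2) Yao-graph sketch: directed edge i -> nearest j per cone."""
    width = 2 * math.pi / cones
    edges = set()
    for i, (px, py) in enumerate(points):
        nearest = {}  # cone index -> (distance, neighbor index)
        for j, (qx, qy) in enumerate(points):
            if i == j:
                continue
            angle = math.atan2(qy - py, qx - px) % (2 * math.pi)
            # Clamp guards against a rounding result of exactly 2*pi.
            cone = min(int(angle / width), cones - 1)
            d = math.hypot(qx - px, qy - py)
            if cone not in nearest or d < nearest[cone][0]:
                nearest[cone] = (d, j)
        for _, j in nearest.values():
            edges.add((i, j))
    return edges

edges = yao_graph([(0, 0), (1, 0), (2, 0), (0, 1)], cones=4)
print(sorted(edges))
```

Each point has at most `cones` outgoing edges, so the graph has a linear number of edges.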
Submitted 29 August, 2023;
originally announced August 2023.
-
Clustering Geometrically-Modeled Points in the Aggregated Uncertainty Model
Authors:
Vahideh Keikha,
Sepideh Aghamolaei,
Ali Mohades,
Mohammad Ghodsi
Abstract:
The $k$-center problem is to choose a subset of size $k$ from a set of $n$ points such that the maximum distance from each point to its nearest center is minimized. Let $Q=\{Q_1,\ldots,Q_n\}$ be a set of polygons or segments in the region-based uncertainty model, in which each $Q_i$ is an uncertain point, where the exact locations of the points in $Q_i$ are unknown. Geometric objects such as segments and polygons can model a point set. We define the uncertain version of the $k$-center problem as a generalization in which the objective is to find $k$ points from $Q$ to cover the remaining regions of $Q$ with minimum or maximum radius of the cluster to cover at least one or all exact instances of each $Q_i$, respectively. We modify the region-based model to allow multiple points to be chosen from a region and call the resulting model the aggregated uncertainty model. All these problems contain the point version as a special case, so they are all NP-hard, with a lower bound of 1.822 on the approximation factor. We give approximation algorithms for uncertain $k$-center of a set of segments and polygons. We have also implemented some of our algorithms on a data-set to show that our theoretical performance guarantees can be achieved in practice.
Submitted 26 January, 2022; v1 submitted 27 November, 2021;
originally announced November 2021.
-
Computing The Packedness of Curves
Authors:
Sepideh Aghamolaei,
Vahideh Keikha,
Mohammad Ghodsi,
Ali Mohades
Abstract:
A polygonal curve $P$ with $n$ vertices is $c$-packed, if the sum of the lengths of the parts of the edges of the curve that are inside any disk of radius $r$ is at most $cr$, for any $r>0$. Similarly, the concept of $c$-packedness can be defined for any scaling of a given shape.
A polygonal curve $P$ with $n$ vertices is $c$-packed, if the sum of the lengths of the parts of the edges of the curve that are inside any disk of radius $r$ is at most $cr$, for any $r>0$. Similarly, the concept of $c$-packedness can be defined for any scaling of a given shape.
Assuming $L$ is the diameter of $P$ and $δ$ is the minimum distance between points on disjoint edges of $P$, we show that the existing $O(\frac{\log (L/δ)}{ε}n^3)$ time algorithm is a $(1+ε)$-approximation algorithm. The massively parallel versions of these algorithms run in $O(\log (L/δ))$ rounds. We improve the existing $O((\frac{n}{ε^3})^{\frac 4 3}\operatorname{polylog} \frac n ε)$ time $(6+ε)$-approximation algorithm by providing a $(4+ε)$-approximation algorithm with $O(n(\log^2 n)(\log^2 \frac{1}{ε})+\frac{n}{ε})$ running time. We also give an $O(n^2)$ time $2$-approximation algorithm, improving the existing $O(n^2\log n)$ time $2$-approximation algorithm.
Our exact $c$-packedness algorithm takes $O(n^5)$ time, which is the first exact algorithm for disks. We show using $α$-fat shapes instead of disks adds a factor $α^2$ to the approximation.
We also give a data-structure for computing the curve-length inside query disks. It has $O(n^6\log n)$ construction time, uses $O(n^6)$ space, and has query time $O(\log n+k)$, where $k$ is the number of intersected segments with the query shape. We also give a massively parallel algorithm for relative $c$-packedness with $O(1)$ rounds.
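The quantity at the heart of $c$-packedness, the length of the curve inside a disk, can be computed directly by clipping each edge against the disk. The sketch below does this in linear time per query with no preprocessing; it is not the data-structure described above, and the function name `length_inside_disk` is an assumption made for illustration.

```python
import math

def length_inside_disk(polyline, center, r):
    """Total length of the polyline's edges inside the disk (center, r)."""
    cx, cy = center
    total = 0.0
    for (ax, ay), (bx, by) in zip(polyline, polyline[1:]):
        dx, dy = bx - ax, by - ay
        fx, fy = ax - cx, ay - cy
        # Solve |a + t(b - a) - c|^2 = r^2 for t, a quadratic in t.
        qa = dx * dx + dy * dy
        if qa == 0:
            continue  # degenerate edge of zero length
        qb = 2 * (fx * dx + fy * dy)
        qc = fx * fx + fy * fy - r * r
        disc = qb * qb - 4 * qa * qc
        if disc <= 0:
            continue  # the supporting line misses or only touches the disk
        s = math.sqrt(disc)
        t1 = max(0.0, (-qb - s) / (2 * qa))  # clip entry parameter to the edge
        t2 = min(1.0, (-qb + s) / (2 * qa))  # clip exit parameter to the edge
        if t2 > t1:
            total += (t2 - t1) * math.sqrt(qa)
    return total

curve = [(-2, 0), (2, 0), (2, 2)]
print(length_inside_disk(curve, (0, 0), 1))
```

For a given disk of radius $r$, the curve respects the $c$-packedness bound at that disk when the returned length is at most $cr$.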
Submitted 21 December, 2020; v1 submitted 8 December, 2020;
originally announced December 2020.
-
Catching a Polygonal Fish with a Minimum Net
Authors:
Sepideh Aghamolaei
Abstract:
Given a polygon $P$ in the plane that can be translated, rotated and enlarged arbitrarily inside a unit square, the goal is to find a set of lines such that at least one of them always hits $P$ and the number of lines is minimized. We prove the solution is always a regular grid or a set of equidistant parallel lines, whose distance depends on $P$.
Submitted 12 January, 2021; v1 submitted 10 August, 2020;
originally announced August 2020.
-
A Data-Structure for Approximate Longest Common Subsequence of A Set of Strings
Authors:
Sepideh Aghamolaei
Abstract:
Given a set of $k$ strings $I$, their longest common subsequence (LCS) is the string with the maximum length that is a subsequence of all the strings in $I$. A data-structure for this problem preprocesses $I$ such that the LCS of a set of query strings $Q$ with the strings of $I$ can be computed faster. Since the problem is NP-hard for arbitrary $k$, we permit an error in which some characters may be replaced by other characters. We define the approximation version of the problem with an extra input $m$, which is the length of the regular expression (regex) that describes the input, and the approximation factor is the logarithm of the number of possibilities in the regex returned by the algorithm, divided by the logarithm of the number of possibilities in the regex with the minimum number of possibilities. Then, we use a tree data-structure to achieve sublinear-time LCS queries. We also explain how the idea can be extended to the longest increasing subsequence (LIS) problem.
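For two strings, the LCS is computed exactly by the standard quadratic dynamic program. The sketch below shows only that textbook baseline, not the approximate tree data-structure for $k$ strings described above; the function name `lcs` is an assumption made for illustration.

```python
def lcs(a, b):
    """Longest common subsequence of two strings via the classic O(|a||b|) table."""
    n, m = len(a), len(b)
    # table[i][j] = LCS length of the suffixes a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover one optimal subsequence.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)

print(lcs("AGGTAB", "GXTXAYB"))
```

Applying this pairwise to more than two strings does not in general yield their common LCS, which is one reason the $k$-string problem needs a different approach.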
Submitted 12 January, 2021; v1 submitted 4 August, 2020;
originally announced August 2020.
-
Symmetries: From Proofs To Algorithms And Back
Authors:
Sepideh Aghamolaei
Abstract:
We call an objective function or algorithm symmetric with respect to an input if after swapping two parts of the input in any algorithm, the solution of the algorithm and the output remain the same. More formally, for a permutation $π$ of an indexed input, and another permutation $π'$ of the same input, such that swapping two items converts $π$ to $π'$, $f(π)=f(π')$, where $f$ is the objective function.
After reviewing samples of the algorithms that exploit symmetry, we give several new ones, for finding lower-bounds, beating adversaries in online algorithms, designing parallel algorithms and data summarization. We show how to use the symmetry between the sampled points to get a lower/upper bound on the solution. This mostly depends on the equivalence class of the parts of the input that when swapped, do not change the solution or its cost.
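The swap condition $f(π)=f(π')$ can be checked empirically on a concrete input. The sketch below tests every single swap of one given ordering, which is a necessary check for that input rather than a proof of symmetry; the function name `is_symmetric_under_swaps` is an assumption made for illustration.

```python
from itertools import combinations

def is_symmetric_under_swaps(f, items):
    """Return True if f is unchanged by every swap of two positions in items."""
    base = f(list(items))
    for i, j in combinations(range(len(items)), 2):
        swapped = list(items)
        swapped[i], swapped[j] = swapped[j], swapped[i]
        if f(swapped) != base:
            return False
    return True

print(is_symmetric_under_swaps(max, [3, 1, 2]))                 # order-independent objective
print(is_symmetric_under_swaps(lambda xs: xs[0], [3, 1, 2]))    # depends on input order
```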
Submitted 13 January, 2021; v1 submitted 27 July, 2020;
originally announced July 2020.
-
Point-Location in The Arrangement of Curves
Authors:
Sepideh Aghamolaei,
Mohammad Ghodsi
Abstract:
An arrangement of $n$ curves in the plane is given. The query is a point $q$ and the goal is to find the face of the arrangement that contains $q$. A data-structure for point-location preprocesses the curves into a structure of polynomial size in $n$, such that queries can be answered in time polylogarithmic in $n$.
We design a data structure that answers point-location queries in $O(\log C(n)+\log S(n))$ time using $O(T(n)+S(n)\log(S(n)))$ preprocessing time, provided that a polygonal subdivision of total size $S(n)$ with cell complexity at most $C(n)$ can be computed in time $T(n)$, such that the parts of the curves inside each cell have a monotone order with respect to at least one segment of the boundary of the cell. We call such a partitioning a curve-monotone polygonal subdivision.
Submitted 4 December, 2020; v1 submitted 22 July, 2020;
originally announced July 2020.
-
Approximating The p-Mean Curve of Large Data-Sets
Authors:
Sepideh Aghamolaei,
Mohammad Ghodsi
Abstract:
A set of piecewise linear functions, called polylines, $P_1,\ldots,P_L$ each with at most $n$ vertices can be simplified into a polyline $M$ with $k$ vertices, such that the Fréchet distances $ε_1,\ldots,ε_L$ to each of these polylines are minimized under the $L_p$ distance. We call $M$ for $L_p$ with $p\geq 1$ a $p$-mean curve ($p$-MC).
We discuss $p\geq 1$, for which the $L_p$ distance satisfies the triangle inequality; the $p$-mean has not been discussed before for most values of $p$. Computing the $p$-mean polyline is NP-hard for $L=Ω(1)$ and some values of $p$, so we discuss approximation algorithms.
We give an $O(n^2\log k)$ time exact algorithm for $L=2$ and $p\geq 1$. Also, we reduce the Fréchet distance to the discrete Fréchet distance, which adds a factor $2$ to both $k$ and $ε$. Then we use our exact algorithm to find a $3$-approximation for $L>2$ in $\operatorname{poly}(n,L)$ time. Our method is based on a generalization of the free-space diagram (FSD) for Fréchet distance and composable core-sets for approximate summaries.
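The discrete Fréchet distance mentioned above has a well-known quadratic dynamic program over pairs of vertices. The sketch below shows that standard recurrence for two polylines under the Euclidean metric; it is not the $p$-mean curve algorithm of the paper, and the function name `discrete_frechet` is an assumption made for illustration.

```python
import math

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between vertex sequences P and Q."""
    n, m = len(P), len(Q)
    # ca[i][j] = discrete Fréchet distance between P[:i+1] and Q[:j+1]
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(P[i], Q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance on P, on Q, or on both; keep the best predecessor.
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]

P = [(0, 0), (1, 0), (2, 0)]
Q = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(P, Q))
```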
Submitted 27 August, 2021; v1 submitted 13 May, 2020;
originally announced May 2020.
-
A Composable Coreset for k-Center in Doubling Metrics
Authors:
Sepideh Aghamolaei,
Mohammad Ghodsi
Abstract:
A set of points $P$ in a metric space and a constant integer $k$ are given. The $k$-center problem finds $k$ points as centers among $P$, such that the maximum distance of any point of $P$ to its closest center $(r)$ is minimized.
Doubling metrics are metric spaces in which for any $r$, a ball of radius $r$ can be covered using a constant number of balls of radius $r/2$. Fixed dimensional Euclidean spaces are doubling metrics. The lower bound on the approximation factor of $k$-center is $1.822$ in Euclidean spaces; however, $(1+ε)$-approximation algorithms with exponential dependency on $\frac{1}{ε}$ and $k$ exist.
For a given set of sets $P_1,\ldots,P_L$, a composable coreset independently computes subsets $C_1\subset P_1, \ldots, C_L\subset P_L$, such that $\cup_{i=1}^L C_i$ contains an approximation of a measure of the set $\cup_{i=1}^L P_i$.
We introduce a $(1+ε)$-approximation composable coreset for $k$-center, which in doubling metrics has size sublinear in $|P|$. This results in a $(2+ε)$-approximation algorithm for $k$-center in MapReduce with a constant number of rounds in doubling metrics for any $ε>0$ and sublinear communications, which is based on parametric pruning.
We prove the exponential nature of the trade-off between the number of centers $(k)$ and the radius $(r)$, and give a composable coreset for a related problem called dual clustering. Also, we give a new version of the parametric pruning algorithm with $O(\frac{nk}ε)$ running time, $O(n)$ space and $2+ε$ approximation factor for metric $k$-center.
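The idea of parametric pruning can be illustrated with its textbook form: guess a radius $r$, greedily keep points that are more than $2r$ from every kept point, and accept the smallest guess that keeps at most $k$ points. The sketch below tries all pairwise distances as guesses, so it is far slower than the $O(\frac{nk}{ε})$ version described above; the function names `prune` and `k_center_parametric_pruning` and the Euclidean metric are assumptions made for illustration.

```python
import math

def prune(points, r):
    """Greedily keep points that are pairwise more than 2r apart."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > 2 * r for c in centers):
            centers.append(p)
    return centers

def k_center_parametric_pruning(points, k):
    """Return (centers, bound): every point is within `bound` of some center."""
    if k < 1 or not points:
        raise ValueError("need k >= 1 and a non-empty point set")
    # The optimal radius is one of the pairwise distances.
    radii = sorted({math.dist(p, q) for p in points for q in points})
    for r in radii:
        centers = prune(points, r)
        if len(centers) <= k:
            # The first feasible guess is at most the optimum, so 2r <= 2 * OPT.
            return centers, 2 * r
    raise AssertionError("unreachable: the largest guess always yields one center")

points = [(0, 0), (1, 0), (10, 0), (11, 0)]
print(k_center_parametric_pruning(points, 2))
```

Any guess at or above the optimal radius keeps at most $k$ points, because two kept points cannot share an optimal cluster, which is what gives the factor of $2$.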
Submitted 24 April, 2019; v1 submitted 5 February, 2019;
originally announced February 2019.