-
Fast Approximations and Coresets for (k, l)-Median under Dynamic Time Warping
Authors:
Jacobus Conradi,
Benedikt Kolbe,
Ioannis Psarros,
Dennis Rohde
Abstract:
We present algorithms for the computation of $\varepsilon$-coresets for $k$-median clustering of point sequences in $\mathbb{R}^d$ under the $p$-dynamic time warping (DTW) distance. Coresets under DTW have not been investigated before, and the analysis is not directly accessible to existing methods as DTW is not a metric. The three main ingredients that allow our construction of coresets are the adaptation of the $\varepsilon$-coreset framework of sensitivity sampling, bounds on the VC dimension of approximations to the range spaces of balls under DTW, and new approximation algorithms for the $k$-median problem under DTW. We achieve our results by investigating approximations of DTW that provide a trade-off between the provided accuracy and amenability to known techniques. In particular, we observe that given $n$ curves under DTW, one can directly construct a metric that approximates DTW on this set, permitting the use of the wealth of results on metric spaces for clustering purposes. The resulting approximations are the first with polynomial running time and achieve a very similar approximation factor as state-of-the-art techniques. We apply our results to produce a practical algorithm approximating $(k,\ell)$-median clustering under DTW.
Submitted 7 March, 2024; v1 submitted 15 December, 2023;
originally announced December 2023.
-
Data-driven soiling detection in PV modules
Authors:
Alexandros Kalimeris,
Ioannis Psarros,
Giorgos Giannopoulos,
Manolis Terrovitis,
George Papastefanatos,
Gregory Kotsis
Abstract:
Soiling is the accumulation of dirt on solar panels, which leads to a decreasing trend in solar energy yield and may be the cause of vast revenue losses. The effect of soiling can be reduced by washing the panels, which is, however, a procedure of non-negligible cost. Moreover, soiling monitoring systems are often unreliable or very costly. We study the problem of estimating the soiling ratio in photovoltaic (PV) modules, i.e., the ratio of the real power output to the power output that would be produced if solar panels were clean. A key advantage of our algorithms is that they estimate soiling without needing to train on labelled data, i.e., periods of explicitly monitoring the soiling in each park, and without relying on generic analytical formulas which do not take into account the peculiarities of each installation. We consider as input a time series comprising a minimum set of measurements that are available to most PV park operators. Our experimental evaluation shows that we significantly outperform current state-of-the-art methods for estimating the soiling ratio.
Submitted 30 January, 2023;
originally announced January 2023.
-
Random projections for curves in high dimensions
Authors:
Ioannis Psarros,
Dennis Rohde
Abstract:
Modern time series analysis requires the ability to handle datasets that are inherently high-dimensional; examples include applications in climatology, where measurements from numerous sensors must be taken into account, or inventory tracking of large shops, where the dimension is defined by the number of tracked items. The standard way to mitigate computational issues arising from the high dimensionality of the data is by applying some dimension reduction technique that preserves the structural properties of the ambient space. The dissimilarity between two time series is often measured by ``discrete'' notions of distance, e.g. the dynamic time warping or the discrete Fréchet distance. Since all these distance functions are computed directly on the points of a time series, they are sensitive to different sampling rates or gaps. The continuous Fréchet distance offers a popular alternative which aims to alleviate this by taking into account all points on the polygonal curve obtained by linearly interpolating between any two consecutive points in a sequence.
We study the ability of random projections à la Johnson and Lindenstrauss to preserve the continuous Fréchet distance of polygonal curves by effectively reducing the dimension. In particular, we show that one can reduce the dimension to $O(ε^{-2} \log N)$, where $N$ is the total number of input points, while preserving the continuous Fréchet distance between any two determined polygonal curves within a factor of $1\pm ε$. We conclude with applications on clustering.
Submitted 14 February, 2023; v1 submitted 15 July, 2022;
originally announced July 2022.
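The dimension reduction described in the abstract above can be illustrated with a minimal sketch of a Gaussian random projection applied to the vertices of a polygonal curve. Because the map is linear, projecting the vertices and interpolating gives the same curve as projecting every interpolated point. The function names, the dimensions, and the use of a dense Gaussian matrix are assumptions made for illustration; the sketch does not reproduce the paper's analysis or its choice of target dimension.

```python
import math
import random


def gaussian_projection(dim_in, dim_out, rng):
    """Return a dim_out x dim_in matrix with i.i.d. N(0, 1/dim_out) entries."""
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]


def project_point(matrix, point):
    """Apply the linear map to a single point."""
    return [sum(a * x for a, x in zip(row, point)) for row in matrix]


def project_curve(matrix, curve):
    """Project every vertex of a polygonal curve (a list of points)."""
    return [project_point(matrix, vertex) for vertex in curve]


rng = random.Random(0)
curve = [[rng.uniform(-1.0, 1.0) for _ in range(1000)] for _ in range(3)]
matrix = gaussian_projection(1000, 200, rng)
low_dim_curve = project_curve(matrix, curve)

# Ratio of the first edge length after and before projection (close to 1).
ratio = math.dist(low_dim_curve[0], low_dim_curve[1]) / math.dist(curve[0], curve[1])
```

Here a curve with vertices in dimension 1000 is mapped to dimension 200, and `ratio` measures how well the length of its first edge is preserved.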
-
Approximating Length-Restricted Means under Dynamic Time Warping
Authors:
Maike Buchin,
Anne Driemel,
Koen van Greevenbroek,
Ioannis Psarros,
Dennis Rohde
Abstract:
We study variants of the mean problem under the $p$-Dynamic Time Warping ($p$-DTW) distance, a popular and robust distance measure for sequential data. In our setting we are given a set of finite point sequences over an arbitrary metric space and we want to compute a mean point sequence of given length that minimizes the sum of $p$-DTW distances, each raised to the $q$\textsuperscript{th} power, between the input sequences and the mean sequence. In general, the problem is $\mathrm{NP}$-hard and known not to be fixed-parameter tractable in the number of sequences. On the positive side, we show that restricting the length of the mean sequence significantly reduces the hardness of the problem. We give an exact algorithm running in polynomial time for constant-length means. We explore various approximation algorithms that provide a trade-off between the approximation factor and the running time. Our approximation algorithms have a running time with only linear dependency on the number of input sequences. In addition, we use our mean algorithms to obtain clustering algorithms with theoretical guarantees.
Submitted 2 May, 2022; v1 submitted 1 December, 2021;
originally announced December 2021.
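To make the objective in the abstract above concrete, the following is a minimal sketch of the standard dynamic program for the $p$-DTW distance, together with the sum of distances raised to the $q$-th power. It assumes points in Euclidean space (the paper allows an arbitrary metric space), and the names `p_dtw` and `mean_cost` are hypothetical. It only evaluates the cost of a given candidate mean; it is not one of the paper's algorithms for finding a mean.

```python
import math


def p_dtw(sigma, tau, p=2.0):
    """p-DTW distance between two non-empty sequences of points (tuples)."""
    n, m = len(sigma), len(tau)
    inf = float("inf")
    # cost[i][j]: cheapest sum of p-th powers aligning sigma[:i] with tau[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(sigma[i - 1], tau[j - 1]) ** p
            cost[i][j] = step + min(cost[i - 1][j],      # advance in sigma only
                                    cost[i][j - 1],      # advance in tau only
                                    cost[i - 1][j - 1])  # advance in both
    return cost[n][m] ** (1.0 / p)


def mean_cost(inputs, center, p=2.0, q=1.0):
    """Sum over the input sequences of p_dtw(sequence, center) ** q."""
    return sum(p_dtw(seq, center, p) ** q for seq in inputs)


a = [(0.0,), (1.0,), (2.0,)]
b = [(0.0,), (2.0,)]
c = [(0.0,), (0.0,), (2.0,)]
```

With these inputs, `p_dtw(a, b, p=1)` evaluates to 1, while `p_dtw(b, c, p=1)` evaluates to 0 even though `b` and `c` are different sequences, which is one way to see that DTW is not a metric.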
-
Tight Bounds for Approximate Near Neighbor Searching for Time Series under the Fréchet Distance
Authors:
Karl Bringmann,
Anne Driemel,
André Nusser,
Ioannis Psarros
Abstract:
We study the $c$-approximate near neighbor problem under the continuous Fréchet distance: Given a set of $n$ polygonal curves with $m$ vertices, a radius $δ> 0$, and a parameter $k \leq m$, we want to preprocess the curves into a data structure that, given a query curve $q$ with $k$ vertices, either returns an input curve with Fréchet distance at most $c\cdot δ$ to $q$, or returns that there exists no input curve with Fréchet distance at most $δ$ to $q$. We focus on the case where the input and the queries are one-dimensional polygonal curves -- also called time series -- and we give a comprehensive analysis for this case. We obtain new upper bounds that provide different tradeoffs between approximation factor, preprocessing time, and query time.
Our data structures improve upon the state of the art in several ways. We show that for any $0 < \varepsilon \leq 1$ an approximation factor of $(1+\varepsilon)$ can be achieved within the same asymptotic time bounds as the previously best result for $(2+\varepsilon)$. Moreover, we show that an approximation factor of $(2+\varepsilon)$ can be obtained by using preprocessing time and space $O(nm)$, which is linear in the input size, and query time in $O(\frac{1}{\varepsilon})^{k+2}$, where the previously best result used preprocessing time in $n \cdot O(\frac{m}{\varepsilon k})^k$ and query time in $O(1)^k$. We complement our upper bounds with matching conditional lower bounds based on the Orthogonal Vectors Hypothesis. Interestingly, some of our lower bounds already hold for any super-constant value of $k$. This is achieved by proving hardness of a one-sided sparse version of the Orthogonal Vectors problem as an intermediate problem, which we believe to be of independent interest.
Submitted 3 November, 2021; v1 submitted 16 July, 2021;
originally announced July 2021.
-
$(2+ε)$-ANN for time series under the Fréchet distance
Authors:
Anne Driemel,
Ioannis Psarros
Abstract:
We study approximate-near-neighbor data structures for time series under the continuous Fréchet distance. For an attainable approximation factor $c>1$ and a query radius $r$, an approximate-near-neighbor data structure can be used to preprocess $n$ curves in $\mathbb{R}$ (aka time series), each of complexity $m$, to answer queries with a curve of complexity $k$ by either returning a curve that lies within Fréchet distance $cr$, or answering that there exists no curve in the input within distance $r$. In both cases, the answer is correct. Our first data structure achieves a $(5+ε)$ approximation factor, uses space in $n\cdot \mathcal{O}\left({ε^{-1}}\right)^{k} + \mathcal{O}(nm)$ and has query time in $\mathcal{O}\left(k\right)$. Our second data structure achieves a $(2+ε)$ approximation factor, uses space in $n\cdot \mathcal{O}\left(\frac{m}{kε}\right)^{k} + \mathcal{O}(nm)$ and has query time in $\mathcal{O}\left(k\cdot 2^k\right)$. Our third positive result is a probabilistic data structure based on locality-sensitive hashing, which achieves space in $\mathcal{O}(n\log n+nm)$ and query time in $\mathcal{O}(k\log n)$, and which answers queries with an approximation factor in $\mathcal{O}(k)$. All of our data structures make use of the concept of signatures, which were originally introduced for the problem of clustering time series under the Fréchet distance. In addition, we show lower bounds for this problem. Consider any data structure which achieves an approximation factor less than $2$ and which supports curves of arclength up to $L$ and answers the query using only a constant number of probes. We show that under reasonable assumptions on the word size any such data structure needs space in $L^{Ω(k)}$.
Submitted 5 March, 2021; v1 submitted 21 August, 2020;
originally announced August 2020.
-
Sublinear data structures for short Fréchet queries
Authors:
Anne Driemel,
Ioannis Psarros,
Melanie Schmidt
Abstract:
We study metric data structures for curves in doubling spaces, such as trajectories of moving objects in Euclidean $\mathbb{R}^d$, where the distance between two curves is measured using the discrete Fréchet distance. We design data structures in an \emph{asymmetric} setting where the input is a curve (or a set of $n$ curves) each of complexity $m$ and the queries are with curves of complexity $k\ll m$. We show that there exist approximate data structures that are independent of the input size $N = d \cdot n \cdot m$ and we study how to maintain them dynamically if the input is given in the stream.
Concretely, we study two types of data structures: (i) distance oracles, where the task is to store a compressed version of the input curve, which can be used to answer queries for the distance of a query curve to the input curve, and (ii) nearest-neighbor data structures, where the task is to preprocess a set of input curves to answer queries for the input curve closest to the query curve. In both cases we are interested in approximation. For curves embedded in Euclidean $\mathbb{R}^d$ with constant $d$, our distance oracle uses space in $\mathcal{O}((k \log(ε^{-1}) ε^{-d})^k)$ ($ε$ is the precision parameter). The oracle performs $(1+ε)$-approximate queries in time in $\mathcal{O}(k^2)$ and is deterministic. We show how to maintain this distance oracle in the stream using polylogarithmic additional memory. In the stream, we can dynamically answer distance queries to the portion of the stream seen so far in $\mathcal{O}(k^4 \log^2 m)$ time. We apply our techniques to the second problem, approximate near neighbor (ANN) data structures, and achieve an exponential improvement in the dependency on the complexity of the input curves compared to the state of the art.
Submitted 12 July, 2019; v1 submitted 9 July, 2019;
originally announced July 2019.
-
The VC Dimension of Metric Balls under Fréchet and Hausdorff Distances
Authors:
Anne Driemel,
André Nusser,
Jeff M. Phillips,
Ioannis Psarros
Abstract:
The Vapnik-Chervonenkis dimension provides a notion of complexity for systems of sets. If the VC dimension is small, then knowing this can drastically simplify fundamental computational tasks such as classification, range counting, and density estimation through the use of sampling bounds. We analyze set systems where the ground set $X$ is a set of polygonal curves in $\mathbb{R}^d$ and the sets $\mathcal{R}$ are metric balls defined by curve similarity metrics, such as the Fréchet distance and the Hausdorff distance, as well as their discrete counterparts. We derive upper and lower bounds on the VC dimension that imply useful sampling bounds in the setting that the number of curves is large, but the complexity of the individual curves is small. Our upper bounds are either near-quadratic or near-linear in the complexity of the curves that define the ranges and they are logarithmic in the complexity of the curves that define the ground set.
Submitted 15 November, 2019; v1 submitted 7 March, 2019;
originally announced March 2019.
-
Near neighbor preserving dimension reduction for doubling subsets of $\ell_1$
Authors:
Ioannis Z. Emiris,
Vasilis Margonis,
Ioannis Psarros
Abstract:
Randomized dimensionality reduction has been recognized as one of the fundamental techniques in handling high-dimensional data. Starting with the celebrated Johnson-Lindenstrauss Lemma, such reductions have been studied in depth for the Euclidean $(\ell_2)$ metric, but much less for the Manhattan $(\ell_1)$ metric. Our primary motivation is the approximate nearest neighbor problem in $\ell_1$. We exploit its reduction to the decision-with-witness version, called approximate \textit{near} neighbor, which incurs a roughly logarithmic overhead. In 2007, Indyk and Naor, in the context of approximate nearest neighbors, introduced the notion of nearest neighbor-preserving embeddings. These are randomized embeddings between two metric spaces with guaranteed bounded distortion only for the distances between a query point and a point set. Such embeddings are known to exist for both $\ell_2$ and $\ell_1$ metrics, as well as for doubling subsets of $\ell_2$. The case that remained open was that of doubling subsets of $\ell_1$. In this paper, we propose a dimension reduction by means of a \textit{near} neighbor-preserving embedding for doubling subsets of $\ell_1$. Our approach is to represent the pointset with a carefully chosen covering set, then randomly project the latter. We study two types of covering sets: $c$-approximate $r$-nets and randomly shifted grids, and we discuss the tradeoff between them in terms of preprocessing time and target dimension. We employ Cauchy variables: certain concentration bounds derived should be of independent interest.
Submitted 8 September, 2019; v1 submitted 23 February, 2019;
originally announced February 2019.
-
Products of Euclidean metrics and applications to proximity questions among curves
Authors:
Ioannis Z. Emiris,
Ioannis Psarros
Abstract:
The problem of Approximate Nearest Neighbor (ANN) search is fundamental in computer science and has benefited from significant progress in the past couple of decades. However, most work has been devoted to pointsets whereas complex shapes have not been sufficiently treated. Here, we focus on distance functions between discretized curves in Euclidean space: they appear in a wide range of applications, from road segments to time-series in general dimension. For $\ell_p$-products of Euclidean metrics, for any $p$, we design simple and efficient data structures for ANN, based on randomized projections, which are of independent interest. They serve to solve proximity problems under a notion of distance between discretized curves, which generalizes both discrete Fréchet and Dynamic Time Warping distances. These are the most popular and practical approaches to comparing such curves. We offer the first data structures and query algorithms for ANN with arbitrarily good approximation factor, at the expense of increasing space usage and preprocessing time over existing methods. Query time complexity is comparable or significantly improved by our algorithms; our algorithm is especially efficient when the length of the curves is bounded.
Submitted 13 April, 2020; v1 submitted 18 December, 2017;
originally announced December 2017.
-
Practical linear-space Approximate Near Neighbors in high dimension
Authors:
Georgia Avarikioti,
Ioannis Z. Emiris,
Ioannis Psarros,
Georgios Samaras
Abstract:
The $c$-approximate Near Neighbor problem in high dimensional spaces has been mainly addressed by Locality Sensitive Hashing (LSH), which offers polynomial dependence on the dimension, query time sublinear in the size of the dataset, and subquadratic space requirement. For practical applications, linear space is typically imperative. Most previous work in the linear space regime focuses on the case that $c$ exceeds $1$ by a constant term. In a recently accepted paper, optimal bounds have been achieved for any $c>1$ \cite{ALRW17}.
Towards practicality, we present a new and simple data structure using linear space and sublinear query time for any $c>1$ including $c\to 1^+$. Given an LSH family of functions for some metric space, we randomly project points to the Hamming cube of dimension $\log n$, where $n$ is the number of input points. The projected space contains strings which serve as keys for buckets containing the input points. The query algorithm simply projects the query point, then examines points which are assigned to the same or nearby vertices on the Hamming cube. We analyze in detail the query time for some standard LSH families.
To illustrate our claim of practicality, we offer an open-source implementation in {\tt C++}, and report on several experiments in dimension up to 1000 and $n$ up to $10^6$. Our algorithm is one to two orders of magnitude faster than brute force search. Experiments confirm the sublinear dependence on $n$ and the linear dependence on the dimension. We have compared against state-of-the-art LSH-based library {\tt FALCONN}: our search is somewhat slower, but memory usage and preprocessing time are significantly smaller.
Submitted 21 December, 2016;
originally announced December 2016.
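The bucket scheme described in the abstract above can be sketched roughly as follows: each point is mapped to a vertex of the Hamming cube of dimension about $\log n$, points are stored in buckets keyed by that vertex, and a query inspects its own vertex and nearby ones. In this sketch the bits come from random-hyperplane signs, used here only as a stand-in for a generic LSH family, and "nearby" is taken to mean Hamming distance at most 1; both choices, as well as the class name `HammingBuckets`, are assumptions for illustration and do not reflect the authors' C++ implementation.

```python
import math
import random
from collections import defaultdict


class HammingBuckets:
    def __init__(self, points, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        # Cube dimension of about log2(n), as in the abstract.
        self.bits = max(1, math.ceil(math.log2(len(points))))
        # One random hyperplane per bit (assumed stand-in for an LSH family).
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(self.bits)]
        self.points = points
        self.buckets = defaultdict(list)
        for idx, p in enumerate(points):
            self.buckets[self._key(p)].append(idx)

    def _key(self, p):
        """Encode the sign pattern of p as an integer (a cube vertex)."""
        key = 0
        for i, plane in enumerate(self.planes):
            if sum(a * x for a, x in zip(plane, p)) >= 0.0:
                key |= 1 << i
        return key

    def query(self, q, radius):
        """Return the index of some point within `radius` of q, or None."""
        key = self._key(q)
        # Same vertex first, then all vertices at Hamming distance 1.
        candidates = [key] + [key ^ (1 << i) for i in range(self.bits)]
        for vertex in candidates:
            for idx in self.buckets.get(vertex, ()):
                if math.dist(self.points[idx], q) <= radius:
                    return idx
        return None


rng = random.Random(1)
pts = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(64)]
index = HammingBuckets(pts)
```

Since only a few buckets are inspected, a query may miss a point that lies within the radius; the abstract's analysis of query time and success for standard LSH families is not reproduced here.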
-
High-dimensional approximate $r$-nets
Authors:
Georgia Avarikioti,
Ioannis Z. Emiris,
Loukas Kavouras,
Ioannis Psarros
Abstract:
The construction of $r$-nets offers a powerful tool in computational and metric geometry. We focus on high-dimensional spaces and present a new randomized algorithm which efficiently computes approximate $r$-nets with respect to Euclidean distance. For any fixed $ε>0$, the approximation factor is $1+ε$ and the complexity is polynomial in the dimension and subquadratic in the number of points. The algorithm succeeds with high probability. More specifically, the best previously known LSH-based construction of Eppstein et al.\ \cite{EHS15} is improved in terms of complexity by reducing the dependence on $ε$, provided that $ε$ is sufficiently small. Our method does not require LSH but, instead, follows Valiant's \cite{Val15} approach in designing a sequence of reductions of our problem to other problems in different spaces, under Euclidean distance or inner product, for which $r$-nets are computed efficiently and the error can be controlled. Our result immediately implies efficient solutions to a number of geometric problems in high dimension, such as finding the $(1+ε)$-approximate $k$th nearest neighbor distance in time subquadratic in the size of the input.
Submitted 6 May, 2017; v1 submitted 16 July, 2016;
originally announced July 2016.
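As background for the abstract above, an $r$-net of a point set is a subset whose points are pairwise more than $r$ apart and such that every input point lies within distance $r$ of some net point. The following minimal sketch builds an exact $r$-net with the naive greedy scan, which takes quadratic time in the worst case. It is meant only to illustrate the definition; it is not the paper's subquadratic randomized algorithm for approximate $r$-nets, and the function name is hypothetical.

```python
import math


def greedy_r_net(points, r):
    """Return an r-net of `points` under Euclidean distance (greedy scan)."""
    net = []
    for p in points:
        # Keep p only if it is farther than r from every net point so far.
        if all(math.dist(p, c) > r for c in net):
            net.append(p)
    return net


points = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0), (2.4, 0.0), (5.0, 0.0)]
net = greedy_r_net(points, 1.0)
```

For these five points and $r = 1$, the scan keeps `(0.0, 0.0)`, `(2.0, 0.0)` and `(5.0, 0.0)`; the other two points are each within distance 1 of a kept point.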
-
Randomized embeddings with slack, and high-dimensional Approximate Nearest Neighbor
Authors:
Evangelos Anagnostopoulos,
Ioannis Z. Emiris,
Ioannis Psarros
Abstract:
The approximate nearest neighbor problem ($ε$-ANN) in high dimensional Euclidean space has been mainly addressed by Locality Sensitive Hashing (LSH), which has polynomial dependence in the dimension, sublinear query time, but subquadratic space requirement. In this paper, we introduce a new definition of "low-quality" embeddings for metric spaces. It requires that, for some query point $q$, there exists an approximate nearest neighbor among the pre-images of the $k>1$ approximate nearest neighbors in the target space. Focusing on Euclidean spaces, we employ random projections in order to reduce the original problem to one in a space of dimension inversely proportional to $k$.
The $k$ approximate nearest neighbors can be efficiently retrieved by a data structure such as BBD-trees. The same approach is applied to the problem of computing an approximate near neighbor, where we obtain a data structure requiring linear space, and query time in $O(d n^ρ)$, for $ρ\approx 1-ε^2/\log(1/ε)$. This directly implies a solution for $ε$-ANN, while achieving a better exponent in the query time than the method based on BBD-trees. Better bounds are obtained in the case of doubling subsets of $\ell_2$, by combining our method with $r$-nets.
We implement our method in C++, and present experimental results in dimension up to $500$ and $10^6$ points, which show that performance is better than predicted by the analysis. In addition, we compare our ANN approach to E2LSH, which implements LSH, and we show that the theoretical advantages of each method are reflected on their actual performance.
Submitted 3 December, 2016; v1 submitted 4 December, 2014;
originally announced December 2014.
-
Counting Euclidean embeddings of rigid graphs
Authors:
Ioannis Z. Emiris,
Ioannis Psarros
Abstract:
A graph is called (generically) rigid in $\mathbb{R}^d$ if, for any choice of sufficiently generic edge lengths, it can be embedded in $\mathbb{R}^d$ in a finite number of distinct ways, modulo rigid transformations. Here we deal with the problem of determining the maximum number of planar Euclidean embeddings as a function of the number of vertices. We obtain polynomial systems which totally capture the structure of a given graph, by exploiting distance geometry theory. Consequently, counting the number of Euclidean embeddings of a given rigid graph reduces to the problem of counting roots of the corresponding polynomial system.
Submitted 25 January, 2017; v1 submitted 6 February, 2014;
originally announced February 2014.