-
Path Signature Representation of Patient-Clinician Interactions as a Predictor for Neuropsychological Tests Outcomes in Children: A Proof of Concept
Authors:
Giulio Falcioni,
Alexandra Georgescu,
Emilia Molimpakis,
Lev Gottlieb,
Taylor Kuhn,
Stefano Goria
Abstract:
This research report presents a proof-of-concept study on the application of machine learning techniques to video and speech data collected during diagnostic cognitive assessments of children with a neurodevelopmental disorder. The study utilised a dataset of 39 video recordings, capturing extensive sessions where clinicians administered, among other things, four cognitive assessment tests. From the first 40 minutes of each clinical session, covering the administration of the Wechsler Intelligence Scale for Children (WISC-V), we extracted head positions and speech turns of both clinician and child. Despite the limited sample size and heterogeneous recording styles, the analysis successfully extracted path signatures as features from the recorded data, focusing on patient-clinician interactions. Importantly, these features quantify the interpersonal dynamics of the assessment process (dialogue and movement patterns). Results suggest that these features exhibit promising potential for predicting all cognitive test scores from the entire session and for prototyping a predictive model as a clinical decision support tool. Overall, this proof of concept demonstrates the feasibility of leveraging machine learning techniques for clinical video and speech data analysis in order to potentially enhance the efficiency of cognitive assessments for neurodevelopmental disorders in children.
Submitted 12 December, 2023;
originally announced December 2023.
-
Weighted Distance Nearest Neighbor Condensing
Authors:
Lee-Ad Gottlieb,
Timor Sharabi,
Roi Weiss
Abstract:
The problem of nearest neighbor condensing has enjoyed a long history of study, both in its theoretical and practical aspects. In this paper, we introduce the problem of weighted distance nearest neighbor condensing, where one assigns weights to each point of the condensed set, and then new points are labeled based on their weighted distance nearest neighbor in the condensed set.
We study the theoretical properties of this new model, and show that it can produce dramatically better condensing than the standard nearest neighbor rule, yet is characterized by generalization bounds almost identical to the latter. We then suggest a condensing heuristic for our new problem. We demonstrate Bayes consistency for this heuristic, and also show promising empirical results.
Submitted 24 October, 2023;
originally announced October 2023.
-
Using Deepfake Technologies for Word Emphasis Detection
Authors:
Eran Kaufman,
Lee-Ad Gottlieb
Abstract:
In this work, we consider the task of automated emphasis detection for spoken language. This problem is challenging in that emphasis is affected by the particularities of the subject's speech, for example the subject's accent, dialect or voice. To address this task, we propose to utilize deepfake technology to produce emphasis-devoid speech for this speaker. This requires extracting the text of the spoken voice, and then using a voice sample from the same speaker to produce emphasis-devoid speech for this task. By comparing the generated speech with the spoken voice, we are able to isolate patterns of emphasis which are relatively easy to detect.
Submitted 12 May, 2023;
originally announced May 2023.
-
Functions with average smoothness: structure, algorithms, and learning
Authors:
Yair Ashlagi,
Lee-Ad Gottlieb,
Aryeh Kontorovich
Abstract:
We initiate a program of average smoothness analysis for efficiently learning real-valued functions on metric spaces. Rather than using the Lipschitz constant as the regularizer, we define a local slope at each point and gauge the function complexity as the average of these values. Since the mean can be dramatically smaller than the maximum, this complexity measure can yield considerably sharper generalization bounds -- assuming that these admit a refinement where the Lipschitz constant is replaced by our average of local slopes.
Our first major contribution is to obtain just such distribution-sensitive bounds. This required overcoming a number of technical challenges, perhaps the most formidable of which was bounding the {\em empirical} covering numbers, which can be much worse-behaved than the ambient ones. Our combinatorial results are accompanied by efficient algorithms for smoothing the labels of the random sample, as well as guarantees that the extension from the sample to the whole space will continue to be, with high probability, smooth on average. Along the way we discover a surprisingly rich combinatorial and analytic structure in the function class we define.
Submitted 8 November, 2020; v1 submitted 13 July, 2020;
originally announced July 2020.
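To make the contrast in the abstract above concrete, here is a minimal sketch comparing the maximum and the average of per-point slopes on a small sample. The empirical definition used here (largest difference quotient against the other sample points) is an assumption made for illustration, as are the function name `local_slopes` and the toy data; the paper's exact definition may differ.

```python
def local_slopes(points, values, dist):
    """Empirical local slope at each sample point (illustrative assumption):
    the largest ratio |f(x) - f(y)| / d(x, y) over the other sample points y."""
    slopes = []
    for i, x in enumerate(points):
        best = 0.0
        for j, y in enumerate(points):
            if i == j:
                continue
            d = dist(x, y)
            if d > 0:
                best = max(best, abs(values[i] - values[j]) / d)
        slopes.append(best)
    return slopes


# Toy data: a function that is flat except for one sharp jump at the end.
points = [0.0, 1.0, 2.0, 3.0, 4.0, 4.5]
values = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

slopes = local_slopes(points, values, lambda a, b: abs(a - b))
lipschitz_estimate = max(slopes)            # worst-case slope on the sample
average_slope = sum(slopes) / len(slopes)   # average of the local slopes
```

On this sample the worst-case slope is set entirely by the final jump, while most points have small local slopes, so the average comes out below the maximum.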
-
Faster Algorithms for Orienteering and $k$-TSP
Authors:
Lee-Ad Gottlieb,
Robert Krauthgamer,
Havana Rika
Abstract:
We consider the rooted orienteering problem in Euclidean space: Given $n$ points $P$ in $\mathbb R^d$, a root point $s\in P$ and a budget $\mathcal B>0$, find a path that starts from $s$, has total length at most $\mathcal B$, and visits as many points of $P$ as possible. This problem is known to be NP-hard, hence we study $(1-δ)$-approximation algorithms. The previous Polynomial-Time Approximation Scheme (PTAS) for this problem, due to Chen and Har-Peled (2008), runs in time $n^{O(d\sqrt{d}/δ)}(\log n)^{(d/δ)^{O(d)}}$, and improving on this time bound was left as an open problem. Our main contribution is a PTAS with a significantly improved time complexity of $n^{O(1/δ)}(\log n)^{(d/δ)^{O(d)}}$.
A known technique for approximating the orienteering problem is to reduce it to solving $1/δ$ correlated instances of rooted $k$-TSP (a $k$-TSP tour is one that visits at least $k$ points). However, the $k$-TSP tours in this reduction must achieve a certain excess guarantee (namely, their length can surpass the optimum length only in proportion to a parameter of the optimum called excess) that is stronger than the usual $(1+δ)$-approximation. Our main technical contribution is to improve the running time of these $k$-TSP variants, particularly in its dependence on the dimension $d$. Indeed, our running time is polynomial even for a moderately large dimension, roughly up to $d=O(\log\log n)$ instead of $d=O(1)$.
Submitted 21 April, 2022; v1 submitted 18 February, 2020;
originally announced February 2020.
-
Nested Barycentric Coordinate System as an Explicit Feature Map
Authors:
Lee-Ad Gottlieb,
Eran Kaufman,
Aryeh Kontorovich,
Gabriel Nivasch,
Ofir Pele
Abstract:
We propose a new embedding method which is particularly well-suited for settings where the sample size greatly exceeds the ambient dimension. Our technique consists of partitioning the space into simplices and then embedding the data points into features corresponding to the simplices' barycentric coordinates. We then train a linear classifier in the rich feature space obtained from the simplices. The decision boundary may be highly non-linear, though it is linear within each simplex (and hence piecewise-linear overall). Further, our method can approximate any convex body. We give generalization bounds based on empirical margin and a novel hybrid sample compression technique. An extensive empirical evaluation shows that our method consistently outperforms a range of popular kernel embedding methods.
Submitted 5 February, 2020;
originally announced February 2020.
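As a small illustration of the feature map mentioned in the abstract above, the following sketch computes the barycentric coordinates of a point with respect to a single triangle in the plane. It does not reproduce the paper's nested partition or its classifier; the function name `barycentric_coordinates` and the example triangle are hypothetical.

```python
def barycentric_coordinates(p, a, b, c):
    """Return weights (wa, wb, wc) with wa + wb + wc = 1 such that
    p = wa * a + wb * b + wc * c, for a non-degenerate triangle a, b, c."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    # Solve p - a = wb * (b - a) + wc * (c - a) by Cramer's rule.
    det = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)
    if det == 0:
        raise ValueError("degenerate triangle")
    wb = ((px - ax) * (cy - ay) - (cx - ax) * (py - ay)) / det
    wc = ((bx - ax) * (py - ay) - (px - ax) * (by - ay)) / det
    wa = 1.0 - wb - wc
    return (wa, wb, wc)


# Hypothetical example: the unit right triangle and a point inside it.
triangle = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
features = barycentric_coordinates((0.25, 0.25), *triangle)
inside = all(w >= 0 for w in features)  # all weights non-negative inside the simplex
```

A point lies in the simplex exactly when all its weights are non-negative, which is what makes these coordinates usable as per-simplex features.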
-
Apportioned Margin Approach for Cost Sensitive Large Margin Classifiers
Authors:
Lee-Ad Gottlieb,
Eran Kaufman,
Aryeh Kontorovich
Abstract:
We consider the problem of cost sensitive multiclass classification, where we would like to increase the sensitivity of an important class at the expense of a less important one. We adopt an {\em apportioned margin} framework to address this problem, which enables an efficient margin shift between classes that share the same boundary. The decision boundary between all pairs of classes divides the margin between them in accordance to a given prioritization vector, which yields a tighter error bound for the important classes while also reducing the overall out-of-sample error. In addition to demonstrating an efficient implementation of our framework, we derive generalization bounds, demonstrate Fisher consistency, adapt the framework to Mercer's kernel and to neural networks, and report promising empirical results on all accounts.
Submitted 4 February, 2020;
originally announced February 2020.
-
Classification in asymmetric spaces via sample compression
Authors:
Lee-Ad Gottlieb,
Shira Ozeri
Abstract:
We initiate the rigorous study of classification in quasi-metric spaces. These are point sets endowed with a distance function that is non-negative and also satisfies the triangle inequality, but is asymmetric. We develop and refine a learning algorithm for quasi-metrics based on sample compression and nearest neighbor, and prove that it has favorable statistical properties.
Submitted 22 September, 2019;
originally announced September 2019.
-
Labelings vs. Embeddings: On Distributed Representations of Distances
Authors:
Arnold Filtser,
Lee-Ad Gottlieb,
Robert Krauthgamer
Abstract:
We investigate for which metric spaces the performance of distance labeling and of $\ell_\infty$-embeddings differ, and how significant can this difference be. Recall that a distance labeling is a distributed representation of distances in a metric space $(X,d)$, where each point $x\in X$ is assigned a succinct label, such that the distance between any two points $x,y \in X$ can be approximated given only their labels. A highly structured special case is an embedding into $\ell_\infty$, where each point $x\in X$ is assigned a vector $f(x)$ such that $\|f(x)-f(y)\|_\infty$ is approximately $d(x,y)$. The performance of a distance labeling or an $\ell_\infty$-embedding is measured via its distortion and its label-size/dimension.
We also study the analogous question for the prioritized versions of these two measures. Here, a priority order $π=(x_1,\dots,x_n)$ of the point set $X$ is given, and higher-priority points should have shorter labels. Formally, a distance labeling has prioritized label-size $α(\cdot)$ if every $x_j$ has label size at most $α(j)$. Similarly, an embedding $f: X \to \ell_\infty$ has prioritized dimension $α(\cdot)$ if $f(x_j)$ is non-zero only in the first $α(j)$ coordinates. In addition, we compare these prioritized measures to their classical (worst-case) versions.
We answer these questions in several scenarios, uncovering a surprisingly diverse range of behaviors. First, in some cases labelings and embeddings have very similar worst-case performance, but in other cases there is a huge disparity. However in the prioritized setting, we most often find a strict separation between the performance of labelings and embeddings. And finally, when comparing the classical and prioritized settings, we find that the worst-case bound for label size often "translates" to a prioritized one, but also find a surprising exception to this rule.
Submitted 20 September, 2023; v1 submitted 16 July, 2019;
originally announced July 2019.
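For readers unfamiliar with $\ell_\infty$-embeddings as described in the abstract above, here is a sketch of the standard Fréchet embedding, which maps each point to its vector of distances to all points and preserves distances exactly in any finite metric space. This is a textbook construction shown for orientation, not a result or method of the paper; the names `frechet_embedding`, `linf_distance`, and the four-point cycle metric are chosen for illustration.

```python
def frechet_embedding(points, dist):
    """Map each point x to the vector f(x) = (d(x, z) for every z in points)."""
    return {x: [dist(x, z) for z in points] for x in points}


def linf_distance(u, v):
    """The l_infinity distance between two equal-length vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


# Hypothetical example metric: shortest-path distance on a 4-cycle.
points = [0, 1, 2, 3]


def cycle_dist(i, j):
    gap = abs(i - j)
    return min(gap, 4 - gap)


f = frechet_embedding(points, cycle_dist)
# By the triangle inequality, |d(x, z) - d(y, z)| <= d(x, y), with equality at z = y,
# so the l_infinity distance between f(x) and f(y) equals d(x, y).
isometric = all(
    linf_distance(f[x], f[y]) == cycle_dist(x, y) for x in points for y in points
)
```

The price of this exactness is dimension equal to the number of points, which is the kind of label-size/dimension cost the abstract's comparison is concerned with.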
-
Near-linear time approximation schemes for Steiner tree and forest in low-dimensional spaces
Authors:
Lee-Ad Gottlieb,
Yair Bartal
Abstract:
We give an algorithm that computes a $(1+ε)$-approximate Steiner forest in near-linear time $n \cdot 2^{(1/ε)^{O(ddim^2)} (\log \log n)^2}$. This is a dramatic improvement upon the best previous result due to Chan et al., who gave a runtime of $n^{2^{O(ddim)}} \cdot 2^{(ddim/ε)^{O(ddim)} \sqrt{\log n}}$.
For Steiner tree our methods achieve an even better runtime $n (\log n)^{(1/ε)^{O(ddim^2)}}$ in doubling spaces. For Euclidean space the runtime can be reduced to $2^{(1/ε)^{O(d^2)}} n \log n$, improving upon the result of Arora in fixed dimension $d$.
Submitted 7 April, 2019;
originally announced April 2019.
-
Learning convex polyhedra with margin
Authors:
Lee-Ad Gottlieb,
Eran Kaufman,
Aryeh Kontorovich,
Gabriel Nivasch
Abstract:
We present an improved algorithm for {\em quasi-properly} learning convex polyhedra in the realizable PAC setting from data with a margin. Our learning algorithm constructs a consistent polyhedron as an intersection of about $t \log t$ halfspaces with constant-size margins in time polynomial in $t$ (where $t$ is the number of halfspaces forming an optimal polyhedron). We also identify distinct generalizations of the notion of margin from hyperplanes to polyhedra and investigate how they relate geometrically; this result may have ramifications beyond the learning setting.
Submitted 2 November, 2021; v1 submitted 24 May, 2018;
originally announced May 2018.
-
Approximate nearest neighbor search for $\ell_p$-spaces ($2 < p < \infty$) via embeddings
Authors:
Yair Bartal,
Lee-Ad Gottlieb
Abstract:
While the problem of approximate nearest neighbor search has been well-studied for Euclidean space and $\ell_1$, few non-trivial algorithms are known for $\ell_p$ when ($2 < p < \infty$). In this paper, we revisit this fundamental problem and present approximate nearest-neighbor search algorithms which give the first non-trivial approximation factor guarantees in this setting.
Submitted 6 December, 2015;
originally announced December 2015.
-
A light metric spanner
Authors:
Lee-Ad Gottlieb
Abstract:
It has long been known that $d$-dimensional Euclidean point sets admit $(1+ε)$-stretch spanners with lightness $W_E = ε^{-O(d)}$, that is, total edge weight at most $W_E$ times the weight of the minimum spanning tree of the set [DHN93]. Whether or not a similar result holds for metric spaces with low doubling dimension has remained an important open problem, and has resisted numerous attempts at resolution. In this paper, we resolve the question in the affirmative, and show that doubling spaces admit $(1+ε)$-stretch spanners with lightness $W_D = (ddim/ε)^{O(ddim)}$.
Important in its own right, our result also implies a much faster polynomial-time approximation scheme for the traveling salesman problem in doubling metric spaces, improving upon the bound presented in [BGK-12].
Submitted 14 May, 2015;
originally announced May 2015.
-
The YLI-MED Corpus: Characteristics, Procedures, and Plans
Authors:
Julia Bernd,
Damian Borth,
Benjamin Elizalde,
Gerald Friedland,
Heather Gallagher,
Luke Gottlieb,
Adam Janin,
Sara Karabashlieva,
Jocelyn Takahashi,
Jennifer Won
Abstract:
The YLI Multimedia Event Detection corpus is a public-domain index of videos with annotations and computed features, specialized for research in multimedia event detection (MED), i.e., automatically identifying what's happening in a video by analyzing the audio and visual content. The videos indexed in the YLI-MED corpus are a subset of the larger YLI feature corpus, which is being developed by the International Computer Science Institute and Lawrence Livermore National Laboratory based on the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset. The videos in YLI-MED are categorized as depicting one of ten target events, or no target event, and are annotated for additional attributes like language spoken and whether the video has a musical score. The annotations also include degree of annotator agreement and average annotator confidence scores for the event categorization of each video. Version 1.0 of YLI-MED includes 1823 "positive" videos that depict the target events and 48,138 "negative" videos, as well as 177 supplementary videos that are similar to event videos but are not positive examples. Our goal in producing YLI-MED is to be as open about our data and procedures as possible. This report describes the procedures used to collect the corpus; gives detailed descriptive statistics about the corpus makeup (and how video attributes affected annotators' judgments); discusses possible biases in the corpus introduced by our procedural choices and compares it with the most similar existing dataset, TRECVID MED's HAVIC corpus; and gives an overview of our future plans for expanding the annotation effort.
Submitted 13 March, 2015;
originally announced March 2015.
-
Nearly optimal classification for semimetrics
Authors:
Lee-Ad Gottlieb,
Aryeh Kontorovich
Abstract:
We initiate the rigorous study of classification in semimetric spaces, which are point sets with a distance function that is non-negative and symmetric, but need not satisfy the triangle inequality. For metric spaces, the doubling dimension essentially characterizes both the runtime and sample complexity of classification algorithms --- yet we show that this is not the case for semimetrics. Instead, we define the {\em density dimension} and discover that it plays a central role in the statistical and algorithmic feasibility of learning in semimetric spaces. We present nearly optimal sample compression algorithms and use these to obtain generalization guarantees, including fast rates. The latter hold for general sample compression schemes and may be of independent interest.
Submitted 22 February, 2015;
originally announced February 2015.
-
Dimension reduction techniques for $\ell_p$, $1 \le p \le 2$, with applications
Authors:
Yair Bartal,
Lee-Ad Gottlieb
Abstract:
For Euclidean space ($\ell_2$), there exists the powerful dimension reduction transform of Johnson and Lindenstrauss, with a host of known applications. Here, we consider the problem of dimension reduction for all $\ell_p$ spaces $1 \le p \le 2$. Although strong lower bounds are known for dimension reduction in $\ell_1$, Ostrovsky and Rabani successfully circumvented these by presenting an $\ell_1$ embedding that maintains fidelity in only a bounded distance range, with applications to clustering and nearest neighbor search. However, their embedding techniques are specific to $\ell_1$ and do not naturally extend to other norms.
In this paper, we apply a range of advanced techniques and produce bounded range dimension reduction embeddings for all of $1 \le p \le 2$, thereby demonstrating that the approach initiated by Ostrovsky and Rabani for $\ell_1$ can be extended to a much more general framework. We also obtain improved bounds in terms of the intrinsic dimensionality. As a result we achieve improved bounds for proximity problems including snowflake embeddings and clustering.
Submitted 6 December, 2015; v1 submitted 8 August, 2014;
originally announced August 2014.
-
Optimizing Budget Allocation in Graphs
Authors:
Boaz Ben-Moshe,
Michael Elkin,
Lee-Ad Gottlieb,
Eran Omri
Abstract:
In the classical facility location problem we consider a graph $G$ with fixed weights on the edges of $G$. The goal is then to find an optimal positioning for a set of facilities on the graph with respect to some objective function. We introduce a new framework for facility location problems, where the weights on the graph edges are not fixed, but rather should be assigned. The goal is to find a valid assignment for which the resulting weighted graph optimizes the facility location objective function. We present algorithms for finding the optimal {\em budget allocation} for the center point problem and for the median point problem on trees. Our algorithms run in linear time, both for the case where a candidate vertex is given as part of the input, and for the case where finding a vertex that optimizes the solution is part of the problem. We also present a hardness result for the general graph case of the center point problem, followed by an $O(\log^2(n))$ approximation algorithm on graphs - with general metric spaces.
△ Less
Submitted 9 June, 2014;
originally announced June 2014.
-
Near-optimal sample compression for nearest neighbors
Authors:
Lee-Ad Gottlieb,
Aryeh Kontorovich,
Pinhas Nisnevitch
Abstract:
We present the first sample compression algorithm for nearest neighbors with non-trivial performance guarantees. We complement these guarantees by demonstrating almost matching hardness lower bounds, which show that our bound is nearly optimal. Our result yields new insight into margin-based nearest neighbor classification in metric spaces and allows us to significantly sharpen and simplify existing bounds. Some encouraging empirical results are also presented.
Submitted 26 March, 2018; v1 submitted 13 April, 2014;
originally announced April 2014.
-
Light spanners for snowflake metrics
Authors:
Lee-Ad Gottlieb,
Shay Solomon
Abstract:
A classic result in the study of spanners is the existence of light low-stretch spanners for Euclidean spaces. These spanners have arbitrarily low stretch, and weight only a constant factor greater than that of the minimum spanning tree of the points (with dependence on the stretch and Euclidean dimension). A central open problem in this field asks whether other spaces admit low weight spanners as well - for example metric spaces with low intrinsic dimension - yet only a handful of results of this type are known.
In this paper, we consider snowflake metric spaces of low intrinsic dimension. The α-snowflake of a metric (X,δ) is the metric (X,$δ^α$) for 0<α<1. By utilizing an approach completely different than those used for Euclidean spaces, we demonstrate that snowflake metrics admit light spanners. Further, we show that the spanner is of diameter $O(\log n)$, a result not possible for Euclidean spaces. As an immediate corollary to our spanner, we obtain dramatic improvements in algorithms for the traveling salesman problem in this setting, achieving a polynomial-time approximation scheme with near-linear runtime. Along the way, we show that all $\ell_p$ spaces admit light spanners, a result of interest in its own right.
Submitted 20 January, 2014;
originally announced January 2014.
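The α-snowflake defined in the abstract above is easy to state in code. The sketch below raises a base distance to the power α and checks the triangle inequality by brute force on a small point set; the helper names and the example points on the real line are assumptions made for illustration, and nothing here reflects the paper's spanner construction.

```python
def snowflake(dist, alpha):
    """Return the alpha-snowflake of a distance function: (x, y) -> dist(x, y) ** alpha."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return lambda x, y: dist(x, y) ** alpha


def satisfies_triangle_inequality(points, dist, tol=1e-12):
    """Brute-force check of d(x, z) <= d(x, y) + d(y, z) over all triples."""
    return all(
        dist(x, z) <= dist(x, y) + dist(y, z) + tol
        for x in points
        for y in points
        for z in points
    )


# Hypothetical example: points on the real line with the usual distance.
points = [0.0, 1.0, 4.0, 9.0]
line_dist = lambda a, b: abs(a - b)
half_snowflake = snowflake(line_dist, 0.5)

still_metric = satisfies_triangle_inequality(points, half_snowflake)
```

With α = 1/2 the distance between 0 and 4 becomes 2 while the two hops 0 to 1 and 1 to 4 cost 1 + √3, so collinear points no longer lie on a shortest path; the triangle inequality itself still holds.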
-
On the Impossibility of Dimension Reduction for Doubling Subsets of $\ell_p$, $p>2$
Authors:
Yair Bartal,
Lee-Ad Gottlieb,
Ofer Neiman
Abstract:
A major open problem in the field of metric embedding is the existence of dimension reduction for $n$-point subsets of Euclidean space, such that both distortion and dimension depend only on the {\em doubling constant} of the pointset, and not on its cardinality. In this paper, we negate this possibility for $\ell_p$ spaces with $p>2$. In particular, we introduce an $n$-point subset of $\ell_p$ with doubling constant O(1), and demonstrate that any embedding of the set into $\ell_p^d$ with distortion $D$ must have $D\geΩ\left(\left(\frac{c\log n}{d}\right)^{\frac{1}{2}-\frac{1}{p}}\right)$.
Submitted 22 August, 2013;
originally announced August 2013.
-
Efficient Classification for Metric Data
Authors:
Lee-Ad Gottlieb,
Aryeh Kontorovich,
Robert Krauthgamer
Abstract:
Recent advances in large-margin classification of data residing in general metric spaces (rather than Hilbert spaces) enable classification under various natural metrics, such as string edit and earthmover distance. A general framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004] left open the questions of computational efficiency and of providing direct bounds on generalization error.
We design a new algorithm for classification in general metric spaces, whose runtime and accuracy depend on the doubling dimension of the data points, and can thus achieve superior classification performance in many common scenarios. The algorithmic core of our approach is an approximate (rather than exact) solution to the classical problems of Lipschitz extension and of Nearest Neighbor Search. The algorithm's generalization performance is guaranteed via the fat-shattering dimension of Lipschitz classifiers, and we present experimental evidence of its superiority to some common kernel methods. As a by-product, we offer a new perspective on the nearest neighbor classifier, which yields significantly sharper risk asymptotics than the classic analysis of Cover and Hart [IEEE Trans. Info. Theory, 1967].
Submitted 10 July, 2014; v1 submitted 11 June, 2013;
originally announced June 2013.
-
Adaptive Metric Dimensionality Reduction
Authors:
Lee-Ad Gottlieb,
Aryeh Kontorovich,
Robert Krauthgamer
Abstract:
We study adaptive data-dependent dimensionality reduction in the context of supervised learning in general metric spaces. Our main statistical contribution is a generalization bound for Lipschitz functions in metric spaces that are doubling, or nearly doubling. On the algorithmic front, we describe an analogue of PCA for metric spaces: namely an efficient procedure that approximates the data's intrinsic dimension, which is often much lower than the ambient dimension. Our approach thus leverages the dual benefits of low dimensionality: (1) more efficient algorithms, e.g., for proximity search, and (2) more optimistic generalization bounds.
Submitted 25 March, 2015; v1 submitted 12 February, 2013;
originally announced February 2013.
-
The Traveling Salesman Problem: Low-Dimensionality Implies a Polynomial Time Approximation Scheme
Authors:
Yair Bartal,
Lee-Ad Gottlieb,
Robert Krauthgamer
Abstract:
The Traveling Salesman Problem (TSP) is among the most famous NP-hard optimization problems. We design for this problem a randomized polynomial-time algorithm that computes a (1+eps)-approximation to the optimal tour, for any fixed eps>0, in TSP instances that form an arbitrary metric space with bounded intrinsic dimension.
The celebrated results of Arora (A-98) and Mitchell (M-99) prove that the above result holds in the special case of TSP in a fixed-dimensional Euclidean space. Thus, our algorithm demonstrates that the algorithmic tractability of metric TSP depends on the dimensionality of the space and not on its specific geometry. This result resolves a problem that has been open since the quasi-polynomial time algorithm of Talwar (T-04).
Submitted 9 April, 2015; v1 submitted 3 December, 2011;
originally announced December 2011.
-
Efficient Regression in Metric Spaces via Approximate Lipschitz Extension
Authors:
Lee-Ad Gottlieb,
Aryeh Kontorovich,
Robert Krauthgamer
Abstract:
We present a framework for performing efficient regression in general metric spaces. Roughly speaking, our regressor predicts the value at a new point by computing a Lipschitz extension --- the smoothest function consistent with the observed data --- after performing structural risk minimization to avoid overfitting. We obtain finite-sample risk bounds with minimal structural and noise assumptions, and a natural speed-precision tradeoff. The offline (learning) and online (prediction) stages can be solved by convex programming, but this naive approach has runtime complexity $O(n^3)$, which is prohibitive for large datasets. We design instead a regression algorithm whose speed and generalization performance depend on the intrinsic dimension of the data, to which the algorithm adapts. While our main innovation is algorithmic, the statistical results may also be of independent interest.
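As a rough illustration of what "computing a Lipschitz extension" means, the sketch below implements the classical exact McShane-Whitney extension in Python. It is not the approximate algorithm described in the abstract, and the names `lipschitz_extension`, `points`, `values`, `dist` and `L` are assumptions chosen for this example. Given samples that are consistent with a Lipschitz constant `L`, the prediction at a new point is the midpoint of the largest and smallest values any `L`-Lipschitz function through the data could take there.

```python
def lipschitz_extension(points, values, dist, L):
    """Return a predictor built from the classical McShane-Whitney extension.

    Illustrative sketch only: each query scans all samples, so it costs O(n)
    per prediction and does no structural risk minimization.
    """
    samples = list(zip(points, values))

    def predict(x):
        # Smallest upper envelope and largest lower envelope at x.
        upper = min(y + L * dist(x, p) for p, y in samples)
        lower = max(y - L * dist(x, p) for p, y in samples)
        return 0.5 * (upper + lower)

    return predict


# Hypothetical 1-D data that is 1-Lipschitz under the absolute-value metric.
f = lipschitz_extension(
    points=[0.0, 1.0, 3.0],
    values=[0.0, 1.0, 0.0],
    dist=lambda a, b: abs(a - b),
    L=1.0,
)
```

When the samples are consistent with `L`, the predictor reproduces the observed values at the sample points and is itself `L`-Lipschitz everywhere else.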
Submitted 24 April, 2017; v1 submitted 18 November, 2011;
originally announced November 2011.
-
Matrix sparsification and the sparse null space problem
Authors:
Lee-Ad Gottlieb,
Tyler Neylon
Abstract:
We revisit the matrix problems sparse null space and matrix sparsification, and show that they are equivalent. We then proceed to seek algorithms for these problems: We prove the hardness of approximation of these problems, and also give a powerful tool to extend algorithms and heuristics for sparse approximation theory to these problems.
Submitted 9 August, 2010;
originally announced August 2010.
-
Fast, precise and dynamic distance queries
Authors:
Yair Bartal,
Lee-Ad Gottlieb,
Tsvi Kopelowitz,
Moshe Lewenstein,
Liam Roditty
Abstract:
We present an approximate distance oracle for a point set S with n points and doubling dimension λ. For every ε>0, the oracle supports (1+ε)-approximate distance queries in (universal) constant time, occupies space [ε^{-O(λ)} + 2^{O(λ log λ)}]n, and can be constructed in [2^{O(λ)} log^3 n + ε^{-O(λ)} + 2^{O(λ log λ)}]n expected time. This improves upon the best previously known constructions, presented by Har-Peled and Mendel. Furthermore, the oracle can be made fully dynamic with expected O(1) query time and only 2^{O(λ)} log n + ε^{-O(λ)} + 2^{O(λ log λ)} update time. This is the first fully dynamic (1+ε)-distance oracle.
Submitted 9 August, 2010;
originally announced August 2010.
-
A Nonlinear Approach to Dimension Reduction
Authors:
Lee-Ad Gottlieb,
Robert Krauthgamer
Abstract:
The $l_2$ flattening lemma of Johnson and Lindenstrauss [JL84] is a powerful tool for dimension reduction. It has been conjectured that the target dimension bounds can be refined and bounded in terms of the intrinsic dimensionality of the data set (for example, the doubling dimension). One such problem was proposed by Lang and Plaut [LP01] (see also [GKL03,MatousekProblems07,ABN08,CGT10]), and is still open. We prove another result in this line of work:
The snowflake metric $d^{1/2}$ of a doubling set $S \subset l_2$ embeds with constant distortion into $l_2^D$, for dimension $D$ that depends solely on the doubling constant of the metric. In fact, the distortion can be made arbitrarily close to 1, and the target dimension is polylogarithmic in the doubling constant. Our techniques are robust and extend to the more difficult spaces $l_1$ and $l_\infty$, although the dimension bounds here are quantitatively inferior to those for $l_2$.
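For readers unfamiliar with the term, the following Python sketch shows what the snowflake $d^{1/2}$ of a metric looks like in practice. It only illustrates the definition and is unrelated to the embedding construction in the paper; the helper name `snowflake` and the sample points are assumptions made for this example. Raising a metric to a power in (0, 1] preserves the triangle inequality, which the snippet checks on a handful of points.

```python
import itertools
import math


def snowflake(dist, alpha=0.5):
    """Return the alpha-snowflake of a metric: (a, b) -> dist(a, b) ** alpha."""
    return lambda a, b: dist(a, b) ** alpha


# Euclidean distance on a few hypothetical points in the plane.
half = snowflake(math.dist)
pts = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (1.0, 3.0)]

# The snowflaked distances still satisfy the triangle inequality.
triangle_ok = all(
    half(a, c) <= half(a, b) + half(b, c) + 1e-12
    for a, b, c in itertools.permutations(pts, 3)
)
```

Snowflaking shrinks large distances more than small ones: two points at Euclidean distance 4 are at snowflake distance 2.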
Submitted 14 May, 2015; v1 submitted 31 July, 2009;
originally announced July 2009.