-
From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection
Authors:
Vaggos Chatziafratis,
Ishani Karmarkar,
Ellen Vitercik
Abstract:
In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We provide theoretical size generalization guarantees for three classic clustering algorithms: single-linkage, k-means++, and (a smoothed variant of) Gonzalez's k-centers heuristic. We validate our theoretical analysis with empirical results, observing that on real-world clustering instances, we can use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.
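The three-step procedure in this abstract (subsample, evaluate candidates on the subsample, pick the winner) can be sketched as follows. This is only an illustration under loud assumptions: the two 1-D "algorithms" are hypothetical toy stand-ins rather than single-linkage, k-means++, or the k-centers heuristic, the `oracle` callable is a made-up interface to the ground truth, and nothing here reproduces the paper's guarantees.

```python
import random
from itertools import permutations

def accuracy(pred, truth, k):
    """Best fraction of agreeing labels over all relabelings of the k clusters."""
    best = 0
    for perm in permutations(range(k)):
        best = max(best, sum(perm[p] == t for p, t in zip(pred, truth)))
    return best / len(truth)

def largest_gap_split(xs, k=2):
    """Toy 1-D clusterer (k = 2 only): cut at the largest gap between sorted values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    gaps = [xs[order[t + 1]] - xs[order[t]] for t in range(len(xs) - 1)]
    cut = gaps.index(max(gaps))
    labels = [0] * len(xs)
    for t in range(cut + 1, len(xs)):
        labels[order[t]] = 1
    return labels

def median_split(xs, k=2):
    """Toy 1-D clusterer (k = 2 only): lower half versus upper half."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    labels = [0] * len(xs)
    for t in range(len(xs) // 2, len(xs)):
        labels[order[t]] = 1
    return labels

def select_on_subsample(points, oracle, algorithms, k, fraction, seed=0):
    """Steps (1)-(3): subsample, score each candidate on the sample, return the best."""
    rng = random.Random(seed)
    m = max(k, int(fraction * len(points)))
    idx = rng.sample(range(len(points)), m)
    sample = [points[i] for i in idx]
    truth = [oracle(i) for i in idx]  # only m ground-truth queries are spent
    scores = {name: accuracy(alg(sample, k), truth, k)
              for name, alg in algorithms.items()}
    return max(scores, key=scores.get), scores

# Two well-separated 1-D clusters of unequal size (80 and 20 points).
points = [0.01 * i for i in range(80)] + [10 + 0.01 * i for i in range(20)]
oracle = lambda i: 0 if i < 80 else 1
candidates = {"largest_gap": largest_gap_split, "median": median_split}
best, scores = select_on_subsample(points, oracle, candidates, k=2, fraction=0.2)
print(best, scores)
```

The sample size, seed, and data are arbitrary choices for the demonstration; the question the paper studies is when the ranking observed on the sample provably matches the ranking on the full instance.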
Submitted 25 February, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
-
Dimension-Accuracy Tradeoffs in Contrastive Embeddings for Triplets, Terminals & Top-k Nearest Neighbors
Authors:
Vaggos Chatziafratis,
Piotr Indyk
Abstract:
Metric embeddings traditionally study how to map $n$ items to a target metric space such that distance lengths are not heavily distorted; but what if we only care to preserve the relative order of the distances (and not their length)? In this paper, we are motivated by the following basic question: given triplet comparisons of the form ``item $i$ is closer to item $j$ than to item $k$,'' can we find low-dimensional Euclidean representations for the $n$ items that respect those distance comparisons? Such order-preserving embeddings naturally arise in important applications and have been studied since the 1950s, under the name of ordinal or non-metric embeddings. Our main results are:
1. Nearly-Tight Bounds on Triplet Dimension: We introduce the natural concept of triplet dimension of a dataset, and surprisingly, we show that in order for an ordinal embedding to be triplet-preserving, its dimension needs to grow as $\frac n2$ in the worst case. This is optimal (up to constant) as $n-1$ dimensions always suffice.
2. Tradeoffs for Dimension vs (Ordinal) Relaxation: We then relax the requirement that every triplet should be exactly preserved and present almost tight lower bounds for the maximum ratio between distances whose relative order was inverted by the embedding; this ratio is known as (ordinal) relaxation in the literature and serves as a counterpart to (metric) distortion.
3. New Bounds on Terminal and Top-$k$-NNs Embeddings: Going beyond triplets, we then study two well-motivated scenarios where we care about preserving specific sets of distances (not necessarily triplets). The first scenario is Terminal Ordinal Embeddings and the second scenario is top-$k$-NNs Ordinal Embeddings.
To the best of our knowledge, these are some of the first tradeoffs on triplet-preserving ordinal embeddings and the first study of Terminal and Top-$k$-NNs Ordinal Embeddings.
Submitted 29 December, 2023; v1 submitted 20 December, 2023;
originally announced December 2023.
-
Triplet Reconstruction and all other Phylogenetic CSPs are Approximation Resistant
Authors:
Vaggos Chatziafratis,
Konstantin Makarychev
Abstract:
We study the natural problem of Triplet Reconstruction (also Rooted Triplets Consistency or Triplet Clustering), originally motivated in computational biology and relational databases (Aho, Sagiv, Szymanski, and Ullman, 1981): given $n$ points, we want to embed them onto the $n$ leaves of a rooted binary tree (a hierarchical clustering or ultrametric embedding) such that a given set of $m$ triplet constraints is satisfied. Triplet $ij|k$ indicates that ``$i, j$ are more closely related to each other than to $k$'' and a tree satisfies $ij|k$ if $d(i,j)$ is the smallest among the 3 distances. Aho et al. (1981) gave an elegant efficient algorithm to find a tree respecting all constraints (if it exists) and it is easy to see that a random binary tree is a 1/3-approximation. Unfortunately, despite more than four decades of research, no better approximation is known.
Our main theorem--which captures Triplet Reconstruction as a special case--is a general hardness of approximation result about Constraint Satisfaction Problems (CSPs) over infinite domains (the variables are mapped to any of the $n$ leaves of a tree). Specifically, we prove, under Unique Games (Khot, 2002), that Triplet Reconstruction and more generally, every CSP over hierarchies is approximation resistant (there is no polynomial-time algorithm that does asymptotically better than a biased random assignment). This settles the approximability for many interesting Subtree or Supertree Aggregation Problems. More broadly, our result significantly extends the list of approximation resistant predicates and is a generalization of Guruswami, Hastad, Manokaran, Raghavendra, and Charikar (2011), who showed that ordering CSPs are approximation resistant. The main challenge in our analyses stems from the fact that trees have topology which is what determines whether a given triplet constraint on the leaves is satisfied or not.
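To make the triplet constraints in this abstract concrete, the sketch below checks whether a rooted binary tree satisfies $ij|k$. It is an illustration only, not the algorithm of Aho et al. and not the authors' code; encoding a tree as nested 2-tuples with leaf labels at the bottom is an assumption made for brevity.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels under a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)

def satisfies(tree, i, j, k):
    """True if some subtree contains i and j but not k, i.e. the tree satisfies ij|k."""
    if not isinstance(tree, tuple):
        return False
    under = leaves(tree)
    if i in under and j in under and k not in under:
        return True
    return any(satisfies(child, i, j, k) for child in tree)

def fraction_satisfied(tree, triplets):
    """Fraction of the given (i, j, k) constraints, read as ij|k, that the tree satisfies."""
    return sum(satisfies(tree, i, j, k) for i, j, k in triplets) / len(triplets)

tree = ((1, 2), (3, 4))
print(satisfies(tree, 1, 2, 3))   # True: the subtree (1, 2) separates 1, 2 from 3
print(satisfies(tree, 1, 3, 2))   # False: only the root contains both 1 and 3

# For any three leaves of a binary tree, exactly one of ij|k, ik|j, jk|i holds.
for a, b, c in combinations(sorted(leaves(tree)), 3):
    assert satisfies(tree, a, b, c) + satisfies(tree, a, c, b) + satisfies(tree, b, c, a) == 1
```

The final loop reflects the fact behind the 1/3 baseline: each triple of leaves satisfies exactly one of its three possible triplets, so by symmetry a uniformly random tree satisfies any fixed constraint with probability 1/3.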
Submitted 5 April, 2023; v1 submitted 24 December, 2022;
originally announced December 2022.
-
On Scrambling Phenomena for Randomly Initialized Recurrent Networks
Authors:
Vaggos Chatziafratis,
Ioannis Panageas,
Clayton Sanford,
Stelios Andrew Stavroulakis
Abstract:
Recurrent Neural Networks (RNNs) frequently exhibit complicated dynamics, and their sensitivity to the initialization process often renders them notoriously hard to train. Recent works have shed light on such phenomena analyzing when exploding or vanishing gradients may occur, either of which is detrimental for training dynamics. In this paper, we point to a formal connection between RNNs and chaotic dynamical systems and prove a qualitatively stronger phenomenon about RNNs than what exploding gradients seem to suggest. Our main result proves that under standard initialization (e.g., He, Xavier etc.), RNNs will exhibit \textit{Li-Yorke chaos} with \textit{constant} probability \textit{independent} of the network's width. This explains the experimentally observed phenomenon of \textit{scrambling}, under which trajectories of nearby points may appear to be arbitrarily close during some timesteps, yet will be far away in future timesteps. In stark contrast to their feedforward counterparts, we show that chaotic behavior in RNNs is preserved under small perturbations and that their expressive power remains exponential in the number of feedback iterations. Our technical arguments rely on viewing RNNs as random walks under non-linear activations, and studying the existence of certain types of higher-order fixed points called \textit{periodic points} that lead to phase transitions from order to chaos.
Submitted 11 October, 2022;
originally announced October 2022.
-
Efficiently Computing Nash Equilibria in Adversarial Team Markov Games
Authors:
Fivos Kalogiannis,
Ioannis Anagnostides,
Ioannis Panageas,
Emmanouil-Vasileios Vlatakis-Gkaragkounis,
Vaggos Chatziafratis,
Stelios Stavroulakis
Abstract:
Computing Nash equilibrium policies is a central problem in multi-agent reinforcement learning that has received extensive attention both in theory and in practice. However, provable guarantees have been thus far either limited to fully competitive or cooperative scenarios or impose strong assumptions that are difficult to meet in most practical applications. In this work, we depart from those prior results by investigating infinite-horizon \emph{adversarial team Markov games}, a natural and well-motivated class of games in which a team of identically-interested players -- in the absence of any explicit coordination or communication -- is competing against an adversarial player. This setting allows for a unifying treatment of zero-sum Markov games and Markov potential games, and serves as a step to model more realistic strategic interactions that feature both competing and cooperative interests. Our main contribution is the first algorithm for computing stationary $ε$-approximate Nash equilibria in adversarial team Markov games with computational complexity that is polynomial in all the natural parameters of the game, as well as $1/ε$. The proposed algorithm is particularly natural and practical, and it is based on performing independent policy gradient steps for each player in the team, in tandem with best responses from the side of the adversary; in turn, the policy for the adversary is then obtained by solving a carefully constructed linear program. Our analysis leverages non-standard techniques to establish the KKT optimality conditions for a nonlinear program with nonconvex constraints, thereby leading to a natural interpretation of the induced Lagrange multipliers. Along the way, we significantly extend an important characterization of optimal policies in adversarial (normal-form) team games due to Von Stengel and Koller (GEB `97).
Submitted 3 August, 2022;
originally announced August 2022.
-
Hierarchical Clustering in Graph Streams: Single-Pass Algorithms and Space Lower Bounds
Authors:
Sepehr Assadi,
Vaggos Chatziafratis,
Jakub Łącki,
Vahab Mirrokni,
Chen Wang
Abstract:
The Hierarchical Clustering (HC) problem consists of building a hierarchy of clusters to represent a given dataset. Motivated by modern large-scale applications, we study the problem in the streaming model, in which the memory is heavily limited and only a single or very few passes over the input are allowed. Specifically, we investigate whether a good hierarchical clustering can be obtained, or at least whether we can approximately estimate the value of the optimal hierarchy. To measure the quality of a hierarchy, we use the HC minimization objective introduced by Dasgupta. Assuming that the input is an $n$-vertex weighted graph whose edges arrive in a stream, we derive the following results on space-vs-accuracy tradeoffs:
* With $O(n\cdot \text{polylog}\,{n})$ space, we develop a single-pass algorithm, whose approximation ratio matches the currently best offline algorithm.
* When the space is more limited, namely, $n^{1-o(1)}$, we prove that no algorithm can even estimate the value of optimum HC tree to within an $o(\frac{\log{n}}{\log\log{n}})$ factor, even when allowed $\text{polylog}{\,{n}}$ passes over the input.
* In the most stringent setting of $\text{polylog}\,{n}$ space, we rule out algorithms that can even distinguish between "highly"-vs-"poorly" clusterable graphs, namely, graphs that have an $n^{1/2-o(1)}$ factor gap between their HC objective value.
* Finally, we prove that any single-pass streaming algorithm that computes an optimal HC tree must store almost the entire input, even if allowed exponential time.
Our algorithmic results establish a general structural result: cut sparsifiers of the input graph can preserve the cost of "balanced" HC trees to within a constant factor. Our lower bound results include a new streaming lower bound for a novel problem, "One-vs-Many-Expanders", which may be of independent interest.
Submitted 15 June, 2022;
originally announced June 2022.
-
Expressivity of Neural Networks via Chaotic Itineraries beyond Sharkovsky's Theorem
Authors:
Clayton Sanford,
Vaggos Chatziafratis
Abstract:
Given a target function $f$, how large must a neural network be in order to approximate $f$? Recent works examine this basic question on neural network \textit{expressivity} from the lens of dynamical systems and provide novel ``depth-vs-width'' tradeoffs for a large family of functions $f$. They suggest that such tradeoffs are governed by the existence of \textit{periodic} points or \emph{cycles} in $f$. Our work, by further deploying dynamical systems concepts, illuminates a more subtle connection between periodicity and expressivity: we prove that periodic points alone lead to suboptimal depth-width tradeoffs and we improve upon them by demonstrating that certain ``chaotic itineraries'' give stronger exponential tradeoffs, even in regimes where previous analyses only imply polynomial gaps. Contrary to prior works, our bounds are nearly-optimal, tighten as the period increases, and handle strong notions of inapproximability (e.g., constant $L_1$ error). More broadly, we identify a phase transition to the \textit{chaotic regime} that exactly coincides with an abrupt shift in other notions of function complexity, including VC-dimension and topological entropy.
Submitted 19 October, 2021;
originally announced October 2021.
-
Maximizing Agreements for Ranking, Clustering and Hierarchical Clustering via MAX-CUT
Authors:
Vaggos Chatziafratis,
Mohammad Mahdian,
Sara Ahmadian
Abstract:
In this paper, we study a number of well-known combinatorial optimization problems that fit in the following paradigm: the input is a collection of (potentially inconsistent) local relationships between the elements of a ground set (e.g., pairwise comparisons, similar/dissimilar pairs, or ancestry structure of triples of points), and the goal is to aggregate this information into a global structure (e.g., a ranking, a clustering, or a hierarchical clustering) in a way that maximizes agreement with the input. Well-studied problems such as rank aggregation, correlation clustering, and hierarchical clustering with triplet constraints fall in this class of problems.
We study these problems on stochastic instances with a hidden embedded ground truth solution. Our main algorithmic contribution is a unified technique that uses the maximum cut problem in graphs to approximately solve these problems. Using this technique, we can often get approximation guarantees in the stochastic setting that are better than the known worst case inapproximability bounds for the corresponding problem. On the negative side, we improve the worst case inapproximability bound on several hierarchical clustering formulations through a reduction to related ranking problems.
Submitted 23 February, 2021;
originally announced February 2021.
-
Hierarchical Clustering via Sketches and Hierarchical Correlation Clustering
Authors:
Danny Vainstein,
Vaggos Chatziafratis,
Gui Citovsky,
Anand Rajagopalan,
Mohammad Mahdian,
Yossi Azar
Abstract:
Recently, Hierarchical Clustering (HC) has been considered through the lens of optimization. In particular, two maximization objectives have been defined. Moseley and Wang defined the \emph{Revenue} objective to handle similarity information given by a weighted graph on the data points (w.l.o.g., $[0,1]$ weights), while Cohen-Addad et al. defined the \emph{Dissimilarity} objective to handle dissimilarity information. In this paper, we prove structural lemmas for both objectives allowing us to convert any HC tree to a tree with constant number of internal nodes while incurring an arbitrarily small loss in each objective. Although the best-known approximations are 0.585 and 0.667 respectively, using our lemmas we obtain approximations arbitrarily close to 1, if not all weights are small (i.e., there exist constants $ε, δ$ such that the fraction of weights smaller than $δ$, is at most $1 - ε$); such instances encompass many metric-based similarity instances, thereby improving upon prior work. Finally, we introduce Hierarchical Correlation Clustering (HCC) to handle instances that contain similarity and dissimilarity information simultaneously. For HCC, we provide an approximation of 0.4767 and for complementary similarity/dissimilarity weights (analogous to $+/-$ correlation clustering), we again present nearly-optimal approximations.
Submitted 26 January, 2021;
originally announced January 2021.
-
Inapproximability for Local Correlation Clustering and Dissimilarity Hierarchical Clustering
Authors:
Vaggos Chatziafratis,
Neha Gupta,
Euiwoong Lee
Abstract:
We present hardness of approximation results for Correlation Clustering with local objectives and for Hierarchical Clustering with dissimilarity information. For the former, we study the local objective of Puleo and Milenkovic (ICML '16) that prioritizes reducing the disagreements at data points that are worst off, and for the latter we study the maximization version of Dasgupta's cost function (STOC '16). Our APX hardness results imply that the two problems are hard to approximate within a constant of 4/3 ~ 1.33 (assuming P ≠ NP) and 9159/9189 ~ 0.9967 (assuming the Unique Games Conjecture), respectively.
Submitted 3 October, 2020;
originally announced October 2020.
-
From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering
Authors:
Ines Chami,
Albert Gu,
Vaggos Chatziafratis,
Christopher Ré
Abstract:
Similarity-based Hierarchical Clustering (HC) is a classical unsupervised machine learning algorithm that has traditionally been solved with heuristic algorithms like Average-Linkage. Recently, Dasgupta reframed HC as a discrete optimization problem by introducing a global cost function measuring the quality of a given tree. In this work, we provide the first continuous relaxation of Dasgupta's discrete optimization problem with provable quality guarantees. The key idea of our method, HypHC, is showing a direct correspondence from discrete trees to continuous representations (via the hyperbolic embeddings of their leaf nodes) and back (via a decoding algorithm that maps leaf embeddings to a dendrogram), allowing us to search the space of discrete binary trees with continuous optimization. Building on analogies between trees and hyperbolic space, we derive a continuous analogue for the notion of lowest common ancestor, which leads to a continuous relaxation of Dasgupta's discrete objective. We can show that after decoding, the global minimizer of our continuous relaxation yields a discrete tree with a (1 + epsilon)-factor approximation for Dasgupta's optimal tree, where epsilon can be made arbitrarily small and controls optimization challenges. We experimentally evaluate HypHC on a variety of HC benchmarks and find that even approximate solutions found with gradient descent achieve better clustering quality than agglomerative heuristics or other gradient-based algorithms. Finally, we highlight the flexibility of HypHC using end-to-end training in a downstream classification task.
Submitted 1 October, 2020;
originally announced October 2020.
-
Better Depth-Width Trade-offs for Neural Networks through the lens of Dynamical Systems
Authors:
Vaggos Chatziafratis,
Sai Ganesh Nagarajan,
Ioannis Panageas
Abstract:
The expressivity of neural networks as a function of their depth, width and type of activation units has been an important question in deep learning theory. Recently, depth separation results for ReLU networks were obtained via a new connection with dynamical systems, using a generalized notion of fixed points of a continuous map $f$, called periodic points. In this work, we strengthen the connection with dynamical systems and we improve the existing width lower bounds along several aspects. Our first main result is period-specific width lower bounds that hold under the stronger notion of $L^1$-approximation error, instead of the weaker classification error. Our second contribution is that we provide sharper width lower bounds, still yielding meaningful exponential depth-width separations, in regimes where previous results wouldn't apply. A byproduct of our results is that there exists a universal constant characterizing the depth-width trade-offs, as long as $f$ has odd periods. Technically, our results follow by unveiling a tighter connection between the following three quantities of a given function: its period, its Lipschitz constant and the growth rate of the number of oscillations arising under compositions of the function $f$ with itself.
Submitted 20 July, 2020; v1 submitted 2 March, 2020;
originally announced March 2020.
-
Bisect and Conquer: Hierarchical Clustering via Max-Uncut Bisection
Authors:
Sara Ahmadian,
Vaggos Chatziafratis,
Alessandro Epasto,
Euiwoong Lee,
Mohammad Mahdian,
Konstantin Makarychev,
Grigory Yaroslavtsev
Abstract:
Hierarchical Clustering is an unsupervised data analysis method which has been widely used for decades. Despite its popularity, it had an underdeveloped analytical foundation and to address this, Dasgupta recently introduced an optimization viewpoint of hierarchical clustering with pairwise similarity information that spurred a line of work shedding light on old algorithms (e.g., Average-Linkage), but also designing new algorithms. Here, for the maximization dual of Dasgupta's objective (introduced by Moseley-Wang), we present polynomial-time .4246 approximation algorithms that use Max-Uncut Bisection as a subroutine. The previous best worst-case approximation factor in polynomial time was .336, improving only slightly over Average-Linkage which achieves 1/3. Finally, we complement our positive results by providing APX-hardness (even for 0-1 similarities), under the Small Set Expansion hypothesis.
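For readers unfamiliar with the two objectives named in this abstract, the sketch below evaluates Dasgupta's cost, $\sum_{i<j} w_{ij}\,|\mathrm{leaves}(T[i \vee j])|$, and the Moseley-Wang objective, $\sum_{i<j} w_{ij}\,(n - |\mathrm{leaves}(T[i \vee j])|)$, on a small tree. The nested-tuple tree encoding, the function names, and the example weights are assumptions for illustration; this is not the paper's approximation algorithm.

```python
def leaves(tree):
    """Return the set of leaf labels under a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)

def lca_sizes(tree, sizes=None):
    """Map each unordered leaf pair to the number of leaves under its lowest common ancestor."""
    if sizes is None:
        sizes = {}
    if isinstance(tree, tuple):
        left, right = tree
        left_leaves, right_leaves = leaves(left), leaves(right)
        for a in left_leaves:
            for b in right_leaves:
                # a and b are first separated at this node, so it is their LCA.
                sizes[frozenset((a, b))] = len(left_leaves) + len(right_leaves)
        lca_sizes(left, sizes)
        lca_sizes(right, sizes)
    return sizes

def dasgupta_cost(tree, weights):
    """Dasgupta's minimization objective for pairwise similarities `weights`."""
    sizes = lca_sizes(tree)
    return sum(w * sizes[frozenset(pair)] for pair, w in weights.items())

def moseley_wang_revenue(tree, weights):
    """Moseley-Wang maximization objective for the same similarities."""
    n = len(leaves(tree))
    sizes = lca_sizes(tree)
    return sum(w * (n - sizes[frozenset(pair)]) for pair, w in weights.items())

weights = {("a", "b"): 10, ("c", "d"): 10, ("a", "c"): 1}
good = (("a", "b"), ("c", "d"))   # merges the heavy pairs first
bad = (("a", "c"), ("b", "d"))    # separates the heavy pairs at the root
for tree in (good, bad):
    print(dasgupta_cost(tree, weights), moseley_wang_revenue(tree, weights))
```

For every tree the two values sum to $n \sum_{i<j} w_{ij}$, which is why minimizing one is equivalent to maximizing the other even though their approximation ratios differ.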
Submitted 15 December, 2019;
originally announced December 2019.
-
Depth-Width Trade-offs for ReLU Networks via Sharkovsky's Theorem
Authors:
Vaggos Chatziafratis,
Sai Ganesh Nagarajan,
Ioannis Panageas,
Xiao Wang
Abstract:
Understanding the representational power of Deep Neural Networks (DNNs) and how their structural properties (e.g., depth, width, type of activation unit) affect the functions they can compute, has been an important yet challenging question in deep learning and approximation theory. In a seminal paper, Telgarsky highlighted the benefits of depth by presenting a family of functions (based on simple triangular waves) for which DNNs achieve zero classification error, whereas shallow networks with fewer than exponentially many nodes incur constant error. Even though Telgarsky's work reveals the limitations of shallow neural networks, it does not inform us on why these functions are difficult to represent and in fact he states it as a tantalizing open question to characterize those functions that cannot be well-approximated by smaller depths.
In this work, we point to a new connection between DNN expressivity and Sharkovsky's Theorem from dynamical systems, that enables us to characterize the depth-width trade-offs of ReLU networks for representing functions based on the presence of a generalized notion of fixed points, called periodic points (a fixed point is a point of period 1). Motivated by our observation that the triangle waves used in Telgarsky's work contain points of period 3 - a period that is special in that it implies chaotic behavior based on the celebrated result by Li-Yorke - we proceed to give general lower bounds for the width needed to represent periodic functions as a function of the depth. Technically, the crux of our approach is based on an eigenvalue analysis of the dynamical system associated with such functions.
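The period-3 observation can be checked by hand on the standard tent map, which is assumed here to be the triangle wave in question. The sketch uses exact rational arithmetic to exhibit a 3-cycle and to count how the number of peaks grows under repeated composition; the function names are illustrative and not taken from the paper.

```python
from fractions import Fraction

def tent(x):
    """Tent (triangle) map on [0, 1]: rises to 1 at x = 1/2, then falls back to 0."""
    return 2 * x if x <= Fraction(1, 2) else 2 * (1 - x)

def orbit(x, steps):
    """Return [x, tent(x), tent(tent(x)), ...] with `steps` applications."""
    points = [x]
    for _ in range(steps):
        points.append(tent(points[-1]))
    return points

def count_peaks(k):
    """Count grid points j / 2^k where the k-fold composition of tent equals 1."""
    peaks = 0
    for j in range(2 ** k + 1):
        if orbit(Fraction(j, 2 ** k), k)[-1] == 1:
            peaks += 1
    return peaks

print(orbit(Fraction(2, 7), 3))              # 2/7 -> 4/7 -> 6/7 -> 2/7, a point of period 3
print([count_peaks(k) for k in (1, 2, 3)])   # [1, 2, 4]: peaks double with each composition
```

The doubling of peaks is the oscillation growth that depth provides cheaply through composition, while a shallow network must pay for each oscillation with additional units.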
Submitted 9 December, 2019;
originally announced December 2019.
-
Adversarially Robust Low Dimensional Representations
Authors:
Pranjal Awasthi,
Vaggos Chatziafratis,
Xue Chen,
Aravindan Vijayaraghavan
Abstract:
Many machine learning systems are vulnerable to small perturbations made to inputs either at test time or at training time. This has received much recent interest on the empirical front due to applications where reliability and security are critical. However, theoretical understanding of algorithms that are robust to adversarial perturbations is limited.
In this work we focus on Principal Component Analysis (PCA), a ubiquitous algorithmic primitive in machine learning. We formulate a natural robust variant of PCA where the goal is to find a low dimensional subspace to represent the given data with minimum projection error, that is in addition robust to small perturbations measured in $\ell_q$ norm (say $q=\infty$). Unlike PCA which is solvable in polynomial time, our formulation is computationally intractable to optimize as it captures a variant of the well-studied sparse PCA objective as a special case. We show the following results:
-We give a polynomial-time algorithm that is constant-factor competitive in the worst case with respect to the best subspace, in terms of the projection error and the robustness criterion.
-We show that our algorithmic techniques can also be made robust to adversarial training-time perturbations, in addition to yielding representations that are robust to adversarial perturbations at test time. Specifically, we design algorithms for a strong notion of training-time perturbations, where every point is adversarially perturbed up to a specified amount.
-We illustrate the broad applicability of our algorithmic techniques in addressing robustness to adversarial perturbations, both at training time and test time. In particular, our adversarially robust PCA primitive leads to computationally efficient and robust algorithms for both unsupervised and supervised learning problems such as clustering and learning adversarially robust classifiers.
Submitted 13 August, 2021; v1 submitted 29 November, 2019;
originally announced November 2019.
-
Hierarchical Clustering for Euclidean Data
Authors:
Moses Charikar,
Vaggos Chatziafratis,
Rad Niazadeh,
Grigory Yaroslavtsev
Abstract:
Recent works on Hierarchical Clustering (HC), a well-studied problem in exploratory data analysis, have focused on optimizing various objective functions for this problem under arbitrary similarity measures. In this paper we take the first step and give novel scalable algorithms for this problem tailored to Euclidean data in R^d and under vector-based similarity measures, a prevalent model in several typical machine learning applications. We focus primarily on the popular Gaussian kernel and other related measures, presenting our results through the lens of the objective introduced recently by Moseley and Wang [2017]. We show that the approximation factor in Moseley and Wang [2017] can be improved for Euclidean data. We further demonstrate both theoretically and experimentally that our algorithms scale to very high dimension d, while outperforming average-linkage and showing competitive results against other less scalable approaches.
Submitted 26 December, 2018;
originally announced December 2018.
-
Bilu-Linial stability, certified algorithms and the Independent Set problem
Authors:
Haris Angelidakis,
Pranjal Awasthi,
Avrim Blum,
Vaggos Chatziafratis,
Chen Dan
Abstract:
We study the Maximum Independent Set (MIS) problem under the notion of stability introduced by Bilu and Linial (2010): a weighted instance of MIS is $γ$-stable if it has a unique optimal solution that remains the unique optimum under multiplicative perturbations of the weights by a factor of at most $γ\geq 1$. The goal then is to efficiently recover the unique optimal solution. In this work, we solve stable instances of MIS on several graph classes: we solve $\widetilde{O}(Δ/\sqrt{\log Δ})$-stable instances on graphs of maximum degree $Δ$, $(k - 1)$-stable instances on $k$-colorable graphs and $(1 + \varepsilon)$-stable instances on planar graphs. For general graphs, we present a strong lower bound showing that there are no efficient algorithms for $O(n^{\frac{1}{2} - \varepsilon})$-stable instances of MIS, assuming the planted clique conjecture. We also give an algorithm for $(\varepsilon n)$-stable instances. As a by-product of our techniques, we give algorithms and lower bounds for stable instances of Node Multiway Cut. Furthermore, we prove a general result showing that the integrality gap of convex relaxations of several maximization problems reduces dramatically on stable instances.
Moreover, we initiate the study of certified algorithms, a notion recently introduced by Makarychev and Makarychev (2018), which is a class of $γ$-approximation algorithms that satisfy one crucial property: the solution returned is optimal for a perturbation of the original instance. We obtain $Δ$-certified algorithms for MIS on graphs of maximum degree $Δ$, and $(1+\varepsilon)$-certified algorithms on planar graphs. Finally, we analyze the algorithm of Berman and Furer (1994) and prove that it is a $\left(\frac{Δ+ 1}{3} + \varepsilon\right)$-certified algorithm for MIS on graphs of maximum degree $Δ$ where all weights are equal to 1.
Submitted 29 November, 2021; v1 submitted 19 October, 2018;
originally announced October 2018.
-
Hierarchical Clustering better than Average-Linkage
Authors:
Moses Charikar,
Vaggos Chatziafratis,
Rad Niazadeh
Abstract:
Hierarchical Clustering (HC) is a widely studied problem in exploratory data analysis, usually tackled by simple agglomerative procedures like average-linkage, single-linkage or complete-linkage. In this paper we focus on two objectives, introduced recently to give insight into the performance of average-linkage clustering: a similarity based HC objective proposed by [Moseley and Wang, 2017] and a dissimilarity based HC objective proposed by [Cohen-Addad et al., 2018]. In both cases, we present tight counterexamples showing that average-linkage cannot obtain better than 1/3 and 2/3 approximations respectively (in the worst-case), settling an open question raised in [Moseley and Wang, 2017]. This matches the approximation ratio of a random solution, raising a natural question: can we beat average-linkage for these objectives? We answer this in the affirmative, giving two new algorithms based on semidefinite programming with provably better guarantees.
Submitted 7 August, 2018;
originally announced August 2018.
-
On the Computational Power of Online Gradient Descent
Authors:
Vaggos Chatziafratis,
Tim Roughgarden,
Joshua R. Wang
Abstract:
We prove that the evolution of weight vectors in online gradient descent can encode arbitrary polynomial-space computations, even in very simple learning settings. Our results imply that, under weak complexity-theoretic assumptions, it is impossible to reason efficiently about the fine-grained behavior of online gradient descent.
Submitted 6 February, 2019; v1 submitted 3 July, 2018;
originally announced July 2018.
-
Hierarchical Clustering with Structural Constraints
Authors:
Vaggos Chatziafratis,
Rad Niazadeh,
Moses Charikar
Abstract:
Hierarchical clustering is a popular unsupervised data analysis method. For many real-world applications, we would like to exploit prior information about the data that imposes constraints on the clustering hierarchy, and is not captured by the set of features available to the algorithm. This gives rise to the problem of "hierarchical clustering with structural constraints". Structural constraints pose major challenges for bottom-up approaches like average/single linkage and even though they can be naturally incorporated into top-down divisive algorithms, no formal guarantees exist on the quality of their output. In this paper, we provide provable approximation guarantees for two simple top-down algorithms, using a recently introduced optimization viewpoint of hierarchical clustering with pairwise similarity information [Dasgupta, 2016]. We show how to find good solutions even in the presence of conflicting prior information, by formulating a constraint-based regularization of the objective. We further explore a variation of this objective for dissimilarity information [Cohen-Addad et al., 2018] and improve upon current techniques. Finally, we demonstrate our approach on a real dataset for the taxonomy application.
Submitted 14 July, 2018; v1 submitted 23 May, 2018;
originally announced May 2018.
-
Stability and Recovery for Independence Systems
Authors:
Vaggos Chatziafratis,
Tim Roughgarden,
Jan Vondrak
Abstract:
Two genres of heuristics that are frequently reported to perform much better on "real-world" instances than in the worst case are greedy algorithms and local search algorithms. In this paper, we systematically study these two types of algorithms for the problem of maximizing a monotone submodular set function subject to downward-closed feasibility constraints. We consider perturbation-stable instances, in the sense of Bilu and Linial, and precisely identify the stability threshold beyond which these algorithms are guaranteed to recover the optimal solution. Byproducts of our work include the first definition of perturbation-stability for non-additive objective functions, and a resolution of the worst-case approximation guarantee of local search in p-extendible systems.
Submitted 30 June, 2017; v1 submitted 29 April, 2017;
originally announced May 2017.
-
Approximate Hierarchical Clustering via Sparsest Cut and Spreading Metrics
Authors:
Moses Charikar,
Vaggos Chatziafratis
Abstract:
Dasgupta recently introduced a cost function for the hierarchical clustering of a set of points given pairwise similarities between them. He showed that this function is NP-hard to optimize, but a top-down recursive partitioning heuristic based on an alpha_n-approximation algorithm for uniform sparsest cut gives an approximation of O(alpha_n log n) (the current best algorithm has alpha_n=O(sqrt{log n})). We show that the aforementioned sparsest cut heuristic in fact obtains an O(alpha_n)-approximation for hierarchical clustering. The algorithm also applies to a generalized cost function studied by Dasgupta. Moreover, we obtain a strong inapproximability result, showing that the hierarchical clustering objective is hard to approximate to within any constant factor assuming the Small-Set Expansion (SSE) Hypothesis. Finally, we discuss approximation algorithms based on convex relaxations. We present a spreading metric SDP relaxation for the problem and show that it has integrality gap at most O(sqrt{log n}). The advantage of the SDP relative to the sparsest cut heuristic is that it provides an explicit lower bound on the optimal solution and could potentially yield an even better approximation for hierarchical clustering. In fact our analysis of this SDP served as the inspiration for our improved analysis of the sparsest cut heuristic. We also show that a spreading metric LP relaxation gives an O(log n)-approximation.
Submitted 29 September, 2016;
originally announced September 2016.