Search | arXiv e-print repository

Competitive Algorithms for Online Knapsack with Succinct Predictions

Authors: Mohammadreza Daneshvaramoli, Helia Karisani, Adam Lechowicz, Bo Sun, Cameron Musco, Mohammad Hajiesmaili

Abstract: In the online knapsack problem, the goal is to pack items arriving online with different values and weights into a capacity-limited knapsack to maximize the total value of the accepted items. We study \textit{learning-augmented} algorithms for this problem, which aim to use machine-learned predictions to move beyond pessimistic worst-case guarantees. Existing learning-augmented algorithms for onli… ▽ More In the online knapsack problem, the goal is to pack items arriving online with different values and weights into a capacity-limited knapsack to maximize the total value of the accepted items. We study \textit{learning-augmented} algorithms for this problem, which aim to use machine-learned predictions to move beyond pessimistic worst-case guarantees. Existing learning-augmented algorithms for online knapsack consider relatively complicated prediction models that give an algorithm substantial information about the input, such as the total weight of items at each value. In practice, such predictions can be error-sensitive and difficult to learn. Motivated by this limitation, we introduce a family of learning-augmented algorithms for online knapsack that use \emph{succinct predictions}. In particular, the machine-learned prediction given to the algorithm is just a single value or interval that estimates the minimum value of any item accepted by an offline optimal solution. By leveraging a relaxation to online \emph{fractional} knapsack, we design algorithms that can leverage such succinct predictions in both the trusted setting (i.e., with perfect prediction) and the untrusted setting, where we prove that a simple meta-algorithm achieves a nearly optimal consistency-robustness trade-off. Empirically, we show that our algorithms significantly outperform baselines that do not use predictions and often outperform algorithms based on more complex prediction models. △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: 29 pages, 10 figures, Submitted to NeurIPS 2024

MSC Class: 68Q25; 68T05 ACM Class: F.2.2; I.2.6

arXiv:2406.07521 [pdf, other]

Faster Spectral Density Estimation and Sparsification in the Nuclear Norm

Authors: Yujia **, Ishani Karmarkar, Christopher Musco, Aaron Sidford, Apoorv Vikram Singh

Abstract: We consider the problem of estimating the spectral density of the normalized adjacency matrix of an $n$-node undirected graph. We provide a randomized algorithm that, with $O(nε^{-2})$ queries to a degree and neighbor oracle and in $O(nε^{-3})$ time, estimates the spectrum up to $ε$ accuracy in the Wasserstein-1 metric. This improves on previous state-of-the-art methods, including an $O(nε^{-7})$… ▽ More We consider the problem of estimating the spectral density of the normalized adjacency matrix of an $n$-node undirected graph. We provide a randomized algorithm that, with $O(nε^{-2})$ queries to a degree and neighbor oracle and in $O(nε^{-3})$ time, estimates the spectrum up to $ε$ accuracy in the Wasserstein-1 metric. This improves on previous state-of-the-art methods, including an $O(nε^{-7})$ time algorithm from [Braverman et al., STOC 2022] and, for sufficiently small $ε$, a $2^{O(ε^{-1})}$ time method from [Cohen-Steiner et al., KDD 2018]. To achieve this result, we introduce a new notion of graph sparsification, which we call nuclear sparsification. We provide an $O(nε^{-2})$-query and $O(nε^{-2})$-time algorithm for computing $O(nε^{-2})$-sparse nuclear sparsifiers. We show that this bound is optimal in both its sparsity and query complexity, and we separate our results from the related notion of additive spectral sparsification. Of independent interest, we show that our sparsification method also yields the first deterministic algorithm for spectral density estimation that scales linearly with $n$ (sublinear in the representation size of the graph). △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted for presentation at the Conference on Learning Theory (COLT) 2024

arXiv:2405.18680 [pdf, ps, other]

Navigable Graphs for High-Dimensional Nearest Neighbor Search: Constructions and Limits

Authors: Haya Diwan, **rui Gou, Cameron Musco, Christopher Musco, Torsten Suel

Abstract: There has been significant recent interest in graph-based nearest neighbor search methods, many of which are centered on the construction of navigable graphs over high-dimensional point sets. A graph is navigable if we can successfully move from any starting node to any target node using a greedy routing strategy where we always move to the neighbor that is closest to the destination according to… ▽ More There has been significant recent interest in graph-based nearest neighbor search methods, many of which are centered on the construction of navigable graphs over high-dimensional point sets. A graph is navigable if we can successfully move from any starting node to any target node using a greedy routing strategy where we always move to the neighbor that is closest to the destination according to a given distance function. The complete graph is navigable for any point set, but the important question for applications is if sparser graphs can be constructed. While this question is fairly well understood in low-dimensions, we establish some of the first upper and lower bounds for high-dimensional point sets. First, we give a simple and efficient way to construct a navigable graph with average degree $O(\sqrt{n \log n })$ for any set of $n$ points, in any dimension, for any distance function. We compliment this result with a nearly matching lower bound: even under the Euclidean metric in $O(\log n)$ dimensions, a random point set has no navigable graph with average degree $O(n^α)$ for any $α< 1/2$. Our lower bound relies on sharp anti-concentration bounds for binomial random variables, which we use to show that the near-neighborhoods of a set of random points do not overlap significantly, forcing any navigable graph to have many edges. △ Less

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.09312 [pdf, ps, other]

Agnostic Active Learning of Single Index Models with Linear Sample Complexity

Authors: Aarshvi Gajjar, Wai Ming Tai, Xingyu Xu, Chinmay Hegde, Christopher Musco, Yi Li

Abstract: We study active learning methods for single index models of the form $F({\mathbf x}) = f(\langle {\mathbf w}, {\mathbf x}\rangle)$, where $f:\mathbb{R} \to \mathbb{R}$ and ${\mathbf x,\mathbf w} \in \mathbb{R}^d$. In addition to their theoretical interest as simple examples of non-linear neural networks, single index models have received significant recent attention due to applications in scientif… ▽ More We study active learning methods for single index models of the form $F({\mathbf x}) = f(\langle {\mathbf w}, {\mathbf x}\rangle)$, where $f:\mathbb{R} \to \mathbb{R}$ and ${\mathbf x,\mathbf w} \in \mathbb{R}^d$. In addition to their theoretical interest as simple examples of non-linear neural networks, single index models have received significant recent attention due to applications in scientific machine learning like surrogate modeling for partial differential equations (PDEs). Such applications require sample-efficient active learning methods that are robust to adversarial noise. I.e., that work even in the challenging agnostic learning setting. We provide two main results on agnostic active learning of single index models. First, when $f$ is known and Lipschitz, we show that $\tilde{O}(d)$ samples collected via {statistical leverage score sampling} are sufficient to learn a near-optimal single index model. Leverage score sampling is simple to implement, efficient, and already widely used for actively learning linear models. Our result requires no assumptions on the data distribution, is optimal up to log factors, and improves quadratically on a recent ${O}(d^{2})$ bound of \cite{gajjar2023active}. Second, we show that $\tilde{O}(d)$ samples suffice even in the more difficult setting when $f$ is \emph{unknown}. Our results leverage tools from high dimensional probability, including Dudley's inequality and dual Sudakov minoration, as well as a novel, distribution-aware discretization of the class of Lipschitz functions. △ Less

Submitted 16 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

arXiv:2405.05865 [pdf, ps, other]

Faster Linear Systems and Matrix Norm Approximation via Multi-level Sketched Preconditioning

Authors: Michał Dereziński, Christopher Musco, Jiaming Yang

Abstract: We present a new class of preconditioned iterative methods for solving linear systems of the form $Ax = b$. Our methods are based on constructing a low-rank Nyström approximation to $A$ using sparse random sketching. This approximation is used to construct a preconditioner, which itself is inverted quickly using additional levels of random sketching and preconditioning. We prove that the convergen… ▽ More We present a new class of preconditioned iterative methods for solving linear systems of the form $Ax = b$. Our methods are based on constructing a low-rank Nyström approximation to $A$ using sparse random sketching. This approximation is used to construct a preconditioner, which itself is inverted quickly using additional levels of random sketching and preconditioning. We prove that the convergence of our methods depends on a natural average condition number of $A$, which improves as the rank of the Nyström approximation increases. Concretely, this allows us to obtain faster runtimes for a number of fundamental linear algebraic problems: 1. We show how to solve any $n\times n$ linear system that is well-conditioned except for $k$ outlying large singular values in $\tilde{O}(n^{2.065} + k^ω)$ time, improving on a recent result of [Dereziński, Yang, STOC 2024] for all $k \gtrsim n^{0.78}$. 2. We give the first $\tilde{O}(n^2 + {d_λ}^ω$) time algorithm for solving a regularized linear system $(A + λI)x = b$, where $A$ is positive semidefinite with effective dimension $d_λ$. This problem arises in applications like Gaussian process regression. 3. We give faster algorithms for approximating Schatten $p$-norms and other matrix norms. For example, for the Schatten 1 (nuclear) norm, we give an algorithm that runs in $\tilde{O}(n^{2.11})$ time, improving on an $\tilde{O}(n^{2.18})$ method of [Musco et al., ITCS 2018]. Interestingly, previous state-of-the-art algorithms for most of the problems above relied on stochastic iterative methods, like stochastic coordinate and gradient descent. Our work takes a completely different approach, instead leveraging tools from matrix sketching. △ Less

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2404.13757 [pdf, ps, other]

Sublinear Time Low-Rank Approximation of Toeplitz Matrices

Authors: Cameron Musco, Kshiteej Sheth

Abstract: We present a sublinear time algorithm for computing a near optimal low-rank approximation to any positive semidefinite (PSD) Toeplitz matrix $T\in \mathbb{R}^{d\times d}$, given noisy access to its entries. In particular, given entrywise query access to $T+E$ for an arbitrary noise matrix $E\in \mathbb{R}^{d\times d}$, integer rank $k\leq d$, and error parameter $δ>0$, our algorithm runs in time… ▽ More We present a sublinear time algorithm for computing a near optimal low-rank approximation to any positive semidefinite (PSD) Toeplitz matrix $T\in \mathbb{R}^{d\times d}$, given noisy access to its entries. In particular, given entrywise query access to $T+E$ for an arbitrary noise matrix $E\in \mathbb{R}^{d\times d}$, integer rank $k\leq d$, and error parameter $δ>0$, our algorithm runs in time $\text{poly}(k,\log(d/δ))$ and outputs (in factored form) a Toeplitz matrix $\widetilde{T} \in \mathbb{R}^{d \times d}$ with rank $\text{poly}(k,\log(d/δ))$ satisfying, for some fixed constant $C$, \begin{equation*} \|T-\widetilde{T}\|_F \leq C \cdot \max\{\|E\|_F,\|T-T_k\|_F\} + δ\cdot \|T\|_F. \end{equation*} Here $\|\cdot \|_F$ is the Frobenius norm and $T_k$ is the best (not necessarily Toeplitz) rank-$k$ approximation to $T$ in the Frobenius norm, given by projecting $T$ onto its top $k$ eigenvectors. Our result has the following applications. When $E = 0$, we obtain the first sublinear time near-relative-error low-rank approximation algorithm for PSD Toeplitz matrices, resolving the main open problem of Kapralov et al. SODA `23, whose algorithm had sublinear query complexity but exponential runtime. Our algorithm can also be applied to approximate the unknown Toeplitz covariance matrix of a multivariate Gaussian distribution, given sample access to this distribution, resolving an open question of Eldar et al. SODA `20. Our algorithm applies sparse Fourier transform techniques to recover a low-rank Toeplitz matrix using its Fourier structure. Our key technical contribution is the first polynomial time algorithm for \emph{discrete time off-grid} sparse Fourier recovery, which may be of independent interest. △ Less

Submitted 21 April, 2024; originally announced April 2024.

Comments: Published in SODA 2024. Updated proofs

arXiv:2402.09379 [pdf, other]

Fixed-sparsity matrix approximation from matrix-vector products

Authors: Noah Amsel, Tyler Chen, Feyza Duman Keles, Diana Halikias, Cameron Musco, Christopher Musco

Abstract: We study the problem of approximating a matrix $\mathbf{A}$ with a matrix that has a fixed sparsity pattern (e.g., diagonal, banded, etc.), when $\mathbf{A}$ is accessed only by matrix-vector products. We describe a simple randomized algorithm that returns an approximation with the given sparsity pattern with Frobenius-norm error at most $(1+\varepsilon)$ times the best possible error. When each r… ▽ More We study the problem of approximating a matrix $\mathbf{A}$ with a matrix that has a fixed sparsity pattern (e.g., diagonal, banded, etc.), when $\mathbf{A}$ is accessed only by matrix-vector products. We describe a simple randomized algorithm that returns an approximation with the given sparsity pattern with Frobenius-norm error at most $(1+\varepsilon)$ times the best possible error. When each row of the desired sparsity pattern has at most $s$ nonzero entries, this algorithm requires $O(s/\varepsilon)$ non-adaptive matrix-vector products with $\mathbf{A}$. We also prove a matching lower-bound, showing that, for any sparsity pattern with $Θ(s)$ nonzeros per row and column, any algorithm achieving $(1+ε)$ approximation requires $Ω(s/\varepsilon)$ matrix-vector products in the worst case. We thus resolve the matrix-vector product query complexity of the problem up to constant factors, even for the well-studied case of diagonal approximation, for which no previous lower bounds were known. △ Less

Submitted 26 March, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

arXiv:2312.11712 [pdf, other]

A Simple and Practical Method for Reducing the Disparate Impact of Differential Privacy

Authors: Lucas Rosenblatt, Julia Stoyanovich, Christopher Musco

Abstract: Differentially private (DP) mechanisms have been deployed in a variety of high-impact social settings (perhaps most notably by the U.S. Census). Since all DP mechanisms involve adding noise to results of statistical queries, they are expected to impact our ability to accurately analyze and learn from data, in effect trading off privacy with utility. Alarmingly, the impact of DP on utility can vary… ▽ More Differentially private (DP) mechanisms have been deployed in a variety of high-impact social settings (perhaps most notably by the U.S. Census). Since all DP mechanisms involve adding noise to results of statistical queries, they are expected to impact our ability to accurately analyze and learn from data, in effect trading off privacy with utility. Alarmingly, the impact of DP on utility can vary significantly among different sub-populations. A simple way to reduce this disparity is with stratification. First compute an independent private estimate for each group in the data set (which may be the intersection of several protected classes), then, to compute estimates of global statistics, appropriately recombine these group estimates. Our main observation is that naive stratification often yields high-accuracy estimates of population-level statistics, without the need for additional privacy budget. We support this observation theoretically and empirically. Our theoretical results center on the private mean estimation problem, while our empirical results center on extensive experiments on private data synthesis to demonstrate the effectiveness of stratification on a variety of private mechanisms. Overall, we argue that this straightforward approach provides a strong baseline against which future work on reducing utility disparities of DP mechanisms should be compared. △ Less

Submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.03691 [pdf, other]

On the Role of Edge Dependency in Graph Generative Models

Authors: Sudhanshu Chanpuriya, Cameron Musco, Konstantinos Sotiropoulos, Charalampos Tsourakakis

Abstract: In this work, we introduce a novel evaluation framework for generative models of graphs, emphasizing the importance of model-generated graph overlap (Chanpuriya et al., 2021) to ensure both accuracy and edge-diversity. We delineate a hierarchy of graph generative models categorized into three levels of complexity: edge independent, node independent, and fully dependent models. This hierarchy encap… ▽ More In this work, we introduce a novel evaluation framework for generative models of graphs, emphasizing the importance of model-generated graph overlap (Chanpuriya et al., 2021) to ensure both accuracy and edge-diversity. We delineate a hierarchy of graph generative models categorized into three levels of complexity: edge independent, node independent, and fully dependent models. This hierarchy encapsulates a wide range of prevalent methods. We derive theoretical bounds on the number of triangles and other short-length cycles producible by each level of the hierarchy, contingent on the model overlap. We provide instances demonstrating the asymptotic optimality of our bounds. Furthermore, we introduce new generative models for each of the three hierarchical levels, leveraging dense subgraph discovery (Gionis & Tsourakakis, 2015). Our evaluation, conducted on real-world datasets, focuses on assessing the output quality and overlap of our proposed models in comparison to other popular models. Our results indicate that our simple, interpretable models provide competitive baselines to popular generative models. Through this investigation, we aim to propel the advancement of graph generative models by offering a structured framework and robust evaluation metrics, thereby facilitating the development of models capable of generating accurate and edge-diverse graphs. △ Less

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2311.14023 [pdf, ps, other]

Algorithm-agnostic low-rank approximation of operator monotone matrix functions

Authors: David Persson, Raphael A. Meyer, Christopher Musco

Abstract: Low-rank approximation of a matrix function, $f(A)$, is an important task in computational mathematics. Most methods require direct access to $f(A)$, which is often considerably more expensive than accessing $A$. Persson and Kressner (SIMAX 2023) avoid this issue for symmetric positive semidefinite matrices by proposing funNyström, which first constructs a Nyström approximation to $A$ using subspa… ▽ More Low-rank approximation of a matrix function, $f(A)$, is an important task in computational mathematics. Most methods require direct access to $f(A)$, which is often considerably more expensive than accessing $A$. Persson and Kressner (SIMAX 2023) avoid this issue for symmetric positive semidefinite matrices by proposing funNyström, which first constructs a Nyström approximation to $A$ using subspace iteration, and then uses the approximation to directly obtain a low-rank approximation for $f(A)$. They prove that the method yields a near-optimal approximation whenever $f$ is a continuous operator monotone function with $f(0) = 0$. We significantly generalize the results of Persson and Kressner beyond subspace iteration. We show that if $\widehat{A}$ is a near-optimal low-rank Nyström approximation to $A$ then $f(\widehat{A})$ is a near-optimal low-rank approximation to $f(A)$, independently of how $\widehat{A}$ is computed. Further, we show sufficient conditions for a basis $Q$ to produce a near-optimal Nyström approximation $\widehat{A} = AQ(Q^T AQ)^{\dagger} Q^T A$. We use these results to establish that many common low-rank approximation methods produce near-optimal Nyström approximations to $A$ and therefore to $f(A)$. △ Less

Submitted 23 November, 2023; originally announced November 2023.

MSC Class: 65F15; 65F55; 65F60; 68W25

arXiv:2310.18265 [pdf, other]

Structured Semidefinite Programming for Recovering Structured Preconditioners

Authors: Arun Jambulapati, Jerry Li, Christopher Musco, Kirankumar Shiragur, Aaron Sidford, Kevin Tian

Abstract: We develop a general framework for finding approximately-optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems including the following. We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, com… ▽ More We develop a general framework for finding approximately-optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems including the following. We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $ε$-optimal diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot \mathrm{poly}(κ^\star,ε^{-1}))$, where $κ^\star$ is the optimal condition number of the rescaled matrix. We give an algorithm which, given $\mathbf{M} \in \mathbb{R}^{d \times d}$ that is either the pseudoinverse of a graph Laplacian matrix or a constant spectral approximation of one, solves linear systems in $\mathbf{M}$ in $\widetilde{O}(d^2)$ time. Our diagonal preconditioning results improve state-of-the-art runtimes of $Ω(d^{3.5})$ attained by general-purpose semidefinite programming, and our solvers improve state-of-the-art runtimes of $Ω(d^ω)$ where $ω> 2.3$ is the current matrix multiplication constant. We attain our results via new algorithms for a class of semidefinite programs (SDPs) we call matrix-dictionary approximation SDPs, which we leverage to solve an associated problem we call matrix-dictionary recovery. △ Less

Submitted 27 October, 2023; originally announced October 2023.

Comments: Merge of arXiv:1812.06295 and arXiv:2008.01722

arXiv:2310.04966 [pdf, other]

Improved Active Learning via Dependent Leverage Score Sampling

Authors: Atsushi Shimizu, Xiaoou Cheng, Christopher Musco, Jonathan Weare

Abstract: We show how to obtain improved active learning methods in the agnostic (adversarial noise) setting by combining marginal leverage score sampling with non-independent sampling strategies that promote spatial coverage. In particular, we propose an easily implemented method based on the \emph{pivotal sampling algorithm}, which we test on problems motivated by learning-based methods for parametric PDE… ▽ More We show how to obtain improved active learning methods in the agnostic (adversarial noise) setting by combining marginal leverage score sampling with non-independent sampling strategies that promote spatial coverage. In particular, we propose an easily implemented method based on the \emph{pivotal sampling algorithm}, which we test on problems motivated by learning-based methods for parametric PDEs and uncertainty quantification. In comparison to independent sampling, our method reduces the number of samples needed to reach a given target accuracy by up to $50\%$. We support our findings with two theoretical results. First, we show that any non-independent leverage score sampling method that obeys a weak \emph{one-sided $\ell_{\infty}$ independence condition} (which includes pivotal sampling) can actively learn $d$ dimensional linear functions with $O(d\log d)$ samples, matching independent sampling. This result extends recent work on matrix Chernoff bounds under $\ell_{\infty}$ independence, and may be of interest for analyzing other sampling strategies beyond pivotal sampling. Second, we show that, for the important case of polynomial regression, our pivotal method obtains an improved bound on $O(d)$ samples. △ Less

Submitted 4 May, 2024; v1 submitted 7 October, 2023; originally announced October 2023.

Comments: To appear at ICLR 2024

arXiv:2309.16157 [pdf, other]

Sampling Methods for Inner Product Sketching

Authors: Majid Daliri, Juliana Freire, Christopher Musco, Aécio Santos, Haoxiang Zhang

Abstract: Recently, Bessa et al. (PODS 2023) showed that sketches based on coordinated weighted sampling theoretically and empirically outperform popular linear sketching methods like Johnson-Lindentrauss projection and CountSketch for the ubiquitous problem of inner product estimation. We further develop this finding by introducing and analyzing two alternative sampling-based methods. In contrast to the co… ▽ More Recently, Bessa et al. (PODS 2023) showed that sketches based on coordinated weighted sampling theoretically and empirically outperform popular linear sketching methods like Johnson-Lindentrauss projection and CountSketch for the ubiquitous problem of inner product estimation. We further develop this finding by introducing and analyzing two alternative sampling-based methods. In contrast to the computationally expensive algorithm in Bessa et al., our methods run in linear time (to compute the sketch) and perform better in practice, significantly beating linear sketching on a variety of tasks. For example, they provide state-of-the-art results for estimating the correlation between columns in unjoined tables, a problem that we show how to reduce to inner product estimation in a black-box way. While based on known sampling techniques (threshold and priority sampling) we introduce significant new theoretical analysis to prove approximation guarantees for our methods. △ Less

Submitted 15 January, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

Comments: 17 pages, 10 figures

arXiv:2308.06448 [pdf, other]

Latent Random Steps as Relaxations of Max-Cut, Min-Cut, and More

Authors: Sudhanshu Chanpuriya, Cameron Musco

Abstract: Algorithms for node clustering typically focus on finding homophilous structure in graphs. That is, they find sets of similar nodes with many edges within, rather than across, the clusters. However, graphs often also exhibit heterophilous structure, as exemplified by (nearly) bipartite and tripartite graphs, where most edges occur across the clusters. Grappling with such structure is typically lef… ▽ More Algorithms for node clustering typically focus on finding homophilous structure in graphs. That is, they find sets of similar nodes with many edges within, rather than across, the clusters. However, graphs often also exhibit heterophilous structure, as exemplified by (nearly) bipartite and tripartite graphs, where most edges occur across the clusters. Grappling with such structure is typically left to the task of graph simplification. We present a probabilistic model based on non-negative matrix factorization which unifies clustering and simplification, and provides a framework for modeling arbitrary graph structure. Our model is based on factorizing the process of taking a random walk on the graph. It permits an unconstrained parametrization, allowing for optimization via simple gradient descent. By relaxing the hard clustering to a soft clustering, our algorithm relaxes potentially hard clustering problems to a tractable ones. We illustrate our algorithm's capabilities on a synthetic graph, as well as simple unsupervised learning tasks involving bipartite and tripartite clustering of orthographic and phonological data. △ Less

Submitted 11 August, 2023; originally announced August 2023.

arXiv:2308.05907 [pdf, ps, other]

Simple Analysis of Priority Sampling

Authors: Majid Daliri, Juliana Freire, Christopher Musco, Aécio Santos, Haoxiang Zhang

Abstract: We prove a tight upper bound on the variance of the priority sampling method (aka sequential Poisson sampling). Our proof is significantly shorter and simpler than the original proof given by Mario Szegedy at STOC 2006, which resolved a conjecture by Duffield, Lund, and Thorup. We prove a tight upper bound on the variance of the priority sampling method (aka sequential Poisson sampling). Our proof is significantly shorter and simpler than the original proof given by Mario Szegedy at STOC 2006, which resolved a conjecture by Duffield, Lund, and Thorup. △ Less

Submitted 10 August, 2023; originally announced August 2023.

Comments: 6 pages

arXiv:2307.00474 [pdf, other]

Moments, Random Walks, and Limits for Spectrum Approximation

Authors: Yujia **, Christopher Musco, Aaron Sidford, Apoorv Vikram Singh

Abstract: We study lower bounds for the problem of approximating a one dimensional distribution given (noisy) measurements of its moments. We show that there are distributions on $[-1,1]$ that cannot be approximated to accuracy $ε$ in Wasserstein-1 distance even if we know \emph{all} of their moments to multiplicative accuracy $(1\pm2^{-Ω(1/ε)})$; this result matches an upper bound of Kong and Valiant [Anna… ▽ More We study lower bounds for the problem of approximating a one dimensional distribution given (noisy) measurements of its moments. We show that there are distributions on $[-1,1]$ that cannot be approximated to accuracy $ε$ in Wasserstein-1 distance even if we know \emph{all} of their moments to multiplicative accuracy $(1\pm2^{-Ω(1/ε)})$; this result matches an upper bound of Kong and Valiant [Annals of Statistics, 2017]. To obtain our result, we provide a hard instance involving distributions induced by the eigenvalue spectra of carefully constructed graph adjacency matrices. Efficiently approximating such spectra in Wasserstein-1 distance is a well-studied algorithmic problem, and a recent result of Cohen-Steiner et al. [KDD 2018] gives a method based on accurately approximating spectral moments using $2^{O(1/ε)}$ random walks initiated at uniformly random nodes in the graph. As a strengthening of our main result, we show that improving the dependence on $1/ε$ in this result would require a new algorithmic approach. Specifically, no algorithm can compute an $ε$-accurate approximation to the spectrum of a normalized graph adjacency matrix with constant probability, even when given the transcript of $2^{Ω(1/ε)}$ random walks of length $2^{Ω(1/ε)}$ started at random nodes. △ Less

Submitted 2 July, 2023; originally announced July 2023.

arXiv:2305.18755 [pdf, other]

Dimensionality Reduction for General KDE Mode Finding

Authors: Xinyu Luo, Christopher Musco, Cas Widdershoven

Abstract: Finding the mode of a high dimensional probability distribution $D$ is a fundamental algorithmic problem in statistics and data analysis. There has been particular interest in efficient methods for solving the problem when $D$ is represented as a mixture model or kernel density estimate, although few algorithmic results with worst-case approximation and runtime guarantees are known. In this work,… ▽ More Finding the mode of a high dimensional probability distribution $D$ is a fundamental algorithmic problem in statistics and data analysis. There has been particular interest in efficient methods for solving the problem when $D$ is represented as a mixture model or kernel density estimate, although few algorithmic results with worst-case approximation and runtime guarantees are known. In this work, we significantly generalize a result of (LeeLiMusco:2021) on mode approximation for Gaussian mixture models. We develop randomized dimensionality reduction methods for mixtures involving a broader class of kernels, including the popular logistic, sigmoid, and generalized Gaussian kernels. As in Lee et al.'s work, our dimensionality reduction results yield quasi-polynomial algorithms for mode finding with multiplicative accuracy $(1-ε)$ for any $ε> 0$. Moreover, when combined with gradient descent, they yield efficient practical heuristics for the problem. In addition to our positive results, we prove a hardness result for box kernels, showing that there is no polynomial time algorithm for finding the mode of a kernel density estimate, unless $\mathit{P} = \mathit{NP}$. Obtaining similar hardness results for kernels used in practice (like Gaussian or logistic kernels) is an interesting future direction. △ Less

Submitted 1 June, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

Comments: Full version of a paper published at ICML'23

arXiv:2305.14451 [pdf, other]

Kernel Interpolation with Sparse Grids

Authors: Mohit Yadav, Daniel Sheldon, Cameron Musco

Abstract: Structured kernel interpolation (SKI) accelerates Gaussian process (GP) inference by interpolating the kernel covariance function using a dense grid of inducing points, whose corresponding kernel matrix is highly structured and thus amenable to fast linear algebra. Unfortunately, SKI scales poorly in the dimension of the input points, since the dense grid size grows exponentially with the dimensio… ▽ More Structured kernel interpolation (SKI) accelerates Gaussian process (GP) inference by interpolating the kernel covariance function using a dense grid of inducing points, whose corresponding kernel matrix is highly structured and thus amenable to fast linear algebra. Unfortunately, SKI scales poorly in the dimension of the input points, since the dense grid size grows exponentially with the dimension. To mitigate this issue, we propose the use of sparse grids within the SKI framework. These grids enable accurate interpolation, but with a number of points growing more slowly with dimension. We contribute a novel nearly linear time matrix-vector multiplication algorithm for the sparse grid kernel matrix. Next, we describe how sparse grids can be combined with an efficient interpolation scheme based on simplices. With these changes, we demonstrate that SKI can be scaled to higher dimensions while maintaining accuracy. △ Less

Submitted 23 May, 2023; originally announced May 2023.

Comments: Accepted at Neural Information Processing Systems (NeurIPS) 2022

arXiv:2305.05826 [pdf, ps, other]

Universal Matrix Sparsifiers and Fast Deterministic Algorithms for Linear Algebra

Authors: Rajarshi Bhattacharjee, Gregory Dexter, Cameron Musco, Archan Ray, Sushant Sachdeva, David P Woodruff

Abstract: Let $\mathbf S \in \mathbb R^{n \times n}$ satisfy $\|\mathbf 1-\mathbf S\|_2\leεn$, where $\mathbf 1$ is the all ones matrix and $\|\cdot\|_2$ is the spectral norm. It is well-known that there exists such an $\mathbf S$ with just $O(n/ε^2)$ non-zero entries: we can let $\mathbf S$ be the scaled adjacency matrix of a Ramanujan expander graph. We show that such an $\mathbf S$ yields a $universal$… ▽ More Let $\mathbf S \in \mathbb R^{n \times n}$ satisfy $\|\mathbf 1-\mathbf S\|_2\leεn$, where $\mathbf 1$ is the all ones matrix and $\|\cdot\|_2$ is the spectral norm. It is well-known that there exists such an $\mathbf S$ with just $O(n/ε^2)$ non-zero entries: we can let $\mathbf S$ be the scaled adjacency matrix of a Ramanujan expander graph. We show that such an $\mathbf S$ yields a $universal$ $sparsifier$ for any positive semidefinite (PSD) matrix. In particular, for any PSD $\mathbf A \in \mathbb{R}^{n\times n}$ with entries bounded in magnitude by $1$, $\|\mathbf A - \mathbf A\circ\mathbf S\|_2 \le εn$, where $\circ$ denotes the entrywise (Hadamard) product. Our techniques also give universal sparsifiers for non-PSD matrices. In this case, letting $\mathbf S$ be the scaled adjacency matrix of a Ramanujan graph with $\tilde O(n/ε^4)$ edges, we have $\|\mathbf A - \mathbf A \circ \mathbf S \|_2 \le ε\cdot \max(n,\|\mathbf A\|_1)$, where $\|\mathbf A\|_1$ is the nuclear norm. We show that the above bounds for both PSD and non-PSD matrices are tight up to log factors. Since $\mathbf A \circ \mathbf S$ can be constructed deterministically, our result for PSD matrices derandomizes and improves upon known results for randomized matrix sparsification, which require randomly sampling ${O}(\frac{n \log n}{ε^2})$ entries. We also leverage our results to give the first deterministic algorithms for several problems related to singular value approximation that run in faster than matrix multiplication time. Finally, if $\mathbf A \in \{-1,0,1\}^{n \times n}$ is PSD, we show that $\mathbf{\tilde A}$ with $\|\mathbf A - \mathbf{\tilde A}\|_2 \le εn$ can be obtained by deterministically reading $\tilde O(n/ε)$ entries of $\mathbf A$. This improves the $1/ε$ dependence on our result for general PSD matrices and is near-optimal. △ Less

Submitted 12 January, 2024; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: 41 pages

ACM Class: F.2.1; G.1.3; G.1.2; G.4; I.1.2

arXiv:2305.02535 [pdf, other]

On the Unreasonable Effectiveness of Single Vector Krylov Methods for Low-Rank Approximation

Authors: Raphael A. Meyer, Cameron Musco, Christopher Musco

Abstract: Krylov subspace methods are a ubiquitous tool for computing near-optimal rank $k$ approximations of large matrices. While "large block" Krylov methods with block size at least $k$ give the best known theoretical guarantees, block size one (a single vector) or a small constant is often preferred in practice. Despite their popularity, we lack theoretical bounds on the performance of such "small bloc… ▽ More Krylov subspace methods are a ubiquitous tool for computing near-optimal rank $k$ approximations of large matrices. While "large block" Krylov methods with block size at least $k$ give the best known theoretical guarantees, block size one (a single vector) or a small constant is often preferred in practice. Despite their popularity, we lack theoretical bounds on the performance of such "small block" Krylov methods for low-rank approximation. We address this gap between theory and practice by proving that small block Krylov methods essentially match all known low-rank approximation guarantees for large block methods. Via a black-box reduction we show, for example, that the standard single vector Krylov method run for $t$ iterations obtains the same spectral norm and Frobenius norm error bounds as a Krylov method with block size $\ell \geq k$ run for $O(t/\ell)$ iterations, up to a logarithmic dependence on the smallest gap between sequential singular values. That is, for a given number of matrix-vector products, single vector methods are essentially as effective as any choice of large block size. By combining our result with tail-bounds on eigenvalue gaps in random matrices, we prove that the dependence on the smallest singular value gap can be eliminated if the input matrix is perturbed by a small random matrix. Further, we show that single vector methods match the more complex algorithm of [Bakshi et al. `22], which combines the results of multiple block sizes to achieve an improved algorithm for Schatten $p$-norm low-rank approximation. △ Less

Submitted 6 November, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

Comments: 41 pages, 7 figures. To appear at SODA 2024

MSC Class: 65F55 (Primary) 65F15 (Secondary) ACM Class: G.1.3; F.2.1

arXiv:2304.02261 [pdf, other]

Optimal Sketching Bounds for Sparse Linear Regression

Authors: Tung Mai, Alexander Munteanu, Cameron Musco, Anup B. Rao, Chris Schwiegelshohn, David P. Woodruff

Abstract: We study oblivious sketching for $k$-sparse linear regression under various loss functions such as an $\ell_p$ norm, or from a broad class of hinge-like loss functions, which includes the logistic and ReLU losses. We show that for sparse $\ell_2$ norm regression, there is a distribution over oblivious sketches with $Θ(k\log(d/k)/\varepsilon^2)$ rows, which is tight up to a constant factor. This ex… ▽ More We study oblivious sketching for $k$-sparse linear regression under various loss functions such as an $\ell_p$ norm, or from a broad class of hinge-like loss functions, which includes the logistic and ReLU losses. We show that for sparse $\ell_2$ norm regression, there is a distribution over oblivious sketches with $Θ(k\log(d/k)/\varepsilon^2)$ rows, which is tight up to a constant factor. This extends to $\ell_p$ loss with an additional additive $O(k\log(k/\varepsilon)/\varepsilon^2)$ term in the upper bound. This establishes a surprising separation from the related sparse recovery problem, which is an important special case of sparse regression. For this problem, under the $\ell_2$ norm, we observe an upper bound of $O(k \log (d)/\varepsilon + k\log(k/\varepsilon)/\varepsilon^2)$ rows, showing that sparse recovery is strictly easier to sketch than sparse regression. For sparse regression under hinge-like loss functions including sparse logistic and sparse ReLU regression, we give the first known sketching bounds that achieve $o(d)$ rows showing that $O(μ^2 k\log(μn d/\varepsilon)/\varepsilon^2)$ rows suffice, where $μ$ is a natural complexity parameter needed to obtain relative error bounds for these loss functions. We again show that this dimension is tight, up to lower order terms and the dependence on $μ$. Finally, we show that similar sketching bounds can be achieved for LASSO regression, a popular convex relaxation of sparse regression, where one aims to minimize $\|Ax-b\|_2^2+λ\|x\|_1$ over $x\in\mathbb{R}^d$. We show that sketching dimension $O(\log(d)/(λ\varepsilon)^2)$ suffices and that the dependence on $d$ and $λ$ is tight. △ Less

Submitted 5 April, 2023; originally announced April 2023.

Comments: AISTATS 2023

arXiv:2303.06396 [pdf, other]

No-regret Algorithms for Fair Resource Allocation

Authors: Abhishek Sinha, Ativ Joshi, Rajarshi Bhattacharjee, Cameron Musco, Mohammad Hajiesmaili

Abstract: We consider a fair resource allocation problem in the no-regret setting against an unrestricted adversary. The objective is to allocate resources equitably among several agents in an online fashion so that the difference of the aggregate $α$-fair utilities of the agents between an optimal static clairvoyant allocation and that of the online policy grows sub-linearly with time. The problem is chall… ▽ More We consider a fair resource allocation problem in the no-regret setting against an unrestricted adversary. The objective is to allocate resources equitably among several agents in an online fashion so that the difference of the aggregate $α$-fair utilities of the agents between an optimal static clairvoyant allocation and that of the online policy grows sub-linearly with time. The problem is challenging due to the non-additive nature of the $α$-fairness function. Previously, it was shown that no online policy can exist for this problem with a sublinear standard regret. In this paper, we propose an efficient online resource allocation policy, called Online Proportional Fair (OPF), that achieves $c_α$-approximate sublinear regret with the approximation factor $c_α=(1-α)^{-(1-α)}\leq 1.445,$ for $0\leq α< 1$. The upper bound to the $c_α$-regret for this problem exhibits a surprising phase transition phenomenon. The regret bound changes from a power-law to a constant at the critical exponent $α=\frac{1}{2}.$ As a corollary, our result also resolves an open problem raised by Even-Dar et al. [2009] on designing an efficient no-regret policy for the online job scheduling problem in certain parameter regimes. The proof of our results introduces new algorithmic and analytical techniques, including greedy estimation of the future gradients for non-additive global reward functions and bootstrap** adaptive regret bounds, which may be of independent interest. △ Less

Submitted 11 March, 2023; originally announced March 2023.

arXiv:2301.05811 [pdf, other]

Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation

Authors: Aline Bessa, Majid Daliri, Juliana Freire, Cameron Musco, Christopher Musco, Aécio Santos, Haoxiang Zhang

Abstract: We present a new approach for computing compact sketches that can be used to approximate the inner product between pairs of high-dimensional vectors. Based on the Weighted MinHash algorithm, our approach admits strong accuracy guarantees that improve on the guarantees of popular linear sketching approaches for inner product estimation, such as CountSketch and Johnson-Lindenstrauss projection. Spec… ▽ More We present a new approach for computing compact sketches that can be used to approximate the inner product between pairs of high-dimensional vectors. Based on the Weighted MinHash algorithm, our approach admits strong accuracy guarantees that improve on the guarantees of popular linear sketching approaches for inner product estimation, such as CountSketch and Johnson-Lindenstrauss projection. Specifically, while our method admits guarantees that exactly match linear sketching for dense vectors, it yields significantly lower error for sparse vectors with limited overlap between non-zero entries. Such vectors arise in many applications involving sparse data. They are also important in increasingly popular dataset search applications, where inner product sketches are used to estimate data covariance, conditional means, and other quantities involving columns in unjoined tables. We complement our theoretical results by showing that our approach empirically outperforms existing linear sketches and unweighted hashing-based sketches for sparse vectors. △ Less

Submitted 5 May, 2023; v1 submitted 13 January, 2023; originally announced January 2023.

Comments: 23 pages, 6 figures

Journal ref: In Proceedings of the ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS) 2023

arXiv:2211.11328 [pdf, ps, other]

Toeplitz Low-Rank Approximation with Sublinear Query Complexity

Authors: Michael Kapralov, Hannah Lawrence, Mikhail Makarov, Cameron Musco, Kshiteej Sheth

Abstract: We present a sublinear query algorithm for outputting a near-optimal low-rank approximation to any positive semidefinite Toeplitz matrix $T \in \mathbb{R}^{d \times d}$. In particular, for any integer rank $k \leq d$ and $ε,δ> 0$, our algorithm makes $\tilde{O} \left (k^2 \cdot \log(1/δ) \cdot \text{poly}(1/ε) \right )$ queries to the entries of $T$ and outputs a rank… ▽ More We present a sublinear query algorithm for outputting a near-optimal low-rank approximation to any positive semidefinite Toeplitz matrix $T \in \mathbb{R}^{d \times d}$. In particular, for any integer rank $k \leq d$ and $ε,δ> 0$, our algorithm makes $\tilde{O} \left (k^2 \cdot \log(1/δ) \cdot \text{poly}(1/ε) \right )$ queries to the entries of $T$ and outputs a rank $\tilde{O} \left (k \cdot \log(1/δ)/ε\right )$ matrix $\tilde{T} \in \mathbb{R}^{d \times d}$ such that $\| T - \tilde{T}\|_F \leq (1+ε) \cdot \|T-T_k\|_F + δ\|T\|_F$. Here, $\|\cdot\|_F$ is the Frobenius norm and $T_k$ is the optimal rank-$k$ approximation to $T$, given by projection onto its top $k$ eigenvectors. $\tilde{O}(\cdot)$ hides $\text{polylog}(d) $ factors. Our algorithm is \emph{structure-preserving}, in that the approximation $\tilde{T}$ is also Toeplitz. A key technical contribution is a proof that any positive semidefinite Toeplitz matrix in fact has a near-optimal low-rank approximation which is itself Toeplitz. Surprisingly, this basic existence result was not previously known. Building on this result, along with the well-established off-grid Fourier structure of Toeplitz matrices [Cybenko'82], we show that Toeplitz $\tilde{T}$ with near optimal error can be recovered with a small number of random queries via a leverage-score-based off-grid sparse Fourier sampling scheme. △ Less

Submitted 21 November, 2022; originally announced November 2022.

Comments: Accepted in SODA 2023

arXiv:2211.06790 [pdf, other]

Near-Linear Sample Complexity for $L_p$ Polynomial Regression

Authors: Raphael A. Meyer, Cameron Musco, Christopher Musco, David P. Woodruff, Samson Zhou

Abstract: We study $L_p$ polynomial regression. Given query access to a function $f:[-1,1] \rightarrow \mathbb{R}$, the goal is to find a degree $d$ polynomial $\hat{q}$ such that, for a given parameter $\varepsilon > 0$, $$ \|\hat{q}-f\|_p\le (1+\varepsilon) \cdot \min_{q:\text{deg}(q)\le d}\|q-f\|_p. $$ Here $\|\cdot\|_p$ is the $L_p$ norm, $\|g\|_p = (\int_{-1}^1 |g(t)|^p dt)^{1/p}$. We show that queryin… ▽ More We study $L_p$ polynomial regression. Given query access to a function $f:[-1,1] \rightarrow \mathbb{R}$, the goal is to find a degree $d$ polynomial $\hat{q}$ such that, for a given parameter $\varepsilon > 0$, $$ \|\hat{q}-f\|_p\le (1+\varepsilon) \cdot \min_{q:\text{deg}(q)\le d}\|q-f\|_p. $$ Here $\|\cdot\|_p$ is the $L_p$ norm, $\|g\|_p = (\int_{-1}^1 |g(t)|^p dt)^{1/p}$. We show that querying $f$ at points randomly drawn from the Chebyshev measure on $[-1,1]$ is a near-optimal strategy for polynomial regression in all $L_p$ norms. In particular, to find $\hat q$, it suffices to sample $O(d\, \frac{\text{polylog}\,d}{\text{poly}\,\varepsilon})$ points from $[-1,1]$ with probabilities proportional to this measure. While the optimal sample complexity for polynomial regression was well understood for $L_2$ and $L_\infty$, our result is the first that achieves sample complexity linear in $d$ and error $(1+\varepsilon)$ for other values of $p$ without any assumptions. Our result requires two main technical contributions. The first concerns $p\leq 2$, for which we provide explicit bounds on the $L_p$ Lewis weight function of the infinite linear operator underlying polynomial regression. Using tools from the orthogonal polynomial literature, we show that this function is bounded by the Chebyshev density. Our second key contribution is to take advantage of the structure of polynomials to reduce the $p>2$ case to the $p\leq 2$ case. By doing so, we obtain a better sample complexity than what is possible for general $p$-norm linear regression problems, for which $Ω(d^{p/2})$ samples are required. △ Less

Submitted 12 November, 2022; originally announced November 2022.

Comments: 68 pages, to be presented at SODA 2023

arXiv:2210.13601 [pdf, other]

Active Learning for Single Neuron Models with Lipschitz Non-Linearities

Authors: Aarshvi Gajjar, Chinmay Hegde, Christopher Musco

Abstract: We consider the problem of active learning for single neuron models, also sometimes called ``ridge functions'', in the agnostic setting (under adversarial label noise). Such models have been shown to be broadly effective in modeling physical phenomena, and for constructing surrogate data-driven models for partial differential equations. Surprisingly, we show that for a single neuron model with a… ▽ More We consider the problem of active learning for single neuron models, also sometimes called ``ridge functions'', in the agnostic setting (under adversarial label noise). Such models have been shown to be broadly effective in modeling physical phenomena, and for constructing surrogate data-driven models for partial differential equations. Surprisingly, we show that for a single neuron model with any Lipschitz non-linearity (such as the ReLU, sigmoid, absolute value, low-degree polynomial, among others), strong provable approximation guarantees can be obtained using a well-known active learning strategy for fitting \emph{linear functions} in the agnostic setting. % -- i.e. for the case when there is no non-linearity. Namely, we can collect samples via statistical \emph{leverage score sampling}, which has been shown to be near-optimal in other active learning scenarios. We support our theoretical results with empirical simulations showing that our proposed active learning strategy based on leverage score sampling outperforms (ordinary) uniform sampling when fitting single neuron models. △ Less

Submitted 18 July, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

Comments: Inadvertently submitting an incorrect writeup that does not align with the intended content

arXiv:2210.06594 [pdf, other]

Sample Constrained Treatment Effect Estimation

Authors: Raghavendra Addanki, David Arbour, Tung Mai, Cameron Musco, Anup Rao

Abstract: Treatment effect estimation is a fundamental problem in causal inference. We focus on designing efficient randomized controlled trials, to accurately estimate the effect of some treatment on a population of $n$ individuals. In particular, we study sample-constrained treatment effect estimation, where we must select a subset of $s \ll n$ individuals from the population to experiment on. This subset… ▽ More Treatment effect estimation is a fundamental problem in causal inference. We focus on designing efficient randomized controlled trials, to accurately estimate the effect of some treatment on a population of $n$ individuals. In particular, we study sample-constrained treatment effect estimation, where we must select a subset of $s \ll n$ individuals from the population to experiment on. This subset must be further partitioned into treatment and control groups. Algorithms for partitioning the entire population into treatment and control groups, or for choosing a single representative subset, have been well-studied. The key challenge in our setting is jointly choosing a representative subset and a partition for that set. We focus on both individual and average treatment effect estimation, under a linear effects model. We give provably efficient experimental designs and corresponding estimators, by identifying connections to discrepancy minimization and leverage-score-based sampling used in randomized numerical linear algebra. Our theoretical results obtain a smooth transition to known guarantees when $s$ equals the population size. We also empirically demonstrate the performance of our algorithms. △ Less

Submitted 12 October, 2022; originally announced October 2022.

Comments: Conference on Neural Information Processing Systems (NeurIPS) 2022

arXiv:2210.00032 [pdf, other]

Direct Embedding of Temporal Network Edges via Time-Decayed Line Graphs

Authors: Sudhanshu Chanpuriya, Ryan A. Rossi, Sungchul Kim, Tong Yu, Jane Hoffswell, Nedim Lipka, Shunan Guo, Cameron Musco

Abstract: Temporal networks model a variety of important phenomena involving timed interactions between entities. Existing methods for machine learning on temporal networks generally exhibit at least one of two limitations. First, time is assumed to be discretized, so if the time data is continuous, the user must determine the discretization and discard precise time information. Second, edge representations… ▽ More Temporal networks model a variety of important phenomena involving timed interactions between entities. Existing methods for machine learning on temporal networks generally exhibit at least one of two limitations. First, time is assumed to be discretized, so if the time data is continuous, the user must determine the discretization and discard precise time information. Second, edge representations can only be calculated indirectly from the nodes, which may be suboptimal for tasks like edge classification. We present a simple method that avoids both shortcomings: construct the line graph of the network, which includes a node for each interaction, and weigh the edges of this graph based on the difference in time between interactions. From this derived graph, edge representations for the original network can be computed with efficient classical methods. The simplicity of this approach facilitates explicit theoretical analysis: we can constructively show the effectiveness of our method's representations for a natural synthetic model of temporal networks. Empirical results on real-world networks demonstrate our method's efficacy and efficiency on both edge classification and temporal link prediction. △ Less

Submitted 30 September, 2022; originally announced October 2022.

arXiv:2208.03268 [pdf, other]

A Tight Analysis of Hutchinson's Diagonal Estimator

Authors: Prathamesh Dharangutte, Christopher Musco

Abstract: Let $\mathbf{A}\in \mathbb{R}^{n\times n}$ be a matrix with diagonal $\text{diag}(\mathbf{A})$ and let $\bar{\mathbf{A}}$ be $\mathbf{A}$ with its diagonal set to all zeros. We show that Hutchinson's estimator run for $m$ iterations returns a diagonal estimate $\tilde{d}\in \mathbb{R}^n$ such that with probability $(1-δ)$,… ▽ More Let $\mathbf{A}\in \mathbb{R}^{n\times n}$ be a matrix with diagonal $\text{diag}(\mathbf{A})$ and let $\bar{\mathbf{A}}$ be $\mathbf{A}$ with its diagonal set to all zeros. We show that Hutchinson's estimator run for $m$ iterations returns a diagonal estimate $\tilde{d}\in \mathbb{R}^n$ such that with probability $(1-δ)$, $$\|\tilde{d} - \text{diag}(\mathbf{A})\|_2 \leq c\sqrt{\frac{\log(2/δ)}{m}}\|\bar{\mathbf{A}}\|_F,$$ where $c$ is a fixed constant independent of all other parameters. This results improves on a recent result of [Baston and Nakatsukasa, 2022] by a $\log(n)$ factor, yielding a bound that is independent of the matrix dimension $n$. △ Less

Submitted 6 November, 2022; v1 submitted 5 August, 2022; originally announced August 2022.

Comments: To appear in SIAM Symposium on Simplicity in Algorithms (SOSA23)

arXiv:2207.02817 [pdf, ps, other]

Non-Adaptive Edge Counting and Sampling via Bipartite Independent Set Queries

Authors: Raghavendra Addanki, Andrew McGregor, Cameron Musco

Abstract: We study the problem of estimating the number of edges in an $n$-vertex graph, accessed via the Bipartite Independent Set query model introduced by Beame et al. (ITCS '18). In this model, each query returns a Boolean, indicating the existence of at least one edge between two specified sets of nodes. We present a non-adaptive algorithm that returns a $(1\pm ε)$ relative error approximation to the n… ▽ More We study the problem of estimating the number of edges in an $n$-vertex graph, accessed via the Bipartite Independent Set query model introduced by Beame et al. (ITCS '18). In this model, each query returns a Boolean, indicating the existence of at least one edge between two specified sets of nodes. We present a non-adaptive algorithm that returns a $(1\pm ε)$ relative error approximation to the number of edges, with query complexity $\tilde O(ε^{-5}\log^{5} n )$, where $\tilde O(\cdot)$ hides $\textrm{poly}(\log \log n)$ dependencies. This is the first non-adaptive algorithm in this setting achieving $\textrm{poly}(1/ε,\log n)$ query complexity. Prior work requires $Ω(\log^2 n)$ rounds of adaptivity. We avoid this by taking a fundamentally different approach, inspired by work on single-pass streaming algorithms. Moreover, for constant $ε$, our query complexity significantly improves on the best known adaptive algorithm due to Bhattacharya et al. (STACS '22), which requires $O(ε^{-2} \log^{11} n)$ queries. Building on our edge estimation result, we give the first non-adaptive algorithm for outputting a nearly uniformly sampled edge with query complexity $\tilde O(ε^{-6} \log^{6} n)$, improving on the works of Dell et al. (SODA '20) and Bhattacharya et al. (STACS '22), which require $Ω(\log^3 n)$ rounds of adaptivity. Finally, as a consequence of our edge sampling algorithm, we obtain a $\tilde O(n\log^ 8 n)$ query algorithm for connectivity, using two rounds of adaptivity. This improves on a three-round algorithm of Assadi et al. (ESA '21) and is tight; there is no non-adaptive algorithm for connectivity making $o(n^2)$ queries. △ Less

Submitted 6 July, 2022; originally announced July 2022.

Comments: European Symposium on Algorithms (ESA) 2022

arXiv:2203.07557 [pdf, other]

Fast Regression for Structured Inputs

Authors: Raphael A. Meyer, Cameron Musco, Christopher Musco, David P. Woodruff, Samson Zhou

Abstract: We study the $\ell_p$ regression problem, which requires finding $\mathbf{x}\in\mathbb R^{d}$ that minimizes $\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_p$ for a matrix $\mathbf{A}\in\mathbb R^{n \times d}$ and response vector $\mathbf{b}\in\mathbb R^{n}$. There has been recent interest in develo** subsampling methods for this problem that can outperform standard techniques when $n$ is very large. Howe… ▽ More We study the $\ell_p$ regression problem, which requires finding $\mathbf{x}\in\mathbb R^{d}$ that minimizes $\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_p$ for a matrix $\mathbf{A}\in\mathbb R^{n \times d}$ and response vector $\mathbf{b}\in\mathbb R^{n}$. There has been recent interest in develo** subsampling methods for this problem that can outperform standard techniques when $n$ is very large. However, all known subsampling approaches have run time that depends exponentially on $p$, typically, $d^{\mathcal{O}(p)}$, which can be prohibitively expensive. We improve on this work by showing that for a large class of common \emph{structured matrices}, such as combinations of low-rank matrices, sparse matrices, and Vandermonde matrices, there are subsampling based methods for $\ell_p$ regression that depend polynomially on $p$. For example, we give an algorithm for $\ell_p$ regression on Vandermonde matrices that runs in time $\mathcal{O}(n\log^3 n+(dp^2)^{0.5+ω}\cdot\text{polylog}\,n)$, where $ω$ is the exponent of matrix multiplication. The polynomial dependence on $p$ crucially allows our algorithms to extend naturally to efficient algorithms for $\ell_\infty$ regression, via approximation of $\ell_\infty$ by $\ell_{\mathcal{O}(\log n)}$. Of practical interest, we also develop a new subsampling algorithm for $\ell_p$ regression for arbitrary matrices, which is simpler than previous approaches for $p \ge 4$. △ Less

Submitted 14 March, 2022; originally announced March 2022.

arXiv:2202.04139 [pdf, other]

Simplified Graph Convolution with Heterophily

Authors: Sudhanshu Chanpuriya, Cameron Musco

Abstract: Recent work has shown that a simple, fast method called Simple Graph Convolution (SGC) (Wu et al., 2019), which eschews deep learning, is competitive with deep methods like graph convolutional networks (GCNs) (Kipf & Welling, 2017) in common graph machine learning benchmarks. The use of graph data in SGC implicitly assumes the common but not universal graph characteristic of homophily, wherein nod… ▽ More Recent work has shown that a simple, fast method called Simple Graph Convolution (SGC) (Wu et al., 2019), which eschews deep learning, is competitive with deep methods like graph convolutional networks (GCNs) (Kipf & Welling, 2017) in common graph machine learning benchmarks. The use of graph data in SGC implicitly assumes the common but not universal graph characteristic of homophily, wherein nodes link to nodes which are similar. Here we confirm that SGC is indeed ineffective for heterophilous (i.e., non-homophilous) graphs via experiments on synthetic and real-world datasets. We propose Adaptive Simple Graph Convolution (ASGC), which we show can adapt to both homophilous and heterophilous graph structure. Like SGC, ASGC is not a deep model, and hence is fast, scalable, and interpretable; further, we can prove performance guarantees on natural synthetic data models. Empirically, ASGC is often competitive with recent deep models at node classification on a benchmark of real-world datasets. The SGC paper questioned whether the complexity of graph neural networks is warranted for common graph problems involving homophilous networks; our results similarly suggest that, while deep learning often achieves the highest performance, heterophilous structure alone does not necessitate these more involved methods. △ Less

Submitted 3 June, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

arXiv:2112.09631 [pdf, other]

Sublinear Time Approximation of Text Similarity Matrices

Authors: Archan Ray, Nicholas Monath, Andrew McCallum, Cameron Musco

Abstract: We study algorithms for approximating pairwise similarity matrices that arise in natural language processing. Generally, computing a similarity matrix for $n$ data points requires $Ω(n^2)$ similarity computations. This quadratic scaling is a significant bottleneck, especially when similarities are computed via expensive functions, e.g., via transformer models. Approximation methods reduce this qua… ▽ More We study algorithms for approximating pairwise similarity matrices that arise in natural language processing. Generally, computing a similarity matrix for $n$ data points requires $Ω(n^2)$ similarity computations. This quadratic scaling is a significant bottleneck, especially when similarities are computed via expensive functions, e.g., via transformer models. Approximation methods reduce this quadratic complexity, often by using a small subset of exactly computed similarities to approximate the remainder of the complete pairwise similarity matrix. Significant work focuses on the efficient approximation of positive semidefinite (PSD) similarity matrices, which arise e.g., in kernel methods. However, much less is understood about indefinite (non-PSD) similarity matrices, which often arise in NLP. Motivated by the observation that many of these matrices are still somewhat close to PSD, we introduce a generalization of the popular Nyström method to the indefinite setting. Our algorithm can be applied to any similarity matrix and runs in sublinear time in the size of the matrix, producing a rank-$s$ approximation with just $O(ns)$ similarity computations. We show that our method, along with a simple variant of CUR decomposition, performs very well in approximating a variety of similarity matrices arising in NLP tasks. We demonstrate high accuracy of the approximated similarity matrices in the downstream tasks of document classification, sentence similarity, and cross-document coreference. △ Less

Submitted 27 April, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: 25 pages, 10 figures

MSC Class: F.2.1

arXiv:2111.14020 [pdf, other]

doi 10.1145/3539597.3570442

Local Edge Dynamics and Opinion Polarization

Authors: Nikita Bhalla, Adam Lechowicz, Cameron Musco

Abstract: The proliferation of social media platforms, recommender systems, and their joint societal impacts have prompted significant interest in opinion formation and evolution within social networks. We study how local edge dynamics can drive opinion polarization. In particular, we introduce a variant of the classic Friedkin-Johnsen opinion dynamics, augmented with a simple time-evolving network model. E… ▽ More The proliferation of social media platforms, recommender systems, and their joint societal impacts have prompted significant interest in opinion formation and evolution within social networks. We study how local edge dynamics can drive opinion polarization. In particular, we introduce a variant of the classic Friedkin-Johnsen opinion dynamics, augmented with a simple time-evolving network model. Edges are iteratively added or deleted according to simple rules, modeling decisions based on individual preferences and network recommendations. Via simulations on synthetic and real-world graphs, we find that the combined presence of two dynamics gives rise to high polarization: 1) confirmation bias -- i.e., the preference for nodes to connect to other nodes with similar expressed opinions and 2) friend-of-friend link recommendations, which encourage new connections between closely connected nodes. We show that our model is tractable to theoretical analysis, which helps explain how these local dynamics erode connectivity across opinion groups, affecting polarization and a related measure of disagreement across edges. Finally, we validate our model against real-world data, showing that our edge dynamics drive the structure of arbitrary graphs, including random graphs, to more closely resemble real social networks. △ Less

Submitted 8 December, 2022; v1 submitted 27 November, 2021; originally announced November 2021.

Comments: Accepted to WSDM 2023. 14 pages, 30 figures

arXiv:2111.04888 [pdf, ps, other]

Active Linear Regression for $\ell_p$ Norms and Beyond

Authors: Cameron Musco, Christopher Musco, David P. Woodruff, Taisuke Yasuda

Abstract: We study active sampling algorithms for linear regression, which aim to query only a few entries of a target vector $b\in\mathbb R^n$ and output a near minimizer to $\min_{x\in\mathbb R^d} \|Ax-b\|$, for a design matrix $A\in\mathbb R^{n \times d}$ and loss $\|\cdot\|$. For $p$ norm regression for any $0<p<\infty$, we give an algorithm based on Lewis weight sampling outputting a $(1+ε)$-approxim… ▽ More We study active sampling algorithms for linear regression, which aim to query only a few entries of a target vector $b\in\mathbb R^n$ and output a near minimizer to $\min_{x\in\mathbb R^d} \|Ax-b\|$, for a design matrix $A\in\mathbb R^{n \times d}$ and loss $\|\cdot\|$. For $p$ norm regression for any $0<p<\infty$, we give an algorithm based on Lewis weight sampling outputting a $(1+ε)$-approximate solution using just $\tilde O(d/ε^2)$ queries to $b$ for $p\in(0,1)$, $\tilde{O}(d/ε)$ queries for $1<p<2$, and $\tilde{O}(d^{p/2}/ε^p)$ queries for $2<p<\infty$. For $0<p<2$, our bounds are optimal up to log factors, settling the query complexity for this range. For $2<p<\infty$, our dependence on $d$ is optimal, while our dependence on $ε$ is off by at most $ε$, up to log factors. Our result resolves an open question of [CD21], who gave near optimal bounds for the $1$ norm, but required $d^2/ε^2$ samples for $\ell_p$ regression with $1<p<2$, and gave no bounds for $2<p<\infty$ or $0<p<1$. We also give the first total sensitivity bound of $O(d^{\max\{1,p/2\}}\log^2n)$ for loss functions of degree $p$ polynomial growth, improving a result of [TMF20]. By combining this with our techniques for $\ell_p$ regression, we obtain an active regression algorithm making $\tilde O(d^{1+\max\{1,p/2\}}/\mathrm{poly}(ε))$ queries for such loss functions, including the Tukey and Huber losses, answering another question of [CD21]. For the Huber loss, we further improve our bound to $\tilde O(d^{4-2\sqrt2}/\mathrm{poly}(ε))$ samples. Our sensitivity bounds also have many applications, including Orlicz norm subspace embeddings, robust subspace approximation, and dimension reduction for smoothed $p$-norms. Finally, our active sampling results give the first sublinear time algorithms for Kronecker product regression under every $p$ norm. △ Less

Submitted 26 September, 2022; v1 submitted 8 November, 2021; originally announced November 2021.

Comments: Abstract shortened to meet arXiv limits; v2: improved bounds; v3: improved bounds; v4: to appear in FOCS 2022

arXiv:2111.03030 [pdf, other]

Exact Representation of Sparse Networks with Symmetric Nonnegative Embeddings

Authors: Sudhanshu Chanpuriya, Ryan A. Rossi, Anup Rao, Tung Mai, Nedim Lipka, Zhao Song, Cameron Musco

Abstract: Many models for undirected graphs are based on factorizing the graph's adjacency matrix; these models find a vector representation of each node such that the predicted probability of a link between two nodes increases with the similarity (dot product) of their associated vectors. Recent work has shown that these models are unable to capture key structures in real-world graphs, particularly heterop… ▽ More Many models for undirected graphs are based on factorizing the graph's adjacency matrix; these models find a vector representation of each node such that the predicted probability of a link between two nodes increases with the similarity (dot product) of their associated vectors. Recent work has shown that these models are unable to capture key structures in real-world graphs, particularly heterophilous structures, wherein links occur between dissimilar nodes. In contrast, a factorization with two vectors per node, based on logistic principal components analysis (LPCA), has been proven not only to represent such structures, but also to provide exact low-rank factorization of any graph with bounded max degree. However, this bound has limited applicability to real-world networks, which often have power law degree distributions with high max degree. Further, the LPCA model lacks interpretability since its asymmetric factorization does not reflect the undirectedness of the graph. We address these issues in two ways. First, we prove a new bound for the LPCA model in terms of arboricity rather than max degree; this greatly increases the bound's applicability to many sparse real-world networks. Second, we propose an alternative graph model whose factorization is symmetric and nonnegative, which allows for link predictions to be interpreted in terms of node clusters. We show that the bounds for exact representation in the LPCA model extend to our new model. On the empirical side, our model is optimized effectively on real-world graphs with gradient descent on a cross-entropy loss. We demonstrate its effectiveness on a variety of foundational tasks, such as community detection and link prediction. △ Less

Submitted 30 September, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

arXiv:2111.00048 [pdf, other]

On the Power of Edge Independent Graph Models

Authors: Sudhanshu Chanpuriya, Cameron Musco, Konstantinos Sotiropoulos, Charalampos E. Tsourakakis

Abstract: Why do many modern neural-network-based graph generative models fail to reproduce typical real-world network characteristics, such as high triangle density? In this work we study the limitations of edge independent random graph models, in which each edge is added to the graph independently with some probability. Such models include both the classic Erdös-Rényi and stochastic block models, as well… ▽ More Why do many modern neural-network-based graph generative models fail to reproduce typical real-world network characteristics, such as high triangle density? In this work we study the limitations of edge independent random graph models, in which each edge is added to the graph independently with some probability. Such models include both the classic Erdös-Rényi and stochastic block models, as well as modern generative models such as NetGAN, variational graph autoencoders, and CELL. We prove that subject to a bounded overlap condition, which ensures that the model does not simply memorize a single graph, edge independent models are inherently limited in their ability to generate graphs with high triangle and other subgraph densities. Notably, such high densities are known to appear in real-world social networks and other graphs. We complement our negative results with a simple generative model that balances overlap and accuracy, performing comparably to more complex models in reconstructing many graph statistics. △ Less

Submitted 29 October, 2021; originally announced November 2021.

arXiv:2110.13752 [pdf, other]

Dynamic Trace Estimation

Authors: Prathamesh Dharangutte, Christopher Musco

Abstract: We study a dynamic version of the implicit trace estimation problem. Given access to an oracle for computing matrix-vector multiplications with a dynamically changing matrix A, our goal is to maintain an accurate approximation to A's trace using as few multiplications as possible. We present a practical algorithm for solving this problem and prove that, in a natural setting, its complexity is quad… ▽ More We study a dynamic version of the implicit trace estimation problem. Given access to an oracle for computing matrix-vector multiplications with a dynamically changing matrix A, our goal is to maintain an accurate approximation to A's trace using as few multiplications as possible. We present a practical algorithm for solving this problem and prove that, in a natural setting, its complexity is quadratically better than the standard solution of repeatedly applying Hutchinson's stochastic trace estimator. We also provide an improved algorithm assuming slightly stronger assumptions on the dynamic matrix A. We support our theory with empirical results, showing significant computational improvements on three applications in machine learning and network science: tracking moments of the Hessian spectral density during neural network optimization, counting triangles, and estimating natural connectivity in a dynamically changing graph. △ Less

Submitted 26 October, 2021; originally announced October 2021.

Comments: Accepted to NeurIPS 2021

arXiv:2110.11981 [pdf, other]

How to Quantify Polarization in Models of Opinion Dynamics

Authors: Christopher Musco, Indu Ramesh, Johan Ugander, R. Teal Witter

Abstract: It is widely believed that society is becoming increasingly polarized around important issues, a dynamic that does not align with common mathematical models of opinion formation in social networks. In particular, measures of polarization based on opinion variance are known to decrease over time in frameworks such as the popular DeGroot model. Complementing recent work that seeks to resolve this ap… ▽ More It is widely believed that society is becoming increasingly polarized around important issues, a dynamic that does not align with common mathematical models of opinion formation in social networks. In particular, measures of polarization based on opinion variance are known to decrease over time in frameworks such as the popular DeGroot model. Complementing recent work that seeks to resolve this apparent inconsistency by modifying opinion models, we instead resolve the inconsistency by proposing changes to how polarization is quantified. We present a natural class of group-based polarization measures that capture the extent to which opinions are clustered into distinct groups. Using theoretical arguments and empirical evidence, we show that these group-based measures display interesting, non-monotonic dynamics, even in the simple DeGroot model. In particular, for many natural social networks, group-based metrics can increase over time, and thereby correctly capture perceptions of increasing polarization. Our results build on work by DeMarzo et al., who introduced a group-based polarization metric based on ideological alignment. We show that a central tool from that work, a limit analysis of individual opinions under the DeGroot model, can be extended to the dynamics of other group-based polarization measures, including established statistical measures like bimodality. We also consider local measures of polarization that operationalize how polarization is perceived in a network setting. In conjunction with evidence from prior work that group-based measures better align with real-world perceptions of polarization, our work provides formal support for the use of these measures in place of variance-based polarization in future studies of opinion dynamics. △ Less

Submitted 25 October, 2021; v1 submitted 22 October, 2021; originally announced October 2021.

arXiv:2109.07647 [pdf, other]

Sublinear Time Eigenvalue Approximation via Random Sampling

Authors: Rajarshi Bhattacharjee, Gregory Dexter, Petros Drineas, Cameron Musco, Archan Ray

Abstract: We study the problem of approximating the eigenspectrum of a symmetric matrix $\mathbf A \in \mathbb{R}^{n \times n}$ with bounded entries (i.e., $\|\mathbf A\|_{\infty} \leq 1$). We present a simple sublinear time algorithm that approximates all eigenvalues of $\mathbf{A}$ up to additive error $\pm εn$ using those of a randomly sampled… ▽ More We study the problem of approximating the eigenspectrum of a symmetric matrix $\mathbf A \in \mathbb{R}^{n \times n}$ with bounded entries (i.e., $\|\mathbf A\|_{\infty} \leq 1$). We present a simple sublinear time algorithm that approximates all eigenvalues of $\mathbf{A}$ up to additive error $\pm εn$ using those of a randomly sampled $\tilde {O}\left (\frac{\log^3 n}{ε^3}\right ) \times \tilde O\left (\frac{\log^3 n}{ε^3}\right )$ principal submatrix. Our result can be viewed as a concentration bound on the complete eigenspectrum of a random submatrix, significantly extending known bounds on just the singular values (the magnitudes of the eigenvalues). We give improved error bounds of $\pm ε\sqrt{\text{nnz}(\mathbf{A})}$ and $\pm ε\|\mathbf A\|_F$ when the rows of $\mathbf A$ can be sampled with probabilities proportional to their sparsities or their squared $\ell_2$ norms respectively. Here $\text{nnz}(\mathbf{A})$ is the number of non-zero entries in $\mathbf{A}$ and $\|\mathbf A\|_F$ is its Frobenius norm. Even for the strictly easier problems of approximating the singular values or testing the existence of large negative eigenvalues (Bakshi, Chepurko, and Jayaram, FOCS '20), our results are the first that take advantage of non-uniform sampling to give improved error bounds. From a technical perspective, our results require several new eigenvalue concentration and perturbation bounds for matrices with bounded entries. Our non-uniform sampling bounds require a new algorithmic approach, which judiciously zeroes out entries of a randomly sampled submatrix to reduce variance, before computing the eigenvalues of that submatrix as estimates for those of $\mathbf A$. We complement our theoretical results with numerical simulations, which demonstrate the effectiveness of our algorithms in practice. △ Less

Submitted 21 July, 2022; v1 submitted 15 September, 2021; originally announced September 2021.

Comments: 58 pages, 4 figures

MSC Class: F.2.1; G.1.3; G.1.2; G.4; I.1.2

arXiv:2106.04254 [pdf, other]

Coresets for Classification -- Simplified and Strengthened

Authors: Tung Mai, Anup B. Rao, Cameron Musco

Abstract: We give relative error coresets for training linear classifiers with a broad class of loss functions, including the logistic loss and hinge loss. Our construction achieves $(1\pm ε)$ relative error with $\tilde O(d \cdot μ_y(X)^2/ε^2)$ points, where $μ_y(X)$ is a natural complexity measure of the data matrix $X \in \mathbb{R}^{n \times d}$ and label vector $y \in \{-1,1\}^n$, introduced in by Munt… ▽ More We give relative error coresets for training linear classifiers with a broad class of loss functions, including the logistic loss and hinge loss. Our construction achieves $(1\pm ε)$ relative error with $\tilde O(d \cdot μ_y(X)^2/ε^2)$ points, where $μ_y(X)$ is a natural complexity measure of the data matrix $X \in \mathbb{R}^{n \times d}$ and label vector $y \in \{-1,1\}^n$, introduced in by Munteanu et al. 2018. Our result is based on subsampling data points with probabilities proportional to their $\ell_1$ $Lewis$ $weights$. It significantly improves on existing theoretical bounds and performs well in practice, outperforming uniform subsampling along with other importance sampling methods. Our sampling distribution does not depend on the labels, so can be used for active learning. It also does not depend on the specific loss function, so a single coreset can be used in multiple training scenarios. △ Less

Submitted 17 June, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

arXiv:2104.03461 [pdf, other]

doi 10.1145/3519935.3520009

Sublinear Time Spectral Density Estimation

Authors: Vladimir Braverman, Aditya Krishnan, Christopher Musco

Abstract: We present a new sublinear time algorithm for approximating the spectral density (eigenvalue distribution) of an $n\times n$ normalized graph adjacency or Laplacian matrix. The algorithm recovers the spectrum up to $ε$ accuracy in the Wasserstein-1 distance in $O(n\cdot \text{poly}(1/ε))$ time given sample access to the graph. This result compliments recent work by David Cohen-Steiner, Weihao Kong… ▽ More We present a new sublinear time algorithm for approximating the spectral density (eigenvalue distribution) of an $n\times n$ normalized graph adjacency or Laplacian matrix. The algorithm recovers the spectrum up to $ε$ accuracy in the Wasserstein-1 distance in $O(n\cdot \text{poly}(1/ε))$ time given sample access to the graph. This result compliments recent work by David Cohen-Steiner, Weihao Kong, Christian Sohler, and Gregory Valiant (2018), which obtains a solution with runtime independent of $n$, but exponential in $1/ε$. We conjecture that the trade-off between dimension dependence and accuracy is inherent. Our method is simple and works well experimentally. It is based on a Chebyshev polynomial moment matching method that employees randomized estimators for the matrix trace. We prove that, for any Hermitian $A$, this moment matching method returns an $ε$ approximation to the spectral density using just $O({1}/ε)$ matrix-vector products with $A$. By leveraging stability properties of the Chebyshev polynomial three-term recurrence, we then prove that the method is amenable to the use of coarse approximate matrix-vector products. Our sublinear time algorithm follows from combining this result with a novel sampling algorithm for approximating matrix-vector products with a normalized graph adjacency matrix. Of independent interest, we show a similar result for the widely used \emph{kernel polynomial method} (KPM), proving that this practical algorithm nearly matches the theoretical guarantees of our moment matching method. Our analysis uses tools from Jackson's seminal work on approximation with positive polynomial kernels. △ Less

Submitted 14 April, 2022; v1 submitted 7 April, 2021; originally announced April 2021.

Comments: Accepted to STOC'22

arXiv:2104.03353 [pdf, other]

doi 10.1145/3448016.3458456

Correlation Sketches for Approximate Join-Correlation Queries

Authors: Aécio Santos, Aline Bessa, Fernando Chirigati, Christopher Musco, Juliana Freire

Abstract: The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities~to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column $Q$ and a join column $K_Q$ from a query table… ▽ More The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities~to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column $Q$ and a join column $K_Q$ from a query table $\mathcal{T}_Q$, retrieve tables $\mathcal{T}_X$ in a dataset collection such that $\mathcal{T}_X$ is joinable with $\mathcal{T}_Q$ on $K_Q$ and there is a column $C \in \mathcal{T}_X$ such that $Q$ is correlated with $C$. A naïve approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between $Q$ and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings. △ Less

Submitted 7 April, 2021; originally announced April 2021.

Comments: Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21)

arXiv:2103.16084 [pdf, other]

doi 10.1145/3448016.3457247

Public Transport Planning: When Transit Network Connectivity Meets Commuting Demand

Authors: Sheng Wang, Yuan Sun, Christopher Musco, Zhifeng Bao

Abstract: In this paper, we make a first attempt to incorporate both commuting demand and transit network connectivity in bus route planning (CT-Bus), and formulate it as a constrained optimization problem: planning a new bus route with k edges over an existing transit network without building new bus stops to maximize a linear aggregation of commuting demand and connectivity of the transit network. We prov… ▽ More In this paper, we make a first attempt to incorporate both commuting demand and transit network connectivity in bus route planning (CT-Bus), and formulate it as a constrained optimization problem: planning a new bus route with k edges over an existing transit network without building new bus stops to maximize a linear aggregation of commuting demand and connectivity of the transit network. We prove the NP-hardness of CT-Bus and propose an expansion-based greedy algorithm that iteratively scans potential candidate paths in the network. To boost the efficiency of computing the connectivity of new networks with candidate paths, we convert it to a matrix trace estimation problem and employ a Lanczos method to estimate the natural connectivity of the transit network with a guaranteed error bound. Furthermore, we derive upper bounds on the objective values and use them to greedily select candidates for expansion. Our experiments conducted on real-world transit networks in New York City and Chicago verify the efficiency, effectiveness, and scalability of our algorithms. △ Less

Submitted 3 April, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

Comments: SIGMOD 2021, 14 pages

arXiv:2102.08532 [pdf, other]

DeepWalking Backwards: From Embeddings Back to Graphs

Authors: Sudhanshu Chanpuriya, Cameron Musco, Konstantinos Sotiropoulos, Charalampos E. Tsourakakis

Abstract: Low-dimensional node embeddings play a key role in analyzing graph datasets. However, little work studies exactly what information is encoded by popular embedding methods, and how this information correlates with performance in downstream machine learning tasks. We tackle this question by studying whether embeddings can be inverted to (approximately) recover the graph used to generate them. Focusi… ▽ More Low-dimensional node embeddings play a key role in analyzing graph datasets. However, little work studies exactly what information is encoded by popular embedding methods, and how this information correlates with performance in downstream machine learning tasks. We tackle this question by studying whether embeddings can be inverted to (approximately) recover the graph used to generate them. Focusing on a variant of the popular DeepWalk method (Perozzi et al., 2014; Qiu et al., 2018), we present algorithms for accurate embedding inversion - i.e., from the low-dimensional embedding of a graph G, we can find a graph H with a very similar embedding. We perform numerous experiments on real-world networks, observing that significant information about G, such as specific edges and bulk properties like triangle density, is often lost in H. However, community structure is often preserved or even enhanced. Our findings are a step towards a more rigorous understanding of exactly what information embeddings encode about the input graph, and why this information is useful for learning tasks. △ Less

Submitted 16 February, 2021; originally announced February 2021.

arXiv:2102.08341 [pdf, other]

Faster Kernel Matrix Algebra via Density Estimation

Authors: Arturs Backurs, Piotr Indyk, Cameron Musco, Tal Wagner

Abstract: We study fast algorithms for computing fundamental properties of a positive semidefinite kernel matrix $K \in \mathbb{R}^{n \times n}$ corresponding to $n$ points $x_1,\ldots,x_n \in \mathbb{R}^d$. In particular, we consider estimating the sum of kernel matrix entries, along with its top eigenvalue and eigenvector. We show that the sum of matrix entries can be estimated to $1+ε$ relative error i… ▽ More We study fast algorithms for computing fundamental properties of a positive semidefinite kernel matrix $K \in \mathbb{R}^{n \times n}$ corresponding to $n$ points $x_1,\ldots,x_n \in \mathbb{R}^d$. In particular, we consider estimating the sum of kernel matrix entries, along with its top eigenvalue and eigenvector. We show that the sum of matrix entries can be estimated to $1+ε$ relative error in time $sublinear$ in $n$ and linear in $d$ for many popular kernels, including the Gaussian, exponential, and rational quadratic kernels. For these kernels, we also show that the top eigenvalue (and an approximate eigenvector) can be approximated to $1+ε$ relative error in time $subquadratic$ in $n$ and linear in $d$. Our algorithms represent significant advances in the best known runtimes for these problems. They leverage the positive definiteness of the kernel matrix, along with a recent line of work on efficient kernel density estimation. △ Less

Submitted 17 June, 2021; v1 submitted 16 February, 2021; originally announced February 2021.

arXiv:2101.11751 [pdf, other]

Faster Kernel Interpolation for Gaussian Processes

Authors: Mohit Yadav, Daniel Sheldon, Cameron Musco

Abstract: A key challenge in scaling Gaussian Process (GP) regression to massive datasets is that exact inference requires computation with a dense n x n kernel matrix, where n is the number of data points. Significant work focuses on approximating the kernel matrix via interpolation using a smaller set of m inducing points. Structured kernel interpolation (SKI) is among the most scalable methods: by placin… ▽ More A key challenge in scaling Gaussian Process (GP) regression to massive datasets is that exact inference requires computation with a dense n x n kernel matrix, where n is the number of data points. Significant work focuses on approximating the kernel matrix via interpolation using a smaller set of m inducing points. Structured kernel interpolation (SKI) is among the most scalable methods: by placing inducing points on a dense grid and using structured matrix algebra, SKI achieves per-iteration time of O(n + m log m) for approximate inference. This linear scaling in n enables inference for very large data sets; however the cost is per-iteration, which remains a limitation for extremely large n. We show that the SKI per-iteration time can be reduced to O(m log m) after a single O(n) time precomputation step by reframing SKI as solving a natural Bayesian linear regression problem with a fixed set of m compact basis functions. With per-iteration complexity independent of the dataset size n for a fixed grid, our method scales to truly massive data sets. We demonstrate speedups in practice for a wide range of m and n and apply the method to GP inference on a three-dimensional weather radar dataset with over 100 million points. △ Less

Submitted 13 August, 2021; v1 submitted 27 January, 2021; originally announced January 2021.

Comments: To appear, Artificial Intelligence and Statistics (AISTATS) 2021

arXiv:2012.13976 [pdf, ps, other]

Intervention Efficient Algorithms for Approximate Learning of Causal Graphs

Authors: Raghavendra Addanki, Andrew McGregor, Cameron Musco

Abstract: We study the problem of learning the causal relationships between a set of observed variables in the presence of latents, while minimizing the cost of interventions on the observed variables. We assume access to an undirected graph $G$ on the observed variables whose edges represent either all direct causal relationships or, less restrictively, a superset of causal relationships (identified, e.g.,… ▽ More We study the problem of learning the causal relationships between a set of observed variables in the presence of latents, while minimizing the cost of interventions on the observed variables. We assume access to an undirected graph $G$ on the observed variables whose edges represent either all direct causal relationships or, less restrictively, a superset of causal relationships (identified, e.g., via conditional independence tests or a domain expert). Our goal is to recover the directions of all causal or ancestral relations in $G$, via a minimum cost set of interventions. It is known that constructing an exact minimum cost intervention set for an arbitrary graph $G$ is NP-hard. We further argue that, conditioned on the hardness of approximate graph coloring, no polynomial time algorithm can achieve an approximation factor better than $Θ(\log n)$, where $n$ is the number of observed variables in $G$. To overcome this limitation, we introduce a bi-criteria approximation goal that lets us recover the directions of all but $εn^2$ edges in $G$, for some specified error parameter $ε> 0$. Under this relaxed goal, we give polynomial time algorithms that achieve intervention cost within a small constant factor of the optimal. Our algorithms combine work on efficient intervention design and the design of low-cost separating set systems, with ideas from the literature on graph property testing. △ Less

Submitted 27 December, 2020; originally announced December 2020.

Comments: To appear, International Conference on Algorithmic Learning Theory(ALT) 2021

arXiv:2011.09986 [pdf, other]

Estimation of Shortest Path Covariance Matrices

Authors: Raj Kumar Maity, Cameron Musco

Abstract: We study the sample complexity of estimating the covariance matrix $\mathbfΣ \in \mathbb{R}^{d\times d}$ of a distribution $\mathcal D$ over $\mathbb{R}^d$ given independent samples, under the assumption that $\mathbfΣ$ is graph-structured. In particular, we focus on shortest path covariance matrices, where the covariance between any two measurements is determined by the shortest path distance in… ▽ More We study the sample complexity of estimating the covariance matrix $\mathbfΣ \in \mathbb{R}^{d\times d}$ of a distribution $\mathcal D$ over $\mathbb{R}^d$ given independent samples, under the assumption that $\mathbfΣ$ is graph-structured. In particular, we focus on shortest path covariance matrices, where the covariance between any two measurements is determined by the shortest path distance in an underlying graph with $d$ nodes. Such matrices generalize Toeplitz and circulant covariance matrices and are widely applied in signal processing applications, where the covariance between two measurements depends on the (shortest path) distance between them in time or space. We focus on minimizing both the vector sample complexity: the number of samples drawn from $\mathcal{D}$ and the entry sample complexity: the number of entries read in each sample. The entry sample complexity corresponds to measurement equipment costs in signal processing applications. We give a very simple algorithm for estimating $\mathbfΣ$ up to spectral norm error $ε\left\|\mathbfΣ\right\|_2$ using just $O(\sqrt{D})$ entry sample complexity and $\tilde O(r^2/ε^2)$ vector sample complexity, where $D$ is the diameter of the underlying graph and $r \le d$ is the rank of $\mathbfΣ$. Our method is based on extending the widely applied idea of sparse rulers for Toeplitz covariance estimation to the graph setting. In the special case when $\mathbfΣ$ is a low-rank Toeplitz matrix, our result matches the state-of-the-art, with a far simpler proof. We also give an information theoretic lower bound matching our upper bound up to a factor $D$ and discuss some directions towards closing this gap. △ Less

Submitted 19 November, 2020; originally announced November 2020.

arXiv:2010.10218 [pdf, other]

Model-specific Data Subsampling with Influence Functions

Authors: Anant Raj, Cameron Musco, Lester Mackey, Nicolo Fusi

Abstract: Model selection requires repeatedly evaluating models on a given dataset and measuring their relative performances. In modern applications of machine learning, the models being considered are increasingly more expensive to evaluate and the datasets of interest are increasing in size. As a result, the process of model selection is time-consuming and computationally inefficient. In this work, we dev… ▽ More Model selection requires repeatedly evaluating models on a given dataset and measuring their relative performances. In modern applications of machine learning, the models being considered are increasingly more expensive to evaluate and the datasets of interest are increasing in size. As a result, the process of model selection is time-consuming and computationally inefficient. In this work, we develop a model-specific data subsampling strategy that improves over random sampling whenever training points have varying influence. Specifically, we leverage influence functions to guide our selection strategy, proving theoretically, and demonstrating empirically that our approach quickly selects high-quality models. △ Less

Submitted 20 October, 2020; originally announced October 2020.

Showing 1–50 of 97 results for author: Musco, C