Search | arXiv e-print repository

No Free Prune: Information-Theoretic Barriers to Pruning at Initialization

Authors: Tanishq Kumar, Kevin Luo, Mark Sellke

Abstract: The existence of "lottery tickets" arXiv:1803.03635 at or near initialization raises the tantalizing question of whether large models are necessary in deep learning, or whether sparse networks can be quickly identified and trained without ever training the dense models that contain them. However, efforts to find these sparse subnetworks without training the dense model ("pruning at initialization") have been broadly unsuccessful arXiv:2009.08576. We put forward a theoretical explanation for this, based on the model's effective parameter count, $p_\text{eff}$, given by the sum of the number of non-zero weights in the final network and the mutual information between the sparsity mask and the data. We show the Law of Robustness of arXiv:2105.12806 extends to sparse networks with the usual parameter count replaced by $p_\text{eff}$, meaning a sparse neural network which robustly interpolates noisy data requires a heavily data-dependent mask. We posit that pruning during and after training outputs masks with higher mutual information than those produced by pruning at initialization. Thus two networks may have the same sparsities, but differ in effective parameter count based on how they were trained. This suggests that pruning near initialization may be infeasible and explains why lottery tickets exist, but cannot be found fast (i.e. without training the full network). Experiments on neural networks confirm that information gained during training may indeed affect model capacity.

Submitted 1 February, 2024; originally announced February 2024.
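
As a reading aid, the effective parameter count described in this abstract can be written out as a formula. This is only a sketch of the verbal definition: the symbols $w$ (final weights), $m$ (binary sparsity mask), $D$ (training data) and $I(\cdot\,;\cdot)$ (mutual information) are notation assumed here, not taken from the paper, and the units of the information term are left unspecified.

$$
p_\text{eff} \;=\; \|w \odot m\|_0 \;+\; I(m; D)
$$

Under this reading, a mask chosen independently of the data has $I(m; D) = 0$ and contributes nothing beyond its non-zero weights, whereas a mask produced by pruning during or after training can carry a positive information term.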

arXiv:2306.01995

Asymptotically Optimal Pure Exploration for Infinite-Armed Bandits

Authors: Xiao-Yue Gong, Mark Sellke

Abstract: We study pure exploration with infinitely many bandit arms generated i.i.d. from an unknown distribution. Our goal is to efficiently select a single high quality arm whose average reward is, with probability $1-δ$, within $\varepsilon$ of being among the top $η$-fraction of arms; this is a natural adaptation of the classical PAC guarantee for infinite action sets. We consider both the fixed confidence and fixed budget settings, aiming respectively for minimal expected and fixed sample complexity. For fixed confidence, we give an algorithm with expected sample complexity $O\left(\frac{\log (1/η)\log (1/δ)}{η\varepsilon^2}\right)$. This is optimal except for the $\log (1/η)$ factor, and the $δ$-dependence closes a quadratic gap in the literature. For fixed budget, we show the asymptotically optimal sample complexity as $δ\to 0$ is $c^{-1}\log(1/δ)\big(\log\log(1/δ)\big)^2$ to leading order. Equivalently, the optimal failure probability given exactly $N$ samples decays as $\exp\big(-cN/\log^2 N\big)$, up to a factor $1\pm o_N(1)$ inside the exponent. The constant $c$ depends explicitly on the problem parameters (including the unknown arm distribution) through a certain Fisher information distance. Even the strictly super-linear dependence on $\log(1/δ)$ was not known and resolves a question of Grossman and Moshkovitz (FOCS 2016, SIAM Journal on Computing 2020).

Submitted 3 June, 2023; originally announced June 2023.
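
The "equivalently" step in this abstract, passing from the fixed-budget sample complexity to the failure probability, can be checked with a short inversion. The abbreviation $L = \log(1/δ)$ is introduced here for convenience and is not from the paper; the argument is a heuristic leading-order sketch, not the paper's proof.

$$
\begin{aligned}
N &= c^{-1} L (\log L)^2 \,(1+o(1)) \\
\log N &= \log L + 2\log\log L - \log c + o(1) = (1+o(1))\log L \\
L &= \frac{cN}{(\log L)^2}\,(1+o(1)) = \frac{cN}{\log^2 N}\,(1+o(1)) \\
δ &= \exp\!\Big(-(1\pm o_N(1))\,\frac{cN}{\log^2 N}\Big)
\end{aligned}
$$

The second line is what allows $\log L$ to be replaced by $\log N$ at leading order, since $\log\log L$ and $\log c$ are negligible next to $\log L$ as $δ \to 0$.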

arXiv:2306.01992

On Size-Independent Sample Complexity of ReLU Networks

Authors: Mark Sellke

Abstract: We study the sample complexity of learning ReLU neural networks from the point of view of generalization. Given norm constraints on the weight matrices, a common approach is to estimate the Rademacher complexity of the associated function class. Previously Golowich-Rakhlin-Shamir (2020) obtained a bound independent of the network size (scaling with a product of Frobenius norms) except for a factor of the square-root depth. We give a refinement which often has no explicit depth-dependence at all.

Submitted 4 February, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

Comments: 4 pages

arXiv:2306.01990

Incentivizing Exploration with Linear Contexts and Combinatorial Actions

Authors: Mark Sellke

Abstract: We advance the study of incentivized bandit exploration, in which arm choices are viewed as recommendations and are required to be Bayesian incentive compatible. Recent work has shown under certain independence assumptions that after collecting enough initial samples, the popular Thompson sampling algorithm becomes incentive compatible. We give an analog of this result for linear bandits, where the independence of the prior is replaced by a natural convexity condition. This opens up the possibility of efficient and regret-optimal incentivized exploration in high-dimensional action spaces. In the semibandit model, we also improve the sample complexity for the pre-Thompson sampling phase of initial data collection.

Submitted 19 February, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

Comments: International Conference on Machine Learning (ICML) 2023
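
For readers unfamiliar with the Thompson sampling algorithm mentioned in this abstract, the following is a minimal sketch of its simplest form: independent Bernoulli arms with uniform Beta(1, 1) priors. It is not the linear-bandit or incentive-compatible variant studied in the paper, and the function name `thompson_sampling` and its parameters are hypothetical choices made for this illustration.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Run Beta-Bernoulli Thompson sampling and return per-arm pull counts.

    Illustrative only: arms are independent Bernoulli with Beta(1, 1) priors.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        # Draw one plausible mean per arm from its current Beta posterior.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)
        ]
        # Play the arm whose sampled mean is largest.
        arm = max(range(k), key=lambda i: samples[i])
        # Observe a Bernoulli reward and update that arm's posterior counts.
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Calling `thompson_sampling([0.2, 0.5, 0.8], 2000)` returns a list of pull counts in which the arm with mean 0.8 should receive most of the pulls, because its posterior concentrates on the largest value as data accumulates.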

arXiv:2304.01438

Tight Space Lower Bound for Pseudo-Deterministic Approximate Counting

Authors: Ofer Grossman, Meghal Gupta, Mark Sellke

Abstract: We investigate one of the most basic problems in streaming algorithms: approximating the number of elements in the stream. In 1978, Morris famously gave a randomized algorithm achieving a constant-factor approximation error for streams of length at most N in space $O(\log \log N)$. We investigate the pseudo-deterministic complexity of the problem and prove a tight $Ω(\log N)$ lower bound, thus resolving a problem of Goldwasser-Grossman-Mohanty-Woodruff.

Submitted 5 July, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

Comments: 18 pages
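
The randomized algorithm of Morris referenced in this abstract can be sketched in a few lines. The version below is the textbook base-2 counter, which stores only an exponent (hence roughly $\log\log N$ bits); the class name `MorrisCounter` and the helper `count_stream` are hypothetical names chosen for this illustration. A single counter has high variance, so practical variants average several counters or use a smaller base, which this sketch does not do.

```python
import random


class MorrisCounter:
    """Approximate counter that stores only an exponent (base-2 Morris counter)."""

    def __init__(self, rng=None):
        self.exponent = 0
        self.rng = rng if rng is not None else random.Random()

    def increment(self):
        # Bump the exponent with probability 2^(-exponent).
        if self.rng.random() < 2.0 ** (-self.exponent):
            self.exponent += 1

    def estimate(self):
        # 2^exponent - 1 is an unbiased estimate of the number of increments.
        return 2 ** self.exponent - 1


def count_stream(length, seed=0):
    """Feed `length` items to a fresh counter and return the counter."""
    counter = MorrisCounter(random.Random(seed))
    for _ in range(length):
        counter.increment()
    return counter
```

The output depends on the random draws, so two runs with different seeds can give different estimates for the same stream. That run-to-run variation is what a pseudo-deterministic algorithm must avoid, and the abstract's lower bound shows that avoiding it costs $Ω(\log N)$ space.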

arXiv:2303.12172

Algorithmic Threshold for Multi-Species Spherical Spin Glasses

Authors: Brice Huang, Mark Sellke

Abstract: We study efficient optimization of the Hamiltonians of multi-species spherical spin glasses. Our results characterize the maximum value attained by algorithms that are suitably Lipschitz with respect to the disorder through a variational principle that we study in detail. We rely on the branching overlap gap property introduced in our previous work and develop a new method to establish it that does not require the interpolation method. Consequently our results apply even for models with non-convex covariance, where the Parisi formula for the true ground state remains open. As a special case, we obtain the algorithmic threshold for all single-species spherical spin glasses, which was previously known only for even models. We also obtain closed-form formulas for pure models which coincide with the $E_{\infty}$ value previously determined by the Kac-Rice formula.

Submitted 13 September, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

Comments: updated references

arXiv:2206.05265

When Does Adaptivity Help for Quantum State Learning?

Authors: Sitan Chen, Brice Huang, Jerry Li, Allen Liu, Mark Sellke

Abstract: We consider the classic question of state tomography: given copies of an unknown quantum state $ρ\in\mathbb{C}^{d\times d}$, output $\widehatρ$ which is close to $ρ$ in some sense, e.g. trace distance or fidelity. When one is allowed to make coherent measurements entangled across all copies, $Θ(d^2/ε^2)$ copies are necessary and sufficient to get trace distance $ε$. Unfortunately, the protocols achieving this rate incur large quantum memory overheads that preclude implementation on near-term devices. On the other hand, the best known protocol using incoherent (single-copy) measurements uses $O(d^3/ε^2)$ copies, and multiple papers have posed it as an open question to understand whether or not this rate is tight. In this work, we fully resolve this question, by showing that any protocol using incoherent measurements, even if they are chosen adaptively, requires $Ω(d^3/ε^2)$ copies, matching the best known upper bound. We do so by a new proof technique which directly bounds the "tilt" of the posterior distribution after measurements, which yields a surprisingly short proof of our lower bound, and which we believe may be of independent interest. While this implies that adaptivity does not help for tomography with respect to trace distance, we show that it actually does help for tomography with respect to infidelity. We give an adaptive algorithm that outputs a state which is $γ$-close in infidelity to $ρ$ using only $\tilde{O}(d^3/γ)$ copies, which is optimal for incoherent measurements. In contrast, it is known that any nonadaptive algorithm requires $Ω(d^3/γ^2)$ copies. While it is folklore that in $2$ dimensions, one can achieve a scaling of $O(1/γ)$, to the best of our knowledge, our algorithm is the first to achieve the optimal rate in all dimensions.

Submitted 30 May, 2023; v1 submitted 10 June, 2022; originally announced June 2022.

Comments: 22 pages

arXiv:2203.05093

Sampling from the Sherrington-Kirkpatrick Gibbs measure via algorithmic stochastic localization

Authors: Ahmed El Alaoui, Andrea Montanari, Mark Sellke

Abstract: We consider the Sherrington-Kirkpatrick model of spin glasses at high-temperature and no external field, and study the problem of sampling from the Gibbs distribution $μ$ in polynomial time. We prove that, for any inverse temperature $β<1/2$, there exists an algorithm with complexity $O(n^2)$ that samples from a distribution $μ^{alg}$ which is close in normalized Wasserstein distance to $μ$. Namely, there exists a coupling of $μ$ and $μ^{alg}$ such that if $(x,x^{alg})\in\{-1,+1\}^n\times \{-1,+1\}^n$ is a pair drawn from this coupling, then $n^{-1}\mathbb E\{||x-x^{alg}||_2^2\}=o_n(1)$. The best previous results, by Bauerschmidt and Bodineau and by Eldan, Koehler, and Zeitouni, implied efficient algorithms to approximately sample (under a stronger metric) for $β<1/4$. We complement this result with a negative one, by introducing a suitable "stability" property for sampling algorithms, which is verified by many standard techniques. We prove that no stable algorithm can approximately sample for $β>1$, even under the normalized Wasserstein metric. Our sampling method is based on an algorithmic implementation of stochastic localization, which progressively tilts the measure $μ$ towards a single configuration, together with an approximate message passing algorithm that is used to approximate the mean of the tilted measure.

Submitted 15 February, 2024; v1 submitted 9 March, 2022; originally announced March 2022.

arXiv:2202.09653

The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication

Authors: Allen Liu, Mark Sellke

Abstract: We study the stochastic multi-player multi-armed bandit problem. In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms. However the players cannot communicate and are penalized (e.g. receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/Δ)$ where $Δ$ is the gap between the $m$-th and $m+1$-st best arms. Such guarantees were recently achieved in a model allowing the players to implicitly communicate through intentional collisions. Surprisingly, we show that with no communication at all, such guarantees are not achievable. In fact, obtaining the optimal $\tilde{O}(1/Δ)$ regret for some values of $Δ$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of Bubeck, Budzinski, and the second author. As there, our algorithm succeeds even when feedback upon collision can be corrupted by an adaptive adversary, thanks to a strong no-collision property. Our lower bound is based on topological obstructions at multiple scales and is completely new.

Submitted 6 June, 2022; v1 submitted 19 February, 2022; originally announced February 2022.

Comments: Accepted for presentation at Conference on Learning Theory (COLT) 2022

arXiv:2111.06813

Local algorithms for Maximum Cut and Minimum Bisection on locally treelike regular graphs of large degree

Authors: Ahmed El Alaoui, Andrea Montanari, Mark Sellke

Abstract: Given a graph $G$ of degree $k$ over $n$ vertices, we consider the problem of computing a near maximum cut or a near minimum bisection in polynomial time. For graphs of girth $2L$, we develop a local message passing algorithm whose complexity is $O(nkL)$, and that achieves near optimal cut values among all $L$-local algorithms. Focusing on max-cut, the algorithm constructs a cut of value $nk/4+ n\mathsf{P}_\star\sqrt{k/4}+\mathsf{err}(n,k,L)$, where $\mathsf{P}_\star\approx 0.763166$ is the value of the Parisi formula from spin glass theory, and $\mathsf{err}(n,k,L)=o_n(n)+no_k(\sqrt{k})+n \sqrt{k} o_L(1)$ (subscripts indicate the asymptotic variables). Our result generalizes to locally treelike graphs, i.e., graphs whose girth becomes $2L$ after removing a small fraction of vertices. Earlier work established that, for random $k$-regular graphs, the typical max-cut value is $nk/4+ n\mathsf{P}_\star\sqrt{k/4}+o_n(n)+no_k(\sqrt{k})$. Therefore our algorithm is nearly optimal on such graphs. An immediate corollary of this result is that random regular graphs have nearly minimum max-cut, and nearly maximum min-bisection among all regular locally treelike graphs. This can be viewed as a combinatorial version of the near-Ramanujan property of random regular graphs.

Submitted 3 February, 2023; v1 submitted 12 November, 2021; originally announced November 2021.

Comments: Improved presentation. To appear in Random Structures and Algorithms

arXiv:2110.07847

Tight Lipschitz Hardness for Optimizing Mean Field Spin Glasses

Authors: Brice Huang, Mark Sellke

Abstract: We study the problem of algorithmically optimizing the Hamiltonian $H_N$ of a spherical or Ising mixed $p$-spin glass. The maximum asymptotic value $\mathsf{OPT}$ of $H_N/N$ is characterized by a variational principle known as the Parisi formula, proved first by Talagrand and in more generality by Panchenko. Recently developed approximate message passing algorithms efficiently optimize $H_N/N$ up to a value $\mathsf{ALG}$ given by an extended Parisi formula, which minimizes over a larger space of functional order parameters. These two objectives are equal for spin glasses exhibiting a no overlap gap property. However, $\mathsf{ALG} < \mathsf{OPT}$ can also occur, and no efficient algorithm producing an objective value exceeding $\mathsf{ALG}$ is known. We prove that for mixed even $p$-spin models, no algorithm satisfying an overlap concentration property can produce an objective larger than $\mathsf{ALG}$ with non-negligible probability. This property holds for all algorithms with suitably Lipschitz dependence on the disorder coefficients of $H_N$. It encompasses natural formulations of gradient descent, approximate message passing, and Langevin dynamics run for bounded time and in particular includes the algorithms achieving $\mathsf{ALG}$ mentioned above. To prove this result, we substantially generalize the overlap gap property framework introduced by Gamarnik and Sudan to arbitrary ultrametric forbidden structures of solutions.

Submitted 11 September, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

Comments: 84 pages, 2 figures, updated introduction

arXiv:2106.09913

Iterative Feature Matching: Toward Provable Domain Generalization with Logarithmic Environments

Authors: Yining Chen, Elan Rosenfeld, Mark Sellke, Tengyu Ma, Andrej Risteski

Abstract: Domain generalization aims at performing well on unseen test environments with data from a limited number of training environments. Despite a proliferation of proposal algorithms for this task, assessing their performance both theoretically and empirically is still very challenging. Distributional matching algorithms such as (Conditional) Domain Adversarial Networks [Ganin et al., 2016, Long et al., 2018] are popular and enjoy empirical success, but they lack formal guarantees. Other approaches such as Invariant Risk Minimization (IRM) require a prohibitively large number of training environments -- linear in the dimension of the spurious feature space $d_s$ -- even on simple data models like the one proposed by [Rosenfeld et al., 2021]. Under a variant of this model, we show that both ERM and IRM cannot generalize with $o(d_s)$ environments. We then present an iterative feature matching algorithm that is guaranteed with high probability to yield a predictor that generalizes after seeing only $O(\log d_s)$ environments. Our results provide the first theoretical justification for a family of distribution-matching algorithms widely used in practice under a concrete nontrivial data model.

Submitted 22 November, 2021; v1 submitted 18 June, 2021; originally announced June 2021.

Comments: We acknowledge that the previous version of this paper (v1) contained an error - Theorem 3.2 was incorrect. We removed this theorem and updated the rest of the paper in v2

arXiv:2105.12806

A Universal Law of Robustness via Isoperimetry

Authors: Sébastien Bubeck, Mark Sellke

Abstract: Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a partial theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry. In the case of two-layers neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.

Submitted 23 December, 2022; v1 submitted 26 May, 2021; originally announced May 2021.
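
The "$d$ times more parameters" statement in this abstract can be read as a lower bound on the Lipschitz constant of any interpolating model. The display below is an informal paraphrase, suppressing constants and logarithmic factors and assuming the abstract's conditions (noisy labels, isoperimetric covariates, polynomially bounded weights); the symbols $n$ (number of samples), $p$ (number of parameters) and $\mathrm{Lip}(f)$ are notation assumed here.

$$
\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{nd}{p}}
\qquad\Longrightarrow\qquad
\mathrm{Lip}(f) = O(1) \ \text{ requires } \ p \gtrsim nd
$$

Mere interpolation is classically possible with $p \approx n$, so the smooth regime needs roughly $d$ times as many parameters.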

arXiv:2011.11503

Metric Transforms and Low Rank Matrices via Representation Theory of the Real Hyperrectangle

Authors: Josh Alman, Timothy Chu, Gary Miller, Shyam Narayanan, Mark Sellke, Zhao Song

Abstract: In this paper, we develop a new technique which we call representation theory of the real hyperrectangle, which describes how to compute the eigenvectors and eigenvalues of certain matrices arising from hyperrectangles. We show that these matrices arise naturally when analyzing a number of different algorithmic tasks such as kernel methods, neural network training, natural language processing, and the design of algorithms using the polynomial method. We then use our new technique along with these connections to prove several new structural results in these areas, including: $\bullet$ A function is a positive definite Manhattan kernel if and only if it is a completely monotone function. These kernels are widely used across machine learning; one example is the Laplace kernel which is widely used in machine learning for chemistry. $\bullet$ A function transforms Manhattan distances to Manhattan distances if and only if it is a Bernstein function. This completes the theory of Manhattan to Manhattan metric transforms initiated by Assouad in 1980. $\bullet$ A function applied entry-wise to any square matrix of rank $r$ always results in a matrix of rank $< 2^{r-1}$ if and only if it is a polynomial of sufficiently low degree. This gives a converse to a key lemma used by the polynomial method in algorithm design. Our work includes a sophisticated combination of techniques from different fields, including metric embeddings, the polynomial method, and group representation theory.

Submitted 4 August, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

arXiv:2011.03896

Cooperative and Stochastic Multi-Player Multi-Armed Bandit: Optimal Regret With Neither Communication Nor Collisions

Authors: Sébastien Bubeck, Thomas Budzinski, Mark Sellke

Abstract: We consider the cooperative multi-player version of the stochastic multi-armed bandit problem. We study the regime where the players cannot communicate but have access to shared randomness. In prior work by the first two authors, a strategy for this regime was constructed for two players and three arms, with regret $\tilde{O}(\sqrt{T})$, and with no collisions at all between the players (with very high probability). In this paper we show that these properties (near-optimal regret and no collisions at all) are achievable for any number of players and arms. At a high level, the previous strategy heavily relied on a $2$-dimensional geometric intuition that was difficult to generalize in higher dimensions, while here we take a more combinatorial route to build the new strategy.

Submitted 7 November, 2020; originally announced November 2020.

arXiv:2010.15811

Algorithmic pure states for the negative spherical perceptron

Authors: Ahmed El Alaoui, Mark Sellke

Abstract: We consider the spherical perceptron with Gaussian disorder. This is the set $S$ of points $σ\in \mathbb{R}^N$ on the sphere of radius $\sqrt{N}$ satisfying $\langle g_a , σ\rangle \ge κ\sqrt{N}\,$ for all $1 \le a \le M$, where $(g_a)_{a=1}^M$ are independent standard gaussian vectors and $κ\in \mathbb{R}$ is fixed. Various characteristics of $S$ such as its surface measure and the largest $M$ for which it is non-empty, were computed heuristically in statistical physics in the asymptotic regime $N \to \infty$, $M/N \to α$. The case $κ<0$ is of special interest as $S$ is conjectured to exhibit a hierarchical tree-like geometry known as "full replica-symmetry breaking" (FRSB) close to the satisfiability threshold $α_{\text{SAT}}(κ)$, and whose characteristics are captured by a Parisi variational principle akin to the one appearing in the Sherrington-Kirkpatrick model. In this paper we design an efficient algorithm which, given oracle access to the solution of the Parisi variational principle, exploits this conjectured FRSB structure for $κ<0$ and outputs a vector $\hatσ$ satisfying $\langle g_a , \hatσ\rangle \ge κ\sqrt{N}$ for all $1\le a \le M$ and lying on a sphere of non-trivial radius $\sqrt{\bar{q} N}$, where $\bar{q} \in (0,1)$ is the right-end of the support of the associated Parisi measure. We expect $\hatσ$ to be approximately the barycenter of a pure state of the spherical perceptron. Moreover we expect that $\bar{q} \to 1$ as $α\to α_{\text{SAT}}(κ)$, so that $\big\langle g_a,\hatσ/|\hatσ|\big\rangle \geq (κ-o(1))\sqrt{N}$ near criticality.

Submitted 29 October, 2020; originally announced October 2020.

Comments: 34 pages

arXiv:2009.08266

Metrical Service Systems with Transformations

Authors: Sébastien Bubeck, Niv Buchbinder, Christian Coester, Mark Sellke

Abstract: We consider a generalization of the fundamental online metrical service systems (MSS) problem where the feasible region can be transformed between requests. In this problem, which we call T-MSS, an algorithm maintains a point in a metric space and has to serve a sequence of requests. Each request is a map (transformation) $f_t\colon A_t\to B_t$ between subsets $A_t$ and $B_t$ of the metric space. To serve it, the algorithm has to go to a point $a_t\in A_t$, paying the distance from its previous position. Then, the transformation is applied, modifying the algorithm's state to $f_t(a_t)$. Such transformations can model, e.g., changes to the environment that are outside of an algorithm's control, and we therefore do not charge any additional cost to the algorithm when the transformation is applied. The transformations also allow to model requests occurring in the $k$-taxi problem. We show that for $α$-Lipschitz transformations, the competitive ratio is $Θ(α)^{n-2}$ on $n$-point metrics. Here, the upper bound is achieved by a deterministic algorithm and the lower bound holds even for randomized algorithms. For the $k$-taxi problem, we prove a competitive ratio of $\tilde O((n\log k)^2)$. For chasing convex bodies, we show that even with contracting transformations no competitive algorithm exists. The problem T-MSS has a striking connection to the following deep mathematical question: Given a finite metric space $M$, what is the required cardinality of an extension $\hat M\supseteq M$ where each partial isometry on $M$ extends to an automorphism? We give partial answers for special cases.

Submitted 17 September, 2020; originally announced September 2020.

arXiv:2007.07862 [pdf, ps, other]

Vertex Sparsification for Edge Connectivity

Authors: Parinya Chalermsook, Syamantak Das, Bundit Laekhanukit, Yunbum Kook, Yang P. Liu, Richard Peng, Mark Sellke, Daniel Vaz

Abstract: Graph compression or sparsification is a basic information-theoretic and computational question. A major open problem in this research area is whether $(1+ε)$-approximate cut-preserving vertex sparsifiers with size close to the number of terminals exist. As a step towards this goal, we study a thresholded version of the problem: for a given parameter $c$, find a smaller graph, which we call connectivity-$c$ mimicking network, which preserves connectivity among $k$ terminals exactly up to the value of $c$. We show that connectivity-$c$ mimicking networks with $O(kc^4)$ edges exist and can be found in time $m(c\log n)^{O(c)}$. We also give a separate algorithm that constructs such graphs with $k \cdot O(c)^{2c}$ edges in time $mc^{O(c)}\log^{O(1)}n$. These results lead to the first data structures for answering fully dynamic offline $c$-edge-connectivity queries for $c \ge 4$ in polylogarithmic time per query, as well as more efficient algorithms for survivable network design on bounded treewidth graphs.

Submitted 15 July, 2020; originally announced July 2020.

Comments: Merged version of arXiv:1910.10359 and arXiv:1910.10665 with improved bounds, 55 pages

arXiv:2004.07346 [pdf, other]

Online Multiserver Convex Chasing and Optimization

Authors: Sébastien Bubeck, Yuval Rabani, Mark Sellke

Abstract: We introduce the problem of $k$-chasing of convex functions, a simultaneous generalization of both the famous k-server problem in $R^d$, and of the problem of chasing convex bodies and functions. Aside from fundamental interest in this general form, it has natural applications to online $k$-clustering problems with objectives such as $k$-median or $k$-means. We show that this problem exhibits a rich landscape of behavior. In general, if both $k > 1$ and $d > 1$ there does not exist any online algorithm with bounded competitiveness. By contrast, we exhibit a class of nicely behaved functions (which include in particular the above-mentioned clustering problems), for which we show that competitive online algorithms exist, and moreover with dimension-free competitive ratio. We also introduce a parallel question of top-$k$ action regret minimization in the realm of online convex optimization. There, too, a much rougher landscape emerges for $k > 1$. While it is possible to achieve vanishing regret, unlike the top-one action case the rate of vanishing does not speed up for strongly convex functions. Moreover, vanishing regret necessitates both intractable computations and randomness. Finally we leave open whether almost dimension-free regret is achievable for $k > 1$ and general convex losses. As evidence that it might be possible, we prove dimension-free regret for linear losses via an information-theoretic argument.

Submitted 15 April, 2020; originally announced April 2020.

arXiv:2002.00558 [pdf, ps, other]

The Price of Incentivizing Exploration: A Characterization via Thompson Sampling and Sample Complexity

Authors: Mark Sellke, Aleksandrs Slivkins

Abstract: We consider incentivized exploration: a version of multi-armed bandits where the choice of arms is controlled by self-interested agents, and the algorithm can only issue recommendations. The algorithm controls the flow of information, and the information asymmetry can incentivize the agents to explore. Prior work achieves optimal regret rates up to multiplicative factors that become arbitrarily large depending on the Bayesian priors, and scale exponentially in the number of arms. A more basic problem of sampling each arm once runs into similar factors. We focus on the price of incentives: the loss in performance, broadly construed, incurred for the sake of incentive-compatibility. We prove that Thompson Sampling, a standard bandit algorithm, is incentive-compatible if initialized with sufficiently many data points. The performance loss due to incentives is therefore limited to the initial rounds when these data points are collected. The problem is largely reduced to that of sample complexity: how many rounds are needed? We address this question, providing matching upper and lower bounds and instantiating them in various corollaries. Typically, the optimal sample complexity is polynomial in the number of arms and exponential in the "strength of beliefs".

Submitted 12 June, 2022; v1 submitted 2 February, 2020; originally announced February 2020.

arXiv:1910.10359 [pdf, ps, other]

Vertex Sparsifiers for c-Edge Connectivity

Authors: Yang P. Liu, Richard Peng, Mark Sellke

Abstract: We show the existence of O(f(c)k) sized vertex sparsifiers that preserve all edge-connectivity values up to c between a set of k terminal vertices, where f(c) is a function that only depends on c, the edge-connectivity value. This construction is algorithmic: we also provide an algorithm whose running time depends linearly on k, but exponentially in c. It implies that for constant values of c, an offline sequence of edge insertions/deletions and c-edge-connectivity queries can be answered in polylog time per operation. These results are obtained by combining structural results about minimum terminal separating cuts in undirected graphs with recent developments in expander decomposition based methods for finding small vertex/edge cuts in graphs.

Submitted 23 October, 2019; originally announced October 2019.

arXiv:1905.11968 [pdf, ps, other]

Chasing Convex Bodies Optimally

Authors: Mark Sellke

Abstract: In the chasing convex bodies problem, an online player receives a request sequence of $N$ convex sets $K_1,\dots, K_N$ contained in a normed space $\mathbb R^d$. The player starts at $x_0\in \mathbb R^d$, and after observing each $K_n$ picks a new point $x_n\in K_n$. At each step the player pays a movement cost of $||x_n-x_{n-1}||$. The player aims to maintain a constant competitive ratio against the minimum cost possible in hindsight, i.e. knowing all requests in advance. The existence of a finite competitive ratio for convex body chasing was first conjectured in 1991 by Friedman and Linial. This conjecture was recently resolved with an exponential $2^{O(d)}$ upper bound on the competitive ratio. We give an improved algorithm achieving competitive ratio $d$ in any normed space, which is exactly tight for $\ell^{\infty}$. In Euclidean space, our algorithm also achieves competitive ratio $O(\sqrt{d\log N})$, nearly matching a $\sqrt{d}$ lower bound when $N$ is subexponential in $d$. The approach extends our prior work for nested convex bodies, which is based on the classical Steiner point of a convex body. We define the functional Steiner point of a convex function and apply it to the associated work function.

Submitted 23 November, 2021; v1 submitted 28 May, 2019; originally announced May 2019.

arXiv:1904.12233 [pdf, ps, other]

Non-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without

Authors: Sébastien Bubeck, Yuanzhi Li, Yuval Peres, Mark Sellke

Abstract: We consider the non-stochastic version of the (cooperative) multi-player multi-armed bandit problem. The model assumes no communication at all between the players, and furthermore when two (or more) players select the same action this results in a maximal loss. We prove the first $\sqrt{T}$-type regret guarantee for this problem, under the feedback model where collisions are announced to the colliding players. Such a bound was not known even for the simpler stochastic version. We also prove the first sublinear guarantee for the feedback model where collision information is not available, namely $T^{1-\frac{1}{2m}}$ where $m$ is the number of players.

Submitted 1 May, 2019; v1 submitted 27 April, 2019; originally announced April 2019.

Comments: 27 pages, v2 adds a pseudorandom generator construction to remove the shared randomness assumption in the $\sqrt{T}$-regret result (Section 3.9)

arXiv:1902.00681 [pdf, ps, other]

First-Order Bayesian Regret Analysis of Thompson Sampling

Authors: Sébastien Bubeck, Mark Sellke

Abstract: We address online combinatorial optimization when the player has a prior over the adversary's sequence of losses. In this framework, Russo and Van Roy proposed an information-theoretic analysis of Thompson Sampling based on the information ratio, resulting in optimal worst-case regret bounds. In this paper we introduce three novel ideas to this line of work. First we propose a new quantity, the scale-sensitive information ratio, which allows us to obtain more refined first-order regret bounds (i.e., bounds of the form $\sqrt{L^*}$ where $L^*$ is the loss of the best combinatorial action). Second we replace the entropy over combinatorial actions by a coordinate entropy, which allows us to obtain the first optimal worst-case bound for Thompson Sampling in the combinatorial setting. Finally, we introduce a novel link between Bayesian agents and frequentist confidence intervals. Combining these ideas we show that the classical multi-armed bandit first-order regret bound $\tilde{O}(\sqrt{d L^*})$ still holds true in the more challenging and more general semi-bandit scenario. This latter result improves the previous state of the art bound $\tilde{O}(\sqrt{(d+m^3)L^*})$ by Lykouris, Sridharan and Tardos. Moreover we sharpen these results with two technical ingredients. The first leverages a recent insight of Zimmert and Lattimore to replace Shannon entropy with more refined potential functions in the analysis. The second is a Thresholded Thompson sampling algorithm, which slightly modifies the original algorithm by never playing low-probability actions. This thresholding results in fully $T$-independent regret bounds when $L^*$ is almost surely upper-bounded, which we show does not hold for ordinary Thompson sampling.

Submitted 3 April, 2022; v1 submitted 2 February, 2019; originally announced February 2019.

Comments: 58 pages

arXiv:1811.00999 [pdf, ps, other]

Chasing Nested Convex Bodies Nearly Optimally

Authors: Sébastien Bubeck, Bo'az Klartag, Yin Tat Lee, Yuanzhi Li, Mark Sellke

Abstract: The convex body chasing problem, introduced by Friedman and Linial, is a competitive analysis problem on any normed vector space. In convex body chasing, for each timestep $t\in\mathbb N$, a convex body $K_t\subseteq \mathbb R^d$ is given as a request, and the player picks a point $x_t\in K_t$. The player aims to ensure that the total distance $\sum_{t=0}^{T-1}||x_t-x_{t+1}||$ is within a bounded ratio of the smallest possible offline solution. In this work, we consider the nested version of the problem, in which the sequence $(K_t)$ must be decreasing. For Euclidean spaces, we consider a memoryless algorithm which moves to the so-called Steiner point, and show that in a certain sense it is exactly optimal among memoryless algorithms. For general finite dimensional normed spaces, we combine the Steiner point and our recent previous algorithm to obtain a new algorithm which is nearly optimal for all $\ell^p_d$ spaces with $p\geq 1$, closing a polynomial gap.

Submitted 12 August, 2021; v1 submitted 2 November, 2018; originally announced November 2018.

arXiv:1811.00887 [pdf, ps, other]

Competitively Chasing Convex Bodies

Authors: Sébastien Bubeck, Yin Tat Lee, Yuanzhi Li, Mark Sellke

Abstract: Let $\mathcal{F}$ be a family of sets in some metric space. In the $\mathcal{F}$-chasing problem, an online algorithm observes a request sequence of sets in $\mathcal{F}$ and responds (online) by giving a sequence of points in these sets. The movement cost is the distance between consecutive such points. The competitive ratio is the worst case ratio (over request sequences) between the total movement of the online algorithm and the smallest movement one could have achieved by knowing in advance the request sequence. The family $\mathcal{F}$ is said to be chaseable if there exists an online algorithm with finite competitive ratio. In 1991, Linial and Friedman conjectured that the family of convex sets in Euclidean space is chaseable. We prove this conjecture.

Submitted 2 November, 2018; originally announced November 2018.

Comments: 14 pages

arXiv:1710.11278 [pdf, other]

Approximating Continuous Functions by ReLU Nets of Minimal Width

Authors: Boris Hanin, Mark Sellke

Abstract: This article concerns the expressive power of depth in deep feed-forward neural nets with ReLU activations. Specifically, we answer the following question: for a fixed $d_{in}\geq 1,$ what is the minimal width $w$ so that neural nets with ReLU activations, input dimension $d_{in}$, hidden layer widths at most $w,$ and arbitrary depth can approximate any continuous, real-valued function of $d_{in}$ variables arbitrarily well? It turns out that this minimal width is exactly equal to $d_{in}+1.$ That is, if all the hidden layer widths are bounded by $d_{in}$, then even in the infinite depth limit, ReLU nets can only express a very limited class of functions, and, on the other hand, any continuous function on the $d_{in}$-dimensional unit cube can be approximated to arbitrary precision by ReLU nets in which all hidden layers have width exactly $d_{in}+1.$ Our construction in fact shows that any continuous function $f:[0,1]^{d_{in}}\to\mathbb R^{d_{out}}$ can be approximated by a net of width $d_{in}+d_{out}$. We obtain quantitative depth estimates for such an approximation in terms of the modulus of continuity of $f$.

Submitted 10 March, 2018; v1 submitted 30 October, 2017; originally announced October 2017.

Comments: v2, 13 pages. Extended main result to higher dimensional output. Comments welcome

Showing 1–27 of 27 results for author: Sellke, M