Search | arXiv e-print repository

Depth Separations in Neural Networks: Separating the Dimension from the Accuracy

Authors: Itay Safran, Daniel Reichman, Paul Valiant

Abstract: We prove an exponential separation between depth 2 and depth 3 neural networks, when approximating an $\mathcal{O}(1)$-Lipschitz target function to constant accuracy, with respect to a distribution with support in $[0,1]^{d}$, assuming exponentially bounded weights. This addresses an open problem posed in \citet{safran2019depth}, and proves that the curse of dimensionality manifests in depth 2 approximation, even in cases where the target function can be represented efficiently using depth 3. Previously, lower bounds that were used to separate depth 2 from depth 3 required that at least one of the Lipschitz parameter, target accuracy or (some measure of) the size of the domain of approximation scale polynomially with the input dimension, whereas we fix the former two and restrict our domain to the unit hypercube. Our lower bound holds for a wide variety of activation functions, and is based on a novel application of an average- to worst-case random self-reducibility argument, to reduce the problem to threshold circuits lower bounds.

Submitted 11 February, 2024; originally announced February 2024.

arXiv:2311.12784 [pdf, ps, other]

Optimality in Mean Estimation: Beyond Worst-Case, Beyond Sub-Gaussian, and Beyond $1+α$ Moments

Authors: Trung Dang, Jasper C. H. Lee, Maoyuan Song, Paul Valiant

Abstract: There is growing interest in improving our algorithmic understanding of fundamental statistical problems such as mean estimation, driven by the goal of understanding the limits of what we can extract from valuable data. The state of the art results for mean estimation in $\mathbb{R}$ are 1) the optimal sub-Gaussian mean estimator by [LV22], with the tight sub-Gaussian constant for all distributions with finite but unknown variance, and 2) the analysis of the median-of-means algorithm by [BCL13] and a lower bound by [DLLO16], characterizing the big-O optimal errors for distributions for which only a $1+α$ moment exists for $α\in (0,1)$. Both results, however, are optimal only in the worst case. We initiate the fine-grained study of the mean estimation problem: Can algorithms leverage useful features of the input distribution to beat the sub-Gaussian rate, without explicit knowledge of such features? We resolve this question with an unexpectedly nuanced answer: "Yes in limited regimes, but in general no". For any distribution $p$ with a finite mean, we construct a distribution $q$ whose mean is well-separated from $p$'s, yet $p$ and $q$ are not distinguishable with high probability, and $q$ further preserves $p$'s moments up to constants. The main consequence is that no reasonable estimator can asymptotically achieve better than the sub-Gaussian error rate for any distribution, matching the worst-case result of [LV22].
More generally, we introduce a new definitional framework to analyze the fine-grained optimality of algorithms, which we call "neighborhood optimality", interpolating between the unattainably strong "instance optimality" and the trivially weak "admissibility" definitions. Applying the new framework, we show that median-of-means is neighborhood optimal, up to constant factors. It is open to find a neighborhood-optimal estimator without constant factor slackness.

Submitted 21 November, 2023; originally announced November 2023.

Comments: 27 pages, to appear in NeurIPS 2023. Abstract shortened to fit arXiv limit
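
Illustration (not from the paper): the abstract above refers to the median-of-means algorithm without describing it. The following is a minimal sketch of the textbook version, assuming the samples are i.i.d. and already in random order. The function name `median_of_means` and the parameter `num_groups` are hypothetical, and leftover samples that do not fill a group are simply dropped.

```python
import statistics


def median_of_means(samples, num_groups):
    """Textbook median-of-means: average each group, return the median of the averages.

    Assumes `samples` is a sequence of i.i.d. real values in random order.
    Samples that do not fit evenly into `num_groups` groups are ignored.
    """
    if num_groups < 1 or num_groups > len(samples):
        raise ValueError("num_groups must be between 1 and len(samples)")
    group_size = len(samples) // num_groups
    group_means = [
        sum(samples[i * group_size:(i + 1) * group_size]) / group_size
        for i in range(num_groups)
    ]
    return statistics.median(group_means)


# One extreme value dominates the plain sample mean but only one group mean.
data = [1, 2, 3, 4, 5, 1000]
print(median_of_means(data, 3))   # group means are 1.5, 3.5, 502.5
print(sum(data) / len(data))      # plain sample mean, for comparison
```

A common choice is to take the number of groups proportional to $\log\frac{1}{δ}$ for failure probability $δ$; the constant is a tuning decision and is not specified here.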

arXiv:2310.09408 [pdf, other]

Improving Pearson's chi-squared test: hypothesis testing of distributions -- optimally

Authors: Trung Dang, Walter McKelvie, Paul Valiant, Hongao Wang

Abstract: Pearson's chi-squared test, from 1900, is the standard statistical tool for "hypothesis testing on distributions": namely, given samples from an unknown distribution $Q$ that may or may not equal a hypothesis distribution $P$, we want to return "yes" if $P=Q$ and "no" if $P$ is far from $Q$. While the chi-squared test is easy to use, it has been known for a while that it is not "data efficient", it does not make the best use of its data. Precisely, for accuracy $ε$ and confidence $δ$, and given $n$ samples from the unknown distribution $Q$, a tester should return "yes" with probability $>1-δ$ when $P=Q$, and "no" with probability $>1-δ$ when $|P-Q|>ε$. The challenge is to find a tester with the \emph{best} tradeoff between $ε$, $δ$, and $n$. We introduce a new tester, efficiently computable and easy to use, which we hope will replace the chi-squared tester in practical use. Our tester is found via a new non-convex optimization framework that essentially seeks to "find the tester whose Chernoff bounds on its performance are as good as possible". This tester is $1+o(1)$ optimal, in that the number of samples $n$ needed by the tester is within $1+o(1)$ factor of the samples needed by \emph{any} tester, even non-linear testers (for the setting: accuracy $ε$, confidence $δ$, and hypothesis $P$). We complement this algorithmic framework with matching lower bounds saying, essentially, that "our tester is instance-optimal, even to $1+o(1)$ factors, to the degree that Chernoff bounds are tight".
Our overall non-convex optimization framework extends well beyond the current problem and is of independent interest.

Submitted 13 October, 2023; originally announced October 2023.
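
Illustration (not from the paper): for readers unfamiliar with the baseline the abstract above sets out to improve, this is a minimal sketch of the classical Pearson statistic $\sum_i (O_i - nP_i)^2/(nP_i)$, where $O_i$ are observed counts and $P$ is the hypothesis distribution. It is not the new tester the paper introduces; the function name `pearson_chi_squared` is hypothetical.

```python
def pearson_chi_squared(counts, hypothesis):
    """Classical Pearson chi-squared statistic for observed counts against hypothesis P.

    `counts[i]` is the number of samples that landed on outcome i, and
    `hypothesis[i]` is the probability P assigns to outcome i (must be > 0).
    """
    if len(counts) != len(hypothesis):
        raise ValueError("counts and hypothesis must have the same length")
    if any(p <= 0 for p in hypothesis):
        raise ValueError("every hypothesis probability must be positive")
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, hypothesis))


# Counts that match the hypothesis exactly give a statistic of zero;
# the further the counts drift from n * P, the larger the statistic.
print(pearson_chi_squared([10, 10, 10, 10], [0.25, 0.25, 0.25, 0.25]))
print(pearson_chi_squared([20, 0], [0.5, 0.5]))
```

In the classical test the statistic is compared against a chi-squared quantile with (number of outcomes minus one) degrees of freedom; that thresholding step is omitted here.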

arXiv:2307.09212 [pdf, other]

How Many Neurons Does it Take to Approximate the Maximum?

Authors: Itay Safran, Daniel Reichman, Paul Valiant

Abstract: We study the size of a neural network needed to approximate the maximum function over $d$ inputs, in the most basic setting of approximating with respect to the $L_2$ norm, for continuous distributions, for a network that uses ReLU activations. We provide new lower and upper bounds on the width required for approximation across various depths. Our results establish new depth separations between depth 2 and 3, and depth 3 and 5 networks, as well as providing a depth $\mathcal{O}(\log(\log(d)))$ and width $\mathcal{O}(d)$ construction which approximates the maximum function. Our depth separation results are facilitated by a new lower bound for depth 2 networks approximating the maximum function over the uniform distribution, assuming an exponential upper bound on the size of the weights. Furthermore, we are able to use this depth 2 lower bound to provide tight bounds on the number of neurons needed to approximate the maximum by a depth 3 network. Our lower bounds are of potentially broad interest as they apply to the widely studied and used \emph{max} function, in contrast to many previous results that base their bounds on specially constructed or pathological functions and distributions.

Submitted 7 November, 2023; v1 submitted 18 July, 2023; originally announced July 2023.
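
Illustration (not from the paper): a minimal sketch of the standard way to compute the maximum exactly with ReLU units, using the identity $\max(a,b) = a + \mathrm{ReLU}(b-a)$ in a pairwise tournament with about $\log_2 d$ rounds. This is the folklore construction, not the depth $\mathcal{O}(\log(\log(d)))$ approximation described in the abstract above. The helper names `relu`, `relu_max2`, and `relu_max` are hypothetical.

```python
def relu(x):
    """Rectified linear unit."""
    return max(x, 0.0)


def relu_max2(a, b):
    """max(a, b) written with a single ReLU: a + ReLU(b - a)."""
    return a + relu(b - a)


def relu_max(values):
    """Maximum of a non-empty sequence via a pairwise ReLU tournament.

    Each pass of the loop corresponds to one layer of pairwise maxima,
    so d inputs need about log2(d) passes.
    """
    layer = list(values)
    if not layer:
        raise ValueError("values must be non-empty")
    while len(layer) > 1:
        next_layer = [
            relu_max2(layer[i], layer[i + 1])
            for i in range(0, len(layer) - 1, 2)
        ]
        if len(layer) % 2 == 1:
            next_layer.append(layer[-1])  # odd element passes through unchanged
        layer = next_layer
    return layer[0]


print(relu_max([3.0, -1.0, 7.5, 2.0, 0.0]))
```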

arXiv:2206.02348 [pdf, other]

Finite-Sample Maximum Likelihood Estimation of Location

Authors: Shivam Gupta, Jasper C. H. Lee, Eric Price, Paul Valiant

Abstract: We consider 1-dimensional location estimation, where we estimate a parameter $λ$ from $n$ samples $λ+ η_i$, with each $η_i$ drawn i.i.d. from a known distribution $f$. For fixed $f$ the maximum-likelihood estimate (MLE) is well-known to be optimal in the limit as $n \to \infty$: it is asymptotically normal with variance matching the Cramér-Rao lower bound of $\frac{1}{n\mathcal{I}}$, where $\mathcal{I}$ is the Fisher information of $f$. However, this bound does not hold for finite $n$, or when $f$ varies with $n$. We show for arbitrary $f$ and $n$ that one can recover a similar theory based on the Fisher information of a smoothed version of $f$, where the smoothing radius decays with $n$.

Submitted 18 July, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

Comments: Corrected an inaccuracy in the description of the experimental setup. Also updated funding acknowledgements

arXiv:2011.08384 [pdf, ps, other]

Optimal Sub-Gaussian Mean Estimation in $\mathbb{R}$

Authors: Jasper C. H. Lee, Paul Valiant

Abstract: We revisit the problem of estimating the mean of a real-valued distribution, presenting a novel estimator with sub-Gaussian convergence: intuitively, "our estimator, on any distribution, is as accurate as the sample mean is for the Gaussian distribution of matching variance." Crucially, in contrast to prior works, our estimator does not require prior knowledge of the variance, and works across the entire gamut of distributions with bounded variance, including those without any higher moments. Parameterized by the sample size $n$, the failure probability $δ$, and the variance $σ^2$, our estimator is accurate to within $σ\cdot(1+o(1))\sqrt{\frac{2\log\frac{1}δ}{n}}$, tight up to the $1+o(1)$ factor. Our estimator construction and analysis gives a framework generalizable to other problems, tightly analyzing a sum of dependent random variables by viewing the sum implicitly as a 2-parameter $ψ$-estimator, and constructing bounds using mathematical programming and duality techniques.

Submitted 16 November, 2020; originally announced November 2020.
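
Illustration (not from the paper): to get a feel for the accuracy stated in the abstract above, this sketch evaluates the leading term $σ\sqrt{\frac{2\log\frac{1}δ}{n}}$, dropping the $1+o(1)$ factor and assuming the natural logarithm. It only computes the bound; it does not implement the estimator. The function name `sub_gaussian_radius` is hypothetical.

```python
import math


def sub_gaussian_radius(sigma, n, delta):
    """Leading term sigma * sqrt(2 * ln(1/delta) / n) of the sub-Gaussian error bound.

    The (1 + o(1)) factor from the abstract is omitted.
    """
    if sigma < 0 or n <= 0 or not 0 < delta < 1:
        raise ValueError("need sigma >= 0, n > 0 and 0 < delta < 1")
    return sigma * math.sqrt(2 * math.log(1 / delta) / n)


# Quadrupling the sample size halves the error radius.
print(sub_gaussian_radius(sigma=1.0, n=100, delta=0.05))
print(sub_gaussian_radius(sigma=1.0, n=400, delta=0.05))
```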

arXiv:1911.12289 [pdf, ps, other]

New relations for energy flow in terms of vorticity

Authors: Paul Valiant

Abstract: Considering the vorticity formulation of the Euler equations, we partition the kinetic energy into its contribution from each pair of interacting vortices. We call this contribution the "interaction energy". We show that each contribution satisfies a reciprocity relation on triples of vortices: $\boldsymbol{A}$'s action on $\boldsymbol{B}$ changes the interaction energy between $\boldsymbol{B}$ and $\boldsymbol{C}$ in an equal and opposite way to the effect of $\boldsymbol{C}$'s action on $\boldsymbol{B}$ on the interaction energy between $\boldsymbol{A}$ and $\boldsymbol{B}$. This result is a curiously detailed accounting of energy flow, as contrasted to standard pointwise conservation laws in fluid dynamics. This result holds for all triples of points $\boldsymbol{A},\boldsymbol{B},\boldsymbol{C}$ in two dimensions; and in 3 dimensions for all points $\boldsymbol{A},\boldsymbol{C}$, and all closed vorticity streamlines $\boldsymbol{B}$. We show this result in 3 dimensions as a consequence of an interaction energy flow around $\boldsymbol{B}$ that is a function only of the triple $(\boldsymbol{A},\boldsymbol{b}\in \boldsymbol{B},\boldsymbol{C})$, a result which may be of independent interest.

Submitted 27 November, 2019; originally announced November 2019.

arXiv:1911.03605 [pdf, other]

Worst-Case Analysis for Randomly Collected Data

Authors: Justin Y. Chen, Gregory Valiant, Paul Valiant

Abstract: We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements $[n]=\{1,\ldots,n\}$ with corresponding data values $x_1,\ldots,x_n$. We observe the values for a "sample" set $A \subset [n]$ and wish to estimate some statistic of the values for a "target" set $B \subset [n]$ where $B$ could be the entire set. Crucially, we assume that the sets $A$ and $B$ are drawn according to some known distribution $P$ over pairs of subsets of $[n]$. A given estimation algorithm is evaluated based on its "worst-case, expected error" where the expectation is with respect to the distribution $P$ from which the sample $A$ and target sets $B$ are drawn, and the worst-case is with respect to the data values $x_1,\ldots,x_n$. Within this framework, we give an efficient algorithm for estimating the target mean that returns a weighted combination of the sample values--where the weights are functions of the distribution $P$ and the sample and target sets $A$, $B$--and show that the worst-case expected error achieved by this algorithm is at most a multiplicative $π/2$ factor worse than the optimal of such algorithms. The algorithm and proof leverage a surprising connection to the Grothendieck problem.
This framework, which makes no distributional assumptions on the data values but rather relies on knowledge of the data collection process, is a significant departure from typical estimation and introduces a uniform algorithmic analysis for the many natural settings where membership in a sample may be correlated with data values, such as when sampling probabilities vary as in "importance sampling", when individuals are recruited into a sample via a social network as in "snowball sampling", or when samples have chronological structure as in "selective prediction".

Submitted 26 October, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

arXiv:1904.09228 [pdf, other]

Uncertainty about Uncertainty: Optimal Adaptive Algorithms for Estimating Mixtures of Unknown Coins

Authors: Jasper C. H. Lee, Paul Valiant

Abstract: Given a mixture between two populations of coins, "positive" coins that each have -- unknown and potentially different -- bias $\geq\frac{1}{2}+Δ$ and "negative" coins with bias $\leq\frac{1}{2}-Δ$, we consider the task of estimating the fraction $ρ$ of positive coins to within additive error $ε$. We achieve an upper and lower bound of $Θ(\fracρ{ε^2Δ^2}\log\frac{1}δ)$ samples for a $1-δ$ probability of success, where crucially, our lower bound applies to all fully-adaptive algorithms. Thus, our sample complexity bounds have tight dependence for every relevant problem parameter. A crucial component of our lower bound proof is a decomposition lemma (see Lemmas 17 and 18) showing how to assemble partially-adaptive bounds into a fully-adaptive bound, which may be of independent interest: though we invoke it for the special case of Bernoulli random variables (coins), it applies to general distributions. We present simulation results to demonstrate the practical efficacy of our approach for realistic problem parameters for crowdsourcing applications, focusing on the "rare events" regime where $ρ$ is small. The fine-grained adaptive flavor of both our algorithm and lower bound contrasts with much previous work in distributional testing and learning.

Submitted 5 February, 2021; v1 submitted 19 April, 2019; originally announced April 2019.

Comments: Full paper updated to reflect the new result in our SODA 2021 proceedings version: our new sample complexity lower bound includes dependence on the failure probability, and hence is simultaneously tight in all of the problem parameters up to a constant multiplicative factor

arXiv:1904.09080 [pdf, other]

Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process

Authors: Guy Blanc, Neha Gupta, Gregory Valiant, Paul Valiant

Abstract: We consider networks, trained via stochastic gradient descent to minimize $\ell_2$ loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term corresponding to the sum over the data points, of the squared $\ell_2$ norm of the gradient of the model with respect to the parameter vector, evaluated at each data point. This holds for networks of any connectivity, width, depth, and choice of activation function. We interpret this implicit regularization term for three simple settings: matrix sensing, two layer ReLU networks trained on one-dimensional data, and two layer networks with sigmoid activations trained on a single datapoint. For these settings, we show why this new and general implicit regularization effect drives the networks towards "simple" models.

Submitted 22 July, 2020; v1 submitted 19 April, 2019; originally announced April 2019.

arXiv:1605.02646 [pdf, other]

Information Theoretically Secure Databases

Authors: Gregory Valiant, Paul Valiant

Abstract: We introduce the notion of a database system that is information theoretically "Secure In Between Accesses"--a database system with the properties that 1) users can efficiently access their data, and 2) while a user is not accessing their data, the user's information is information theoretically secure to malicious agents, provided that certain requirements on the maintenance of the database are realized. We stress that the security guarantee is information theoretic and everlasting: it relies neither on unproved hardness assumptions, nor on the assumption that the adversary is computationally or storage bounded. We propose a realization of such a database system and prove that a user's stored information, in between times when it is being legitimately accessed, is information theoretically secure both to adversaries who interact with the database in the prescribed manner, as well as to adversaries who have installed a virus that has access to the entire database and communicates with the adversary. The central idea behind our design is the construction of a "re-randomizing database" that periodically changes the internal representation of the information that is being stored. To ensure security, these remappings of the representation of the data must be made sufficiently often in comparison to the amount of information that is being communicated from the database between remappings and the amount of local memory in the database that a virus may preserve during the remappings.
The core of the proof of the security guarantee is the following communication/data tradeoff for the problem of learning sparse parities from uniformly random $n$-bit examples: any algorithm that can learn a parity of size $k$ with probability at least $p$ and extracts at most $r$ bits of information from each example, must see at least $p\cdot \left(\frac{n}{r}\right)^{k/2} c_k$ examples.

Submitted 9 May, 2016; originally announced May 2016.

Comments: 16 pages, 2 figures

arXiv:1512.07898 [pdf, ps, other]

doi 10.1017/jfm.2016.573

Eroding dipoles and vorticity growth for Euler flows in $ \scriptstyle{\mathbb{R}}^3$ I. Axisymmetric flow without swirl

Authors: Stephen Childress, Andrew D. Gilbert, Paul Valiant

Abstract: A review of analyses based upon anti-parallel vortex structures suggests that structurally stable vortex structures with eroding circulation may offer a path to the study of rapid vorticity growth in solutions of Euler's equations in $ \scriptstyle{\mathbb{R}}^3$. We examine here the possible formation of such a structure in axisymmetric flow without swirl, leading to maximal growth of vorticity as $t^{4/3}$. Our study suggests that the optimizing flow giving the $t^{4/3}$ growth mimics an exact solution of Euler's equations representing an eroding toroidal vortex dipole which locally conserves kinetic energy. The dipole cross-section is a perturbation of the classical Sadovskii dipole having piecewise constant vorticity, which breaks the symmetry of closed streamlines. The structure of this perturbed Sadovskii dipole is analyzed asymptotically at large times, and its predicted properties are verified numerically.

Submitted 24 December, 2015; originally announced December 2015.

Comments: 33 pages, 11 figures

MSC Class: 76B47

arXiv:1511.04466 [pdf, other]

Optimizing Star-Convex Functions

Authors: Jasper C. H. Lee, Paul Valiant

Abstract: We introduce a polynomial time algorithm for optimizing the class of star-convex functions, under no restrictions except boundedness on a region about the origin, and Lebesgue measurability. The algorithm's performance is polynomial in the requested number of digits of accuracy, contrasting with the previous best known algorithm of Nesterov and Polyak that has exponential dependence, and that further requires Lipschitz second differentiability of the function, but has milder dependence on the dimension of the domain. Star-convex functions constitute a rich class of functions generalizing convex functions to new parameter regimes, and which confound standard variants of gradient descent; more generally, we construct a family of star-convex functions where gradient-based algorithms provably give no information about the location of the global optimum. We introduce a new randomized algorithm for finding cutting planes based only on function evaluations, where, counterintuitively, the algorithm must look outside the feasible region to discover the structure of the star-convex function that lets it compute the next cut of the feasible region. We emphasize that the class of star-convex functions we consider is as unrestricted as possible: the class of Lebesgue measurable star-convex functions has theoretical appeal, introducing to the domain of polynomial-time algorithms a huge class with many interesting pathologies.
We view our results as a step forward in understanding the scope of optimization techniques beyond the garden of convex optimization and local gradient-based methods.

Submitted 11 May, 2016; v1 submitted 13 November, 2015; originally announced November 2015.

Comments: 30 pages (including appendices)

arXiv:1504.05321 [pdf, ps, other]

Instance Optimal Learning

Authors: Gregory Valiant, Paul Valiant

Abstract: We consider the following basic learning task: given independent draws from an unknown distribution over a discrete support, output an approximation of the distribution that is as accurate as possible in $\ell_1$ distance (i.e. total variation or statistical distance). Perhaps surprisingly, it is often possible to "de-noise" the empirical distribution of the samples to return an approximation of the true distribution that is significantly more accurate than the empirical distribution, without relying on any prior assumptions on the distribution. We present an instance optimal learning algorithm which optimally performs this de-noising for every distribution for which such a de-noising is possible. More formally, given $n$ independent draws from a distribution $p$, our algorithm returns a labelled vector whose expected distance from $p$ is equal to the minimum possible expected error that could be obtained by any algorithm that knows the true unlabeled vector of probabilities of distribution $p$ and simply needs to assign labels, up to an additive subconstant term that is independent of $p$ and goes to zero as $n$ gets large. One conceptual implication of this result is that for large samples, Bayesian assumptions on the "shape" or bounds on the tail probabilities of a distribution over discrete support are not helpful for the task of learning the distribution.
As a consequence of our techniques, we also show that given a set of $n$ samples from an arbitrary distribution, one can accurately estimate the expected number of distinct elements that will be observed in a sample of any size up to $n \log n$. This sort of extrapolation is practically relevant, particularly to domains such as genomics where it is important to understand how much more might be discovered given larger sample sizes, and we are optimistic that our approach is practically viable.

Submitted 11 November, 2015; v1 submitted 21 April, 2015; originally announced April 2015.

arXiv:1308.3946 [pdf, ps, other]

Optimal Algorithms for Testing Closeness of Discrete Distributions

Authors: Siu-On Chan, Ilias Diakonikolas, Gregory Valiant, Paul Valiant

Abstract: We study the question of closeness testing for two discrete distributions. More precisely, given samples from two distributions $p$ and $q$ over an $n$-element set, we wish to distinguish whether $p=q$ versus $p$ is at least $\eps$-far from $q$, in either $\ell_1$ or $\ell_2$ distance. Batu et al. gave the first sub-linear time algorithms for these problems, which matched the lower bounds of Valiant up to a logarithmic factor in $n$, and a polynomial factor of $\eps.$ In this work, we present simple (and new) testers for both the $\ell_1$ and $\ell_2$ settings, with sample complexity that is information-theoretically optimal, to constant factors, both in the dependence on $n$, and the dependence on $\eps$; for the $\ell_1$ testing problem we establish that the sample complexity is $Θ(\max\{n^{2/3}/\eps^{4/3}, n^{1/2}/\eps^2 \}).$

Submitted 19 August, 2013; originally announced August 2013.
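
Illustration (not from the paper): the $\ell_1$ sample complexity in the abstract above is a maximum of two terms, and which one dominates depends on how $\eps$ compares with $n^{-1/4}$. This sketch evaluates the expression inside the $Θ(\cdot)$ with the hidden constant taken to be 1, which is an assumption made purely for illustration. The function name `closeness_l1_sample_scale` is hypothetical.

```python
def closeness_l1_sample_scale(n, eps):
    """Evaluate max(n^(2/3) / eps^(4/3), n^(1/2) / eps^2).

    This is the expression inside the Theta(.) bound; the hidden constant
    is assumed to be 1 here, so only the scaling is meaningful.
    """
    if n <= 0 or eps <= 0:
        raise ValueError("n and eps must be positive")
    first_term = n ** (2 / 3) / eps ** (4 / 3)
    second_term = n ** 0.5 / eps ** 2
    return max(first_term, second_term)


# For eps above n^(-1/4) the first term dominates; below it, the second does.
print(closeness_l1_sample_scale(64, 1.0))
print(closeness_l1_sample_scale(64, 0.01))
```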

arXiv:1112.5659 [pdf, ps, other]

Testing $k$-Modal Distributions: Optimal Algorithms via Reductions

Authors: Constantinos Daskalakis, Ilias Diakonikolas, Rocco A. Servedio, Gregory Valiant, Paul Valiant

Abstract: We give highly efficient algorithms, and almost matching lower bounds, for a range of basic statistical problems that involve testing and estimating the L_1 distance between two k-modal distributions $p$ and $q$ over the discrete domain $\{1,\dots,n\}$. More precisely, we consider the following four problems: given sample access to an unknown k-modal distribution $p$, Testing identity to a known or unknown distribution: 1. Determine whether $p = q$ (for an explicitly given k-modal distribution $q$) versus $p$ is $\eps$-far from $q$; 2. Determine whether $p=q$ (where $q$ is available via sample access) versus $p$ is $\eps$-far from $q$; Estimating $L_1$ distance ("tolerant testing") against a known or unknown distribution: 3. Approximate $d_{TV}(p,q)$ to within additive $\eps$ where $q$ is an explicitly given k-modal distribution $q$; 4. Approximate $d_{TV}(p,q)$ to within additive $\eps$ where $q$ is available via sample access. For each of these four problems we give sub-logarithmic sample algorithms, that we show are tight up to additive $\poly(k)$ and multiplicative $\polylog\log n+\polylog k$ factors. Thus our bounds significantly improve the previous results of \cite{BKR:04}, which were for testing identity of distributions (items (1) and (2) above) in the special cases k=0 (monotone distributions) and k=1 (unimodal distributions) and required $O((\log n)^3)$ samples. As our main conceptual contribution, we introduce a new reduction-based approach for distribution-testing problems that lets us obtain all the above results in a unified way.
Roughly speaking, this approach enables us to transform various distribution testing problems for k-modal distributions over $\{1,\dots,n\}$ to the corresponding distribution testing problems for unrestricted distributions over a much smaller domain $\{1,\dots,\ell\}$ where $\ell = O(k \log n).$

Submitted 23 December, 2011; originally announced December 2011.

arXiv:0909.2030

Size Bounds for Conjunctive Queries with General Functional Dependencies

Authors: Gregory Valiant, Paul Valiant

Abstract: This paper extends the work of Gottlob, Lee, and Valiant (PODS 2009) [GLV], and considers worst-case bounds for the size of the result Q(D) of a conjunctive query Q to a database D given an arbitrary set of functional dependencies. The bounds in [GLV] are based on a "coloring" of the query variables. In order to extend the previous bounds to the setting of arbitrary functional dependencies, we leverage tools from information theory to formalize the original intuition that each color used represents some possible entropy of that variable, and bound the maximum possible size increase via a linear program that seeks to maximize how much more entropy is in the result of the query than the input. This new view allows us to precisely characterize the entropy structure of worst-case instances for conjunctive queries with simple functional dependencies (keys), providing new insights into the results of [GLV]. We extend these results to the case of general functional dependencies, providing upper and lower bounds on the worst-case size increase. We identify the fundamental connection between the gap in these bounds and a central open question in information theory. Finally, we show that, while both the upper and lower bounds are given by exponentially large linear programs, one can distinguish in polynomial time whether the result of a query with an arbitrary set of functional dependencies can be any larger than the input database.

Submitted 12 December, 2009; v1 submitted 10 September, 2009; originally announced September 2009.

Comments: 22 pages, 2 figures

ACM Class: H.2.4; F.2.0

arXiv:0802.1604

On the Complexity of Nash Equilibria of Action-Graph Games

Authors: Constantinos Daskalakis, Grant Schoenebeck, Gregory Valiant, Paul Valiant

Abstract: We consider the problem of computing Nash equilibria of action-graph games (AGGs). AGGs, introduced by Bhat and Leyton-Brown, are a succinct representation of games that encapsulates both "local" dependencies, as in graphical games, and partial indifference to other agents' identities, as in anonymous games, both of which occur in many natural settings. This is achieved by specifying a graph on the set of actions, so that the payoff of an agent for selecting a strategy depends only on the number of agents playing each of the neighboring strategies in the action graph. We present a Polynomial Time Approximation Scheme (PTAS) for computing mixed Nash equilibria of AGGs with constant treewidth and a constant number of agent types (and an arbitrary number of strategies), together with hardness results for the cases when either the treewidth or the number of agent types is unconstrained. In particular, we show that even if the action graph is a tree, but the number of agent types is unconstrained, it is NP-complete to decide the existence of a pure-strategy Nash equilibrium and PPAD-complete to compute a mixed Nash equilibrium (even an approximate one); the same holds for symmetric AGGs (all agents belong to a single type) if we allow arbitrary treewidth. These hardness results suggest that, in some sense, our PTAS is as strong a positive result as one can expect.

Submitted 12 February, 2008; originally announced February 2008.

arXiv:quant-ph/0211179

Comparing EQP and MOD_{p^k}P using Polynomial Degree Lower Bounds

Authors: M. de Graaf, P. Valiant

Abstract: We show that an oracle A that contains either 1/4 or 3/4 of all strings of length n can be used to separate EQP from the counting classes MOD_{p^k}P. Our proof makes use of the degree of a representing polynomial over the finite field of size p^k. We show a linear lower bound on the degree of this polynomial. We also show an upper bound of O(n^{1/log_p m}) on the degree over the ring of integers modulo m, whenever m is a squarefree composite with largest prime factor p.

Submitted 27 November, 2002; originally announced November 2002.

Comments: 10 pages, no figures

Showing 1–19 of 19 results for author: Valiant, P