Improved bounds for calibration via stronger sign preservation games

Authors: Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich, Robert Kleinberg, Princewill Okoroafor

Abstract: A set of probabilistic forecasts is calibrated if each prediction of the forecaster closely approximates the empirical distribution of outcomes on the subset of timesteps where that prediction was made. We study the fundamental problem of online calibrated forecasting of binary sequences, which was initially studied by Foster & Vohra (1998). They derived an algorithm with $O(T^{2/3})$ calibration error after $T$ time steps, and showed a lower bound of $Ω(T^{1/2})$. These bounds remained stagnant for two decades, until Qiao & Valiant (2021) improved the lower bound to $Ω(T^{0.528})$ by introducing a combinatorial game called sign preservation and showing that lower bounds for this game imply lower bounds for calibration. We introduce a strengthening of Qiao & Valiant's game that we call sign preservation with reuse (SPR). We prove that the relationship between SPR and calibrated forecasting is bidirectional: not only do lower bounds for SPR translate into lower bounds for calibration, but algorithms for SPR also translate into new algorithms for calibrated forecasting. In particular, any strategy that improves the trivial upper bound for the value of the SPR game would imply a forecasting algorithm with calibration error exponent less than 2/3, improving Foster & Vohra's upper bound for the first time. Using similar ideas, we then prove a slightly stronger lower bound than that of Qiao & Valiant, namely $Ω(T^{0.54389})$. Our lower bound is obtained by an oblivious adversary, marking the first $ω(T^{1/2})$ calibration lower bound for oblivious adversaries.

Submitted 19 June, 2024; originally announced June 2024.
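
The calibration notion described in the abstract above can be made concrete with a small sketch. The abstract does not state a formula, so the code below assumes the commonly used ℓ1 calibration error: for each distinct forecast value $p$, sum the residuals $y_t - p$ over the time steps where $p$ was predicted, then add up the absolute values of those sums. The function name `l1_calibration_error` is hypothetical and is not taken from the paper.

```python
from collections import defaultdict


def l1_calibration_error(forecasts, outcomes):
    """Assumed l1 calibration error: sum over distinct p of |sum_{t: p_t = p} (y_t - p)|.

    forecasts: sequence of predicted probabilities p_t in [0, 1]
    outcomes:  sequence of binary outcomes y_t in {0, 1}
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")

    # Accumulate the signed residual (y_t - p_t) separately for each forecast value.
    residual_by_forecast = defaultdict(float)
    for p, y in zip(forecasts, outcomes):
        residual_by_forecast[p] += y - p

    # A forecaster is well calibrated when every per-value residual sum is small.
    return sum(abs(r) for r in residual_by_forecast.values())
```

For instance, predicting 0.5 on four steps whose outcomes are 1, 0, 1, 0 gives zero error under this definition, while predicting 0.5 on two steps that both come up 1 gives an error of 1.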

arXiv:2406.03855 [pdf, other]

Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As

Authors: Eden Avnat, Michal Levy, Daniel Herstain, Elia Yanko, Daniel Ben Joya, Michal Tzuchman Katz, Dafna Eshel, Sahar Laros, Yael Dagan, Shahar Barami, Joseph Mermelstein, Shahar Ovadia, Noam Shomron, Varda Shalev, Raja-Elie E. Abdulnour

Abstract: Clinical problem-solving requires processing of semantic medical knowledge such as illness scripts and numerical medical knowledge of diagnostic tests for evidence-based decision-making. As large language models (LLMs) show promising results in many aspects of language-based clinical practice, their ability to generate non-language evidence-based answers to clinical questions is inherently limited by tokenization. Therefore, we evaluated LLMs' performance on two question types: numeric (correlating findings) and semantic (differentiating entities) while examining differences within and between LLMs in medical aspects and comparing their performance to humans. To generate straightforward multi-choice questions and answers (QAs) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (encompassing data from more than 50,00 peer-reviewed articles) and created the "EBMQA". EBMQA contains 105,000 QAs labeled with medical and non-medical topics and classified into numerical or semantic questions. We benchmarked this dataset using more than 24,500 QAs on two state-of-the-art LLMs: Chat-GPT4 and Claude3-Opus. We evaluated the LLMs' accuracy on semantic and numerical question types and according to sub-labeled topics. For validation, six medical experts were tested on 100 numerical EBMQA questions. We found that both LLMs excelled more in semantic than numerical QAs, with Claude3 surpassing GPT4 in numerical QAs. However, both LLMs showed inter and intra gaps in different medical aspects and remained inferior to humans. Thus, their medical advice should be addressed carefully.

Submitted 1 July, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

arXiv:2310.19786 [pdf, ps, other]

From External to Swap Regret 2.0: An Efficient Reduction and Oblivious Adversary for Large Action Spaces

Authors: Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich

Abstract: We provide a novel reduction from swap-regret minimization to external-regret minimization, which improves upon the classical reductions of Blum-Mansour [BM07] and Stoltz-Lugosi [SL05] in that it does not require finiteness of the space of actions. We show that, whenever there exists a no-external-regret algorithm for some hypothesis class, there must also exist a no-swap-regret algorithm for that same class. For the problem of learning with expert advice, our result implies that it is possible to guarantee that the swap regret is bounded by ε after $\log(N)^{O(1/ε)}$ rounds and with $O(N)$ per iteration complexity, where $N$ is the number of experts, while the classical reductions of Blum-Mansour and Stoltz-Lugosi require $O(N/ε^2)$ rounds and at least $Ω(N^2)$ per iteration complexity. Our result comes with an associated lower bound, which -- in contrast to that in [BM07] -- holds for oblivious and $\ell_1$-constrained adversaries and learners that can employ distributions over experts, showing that the number of rounds must be $\tilde{Ω}(N/ε^2)$ or exponential in $1/ε$. Our reduction implies that, if no-regret learning is possible in some game, then this game must have approximate correlated equilibria, of arbitrarily good approximation. This strengthens the folklore implication of no-regret learning that approximate coarse correlated equilibria exist. Importantly, it provides a sufficient condition for the existence of correlated equilibrium which vastly extends the requirement that the action set is finite, thus answering a question left open by [DG22; Ass+23]. Moreover, it answers several outstanding questions about equilibrium computation and learning in games.

Submitted 6 December, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

arXiv:2307.01689 [pdf, ps, other]

Online Learning and Solving Infinite Games with an ERM Oracle

Authors: Angelos Assos, Idan Attias, Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson

Abstract: While ERM suffices to attain near-optimal generalization error in the stochastic learning setting, this is not known to be the case in the online learning setting, where algorithms for general concept classes rely on computationally inefficient oracles such as the Standard Optimal Algorithm (SOA). In this work, we propose an algorithm for the online binary classification setting that relies solely on ERM oracle calls, and show that it has finite regret in the realizable setting and sublinearly growing regret in the agnostic setting. We bound the regret in terms of the Littlestone and threshold dimensions of the underlying concept class. We obtain similar results for nonparametric games, where the ERM oracle can be interpreted as a best response oracle, finding the best response of a player to a given history of play of the other players. In this setting, we provide learning algorithms that only rely on best response oracles and converge to approximate-minimax equilibria in two-player zero-sum games and approximate coarse correlated equilibria in multi-player general-sum games, as long as the game has a bounded fat-threshold dimension. Our algorithms apply to both binary-valued and real-valued games and can be viewed as providing justification for the wide use of double oracle and multiple oracle algorithms in the practice of solving large games.

Submitted 10 July, 2023; v1 submitted 4 July, 2023; originally announced July 2023.

Comments: In COLT2023

arXiv:2305.19256 [pdf, other]

Ambient Diffusion: Learning Clean Distributions from Corrupted Data

Authors: Giannis Daras, Kulin Shah, Yuval Dagan, Aravind Gollakota, Alexandros G. Dimakis, Adam Klivans

Abstract: We present the first diffusion-based framework that can learn an unknown distribution using only highly-corrupted samples. This problem arises in scientific applications where access to uncorrupted samples is impossible or expensive to acquire. Another benefit of our approach is the ability to train generative models that are less likely to memorize individual training samples since they never observe clean training data. Our main idea is to introduce additional measurement distortion during the diffusion process and require the model to predict the original corrupted image from the further corrupted image. We prove that our method leads to models that learn the conditional expectation of the full uncorrupted image given this additional measurement corruption. This holds for any corruption process that satisfies some technical conditions (and in particular includes inpainting and compressed sensing). We train models on standard benchmarks (CelebA, CIFAR-10 and AFHQ) and show that we can learn the distribution even when all the training samples have $90\%$ of their pixels missing. We also show that we can finetune foundation models on small corrupted datasets (e.g. MRI scans with block corruptions) and learn the clean distribution without memorizing the training set.

Submitted 30 May, 2023; originally announced May 2023.

Comments: 24 pages, 11 figures

arXiv:2302.09057 [pdf, other]

Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be Consistent

Authors: Giannis Daras, Yuval Dagan, Alexandros G. Dimakis, Constantinos Daskalakis

Abstract: Imperfect score-matching leads to a shift between the training and the sampling distribution of diffusion models. Due to the recursive nature of the generation process, errors in previous steps yield sampling iterates that drift away from the training distribution. Yet, the standard training objective via Denoising Score Matching (DSM) is only designed to optimize over non-drifted data. To train on drifted data, we propose to enforce a consistency property which states that predictions of the model on its own generated data are consistent across time. Theoretically, we show that if the score is learned perfectly on some non-drifted points (via DSM) and if the consistency property is enforced everywhere, then the score is learned accurately everywhere. Empirically we show that our novel training objective yields state-of-the-art results for conditional and unconditional generation in CIFAR-10 and baseline improvements in AFHQ and FFHQ. We open-source our code and models: https://github.com/giannisdaras/cdm

Submitted 17 February, 2023; originally announced February 2023.

Comments: 29 pages, 8 figures

arXiv:2211.13291 [pdf, ps, other]

Learning and Testing Latent-Tree Ising Models Efficiently

Authors: Davin Choo, Yuval Dagan, Constantinos Daskalakis, Anthimos Vardis Kandiros

Abstract: We provide time- and sample-efficient algorithms for learning and testing latent-tree Ising models, i.e. Ising models that may only be observed at their leaf nodes. On the learning side, we obtain efficient algorithms for learning a tree-structured Ising model whose leaf-node distribution is close in total variation distance, improving on the results of prior work. On the testing side, we provide an efficient algorithm with fewer samples for testing whether two latent-tree Ising models have leaf-node distributions that are close or far in total variation distance. We obtain our algorithms by showing novel localization results for the total variation distance between the leaf-node distributions of tree-structured Ising models, in terms of their marginals on pairs of leaves.

Submitted 10 July, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

arXiv:2211.11904 [pdf, ps, other]

EM's Convergence in Gaussian Latent Tree Models

Authors: Yuval Dagan, Constantinos Daskalakis, Anthimos Vardis Kandiros

Abstract: We study the optimization landscape of the log-likelihood function and the convergence of the Expectation-Maximization (EM) algorithm in latent Gaussian tree models, i.e. tree-structured Gaussian graphical models whose leaf nodes are observable and non-leaf nodes are unobservable. We show that the unique non-trivial stationary point of the population log-likelihood is its global maximum, and establish that the expectation-maximization algorithm is guaranteed to converge to it in the single latent variable case. Our results for the landscape of the log-likelihood function in general latent tree models provide support for the extensive practical use of maximum-likelihood-based methods in this setting. Our results for the EM algorithm extend an emerging line of work on obtaining global convergence guarantees for this celebrated algorithm. We show our results for the non-trivial stationary points of the log-likelihood by arguing that a certain system of polynomial equations obtained from the EM updates has a unique non-trivial solution. The global convergence of the EM algorithm follows by arguing that all trivial fixed points are higher-order saddle points.

Submitted 23 November, 2022; v1 submitted 21 November, 2022; originally announced November 2022.

arXiv:2206.09104 [pdf, other]

Score-Guided Intermediate Layer Optimization: Fast Langevin Mixing for Inverse Problems

Authors: Giannis Daras, Yuval Dagan, Alexandros G. Dimakis, Constantinos Daskalakis

Abstract: We prove fast mixing and characterize the stationary distribution of the Langevin Algorithm for inverting random weighted DNN generators. This result extends the work of Hand and Voroninski from efficient inversion to efficient posterior sampling. In practice, to allow for increased expressivity, we propose to do posterior sampling in the latent space of a pre-trained generative model. To achieve that, we train a score-based model in the latent space of a StyleGAN-2 and we use it to solve inverse problems. Our framework, Score-Guided Intermediate Layer Optimization (SGILO), extends prior work by replacing the sparsity regularization with a generative prior in the intermediate layer. Experimentally, we obtain significant improvements over the previous state-of-the-art, especially in the low measurement regime.

Submitted 22 June, 2022; v1 submitted 17 June, 2022; originally announced June 2022.

Comments: Accepted to ICML 2022. 32 pages, 9 Figures

arXiv:2202.04690 [pdf, ps, other]

Smoothed Online Learning is as Easy as Statistical Learning

Authors: Adam Block, Yuval Dagan, Noah Golowich, Alexander Rakhlin

Abstract: Much of modern learning theory has been split between two regimes: the classical offline setting, where data arrive independently, and the online setting, where data arrive adversarially. While the former model is often both computationally and statistically tractable, the latter requires no distributional assumptions. In an attempt to achieve the best of both worlds, previous work proposed the smooth online setting where each sample is drawn from an adversarially chosen distribution, which is smooth, i.e., it has a bounded density with respect to a fixed dominating measure. We provide tight bounds on the minimax regret of learning a nonparametric function class, with nearly optimal dependence on both the horizon and smoothness parameters. Furthermore, we provide the first oracle-efficient, no-regret algorithms in this setting. In particular, we propose an oracle-efficient improper algorithm whose regret achieves optimal dependence on the horizon and a proper algorithm requiring only a single oracle call per round whose regret has the optimal horizon dependence in the classification setting and is sublinear in general. Both algorithms have exponentially worse dependence on the smoothness parameter of the adversary than the minimax rate. We then prove a lower bound on the oracle complexity of any proper learning algorithm, which matches the oracle-efficient upper bounds up to a polynomial factor, thus demonstrating the existence of a statistical-computational gap in smooth online learning. Finally, we apply our results to the contextual bandit setting to show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits in the case that contexts arrive in a smooth manner.

Submitted 31 May, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

arXiv:2107.09773 [pdf, other]

Statistical Estimation from Dependent Data

Authors: Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Surbhi Goel, Anthimos Vardis Kandiros

Abstract: We consider a general statistical estimation problem wherein binary labels across different observations are not independent conditioned on their feature vectors, but dependent, capturing settings where e.g. these observations are collected on a spatial domain, a temporal domain, or a social network, which induce dependencies. We model these dependencies in the language of Markov Random Fields and, importantly, allow these dependencies to be substantial, i.e., we do not assume that the Markov Random Field capturing these dependencies is in high temperature. As our main contribution we provide algorithms and statistically efficient estimation rates for this model, giving several instantiations of our bounds in logistic regression, sparse logistic regression, and neural network settings with dependent data. Our estimation guarantees follow from novel results for estimating the parameters (i.e. external fields and interaction strengths) of Ising models from a single sample. We evaluate our estimation approach on real networked data, showing that it outperforms standard regression approaches that ignore dependencies, across three text classification datasets: Cora, Citeseer and Pubmed.

Submitted 20 July, 2021; originally announced July 2021.

Comments: 41 pages, ICML 2021

arXiv:2102.01729 [pdf, ps, other]

Majorizing Measures, Sequential Complexities, and Online Learning

Authors: Adam Block, Yuval Dagan, Sasha Rakhlin

Abstract: We introduce the technique of generic chaining and majorizing measures for controlling sequential Rademacher complexity. We relate majorizing measures to the notion of fractional covering numbers, which we show to be dominated in terms of sequential scale-sensitive dimensions in a horizon-independent way, and, under additional complexity assumptions, establish a tight control on worst-case sequential Rademacher complexity in terms of the integral of sequential scale-sensitive dimension. Finally, we establish a tight contraction inequality for worst-case sequential Rademacher complexity. The above constitutes the resolution of a number of outstanding open problems in extending the classical theory of empirical processes to the sequential case, and, in turn, establishes sharp results for online learning.

Submitted 2 February, 2021; originally announced February 2021.

arXiv:2101.09054 [pdf, other]

Adversarial Laws of Large Numbers and Optimal Regret in Online Classification

Authors: Noga Alon, Omri Ben-Eliezer, Yuval Dagan, Shay Moran, Moni Naor, Eylon Yogev

Abstract: Laws of large numbers guarantee that given a large enough sample from some population, the measure of any fixed sub-population is well-estimated by its frequency in the sample. We study laws of large numbers in sampling processes that can affect the environment they are acting upon and interact with it. Specifically, we consider the sequential sampling model proposed by Ben-Eliezer and Yogev (2020), and characterize the classes which admit a uniform law of large numbers in this model: these are exactly the classes that are online learnable. Our characterization may be interpreted as an online analogue to the equivalence between learnability and uniform convergence in statistical (PAC) learning. The sample-complexity bounds we obtain are tight for many parameter regimes, and as an application, we determine the optimal regret bounds in online learning, stated in terms of Littlestone's dimension, thus resolving the main open question from Ben-David, Pál, and Shalev-Shwartz (2009), which was also posed by Rakhlin, Sridharan, and Tewari (2015).

Submitted 22 January, 2021; originally announced January 2021.

arXiv:2012.03817 [pdf, other]

A bounded-noise mechanism for differential privacy

Authors: Yuval Dagan, Gil Kur

Abstract: We present an asymptotically optimal $(ε,δ)$ differentially private mechanism for answering multiple, adaptively asked, $Δ$-sensitive queries, settling the conjecture of Steinke and Ullman [2020]. Our algorithm has a significant advantage that it adds independent bounded noise to each query, thus providing an absolute error bound. Additionally, we apply our algorithm in adaptive data analysis, obtaining an improved guarantee for answering multiple queries regarding some underlying distribution using a finite sample. Numerical computations show that the bounded-noise mechanism outperforms the Gaussian mechanism in many standard settings.

Submitted 6 November, 2021; v1 submitted 7 December, 2020; originally announced December 2020.

arXiv:2004.09370 [pdf, ps, other]

Learning Ising models from one or multiple samples

Authors: Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Anthimos Vardis Kandiros

Abstract: There have been two separate lines of work on estimating Ising models: (1) estimating them from multiple independent samples under minimal assumptions about the model's interaction matrix; and (2) estimating them from one sample in restrictive settings. We propose a unified framework that smoothly interpolates between these two settings, enabling significantly richer estimation guarantees from one, a few, or many samples. Our main theorem provides guarantees for one-sample estimation, quantifying the estimation error in terms of the metric entropy of a family of interaction matrices. As corollaries of our main theorem, we derive bounds when the model's interaction matrix is a (sparse) linear combination of known matrices, or it belongs to a finite set, or to a high-dimensional manifold. In fact, our main result handles multiple independent samples by viewing them as one sample from a larger model, and can be used to derive estimation bounds that are qualitatively similar to those obtained in the afore-described multiple-sample literature. Our technical approach benefits from sparsifying a model's interaction network, conditioning on subsets of variables that make the dependencies in the resulting conditional distribution sufficiently weak. We use this sparsification technique to prove strong concentration and anti-concentration results for the Ising model, which we believe have applications beyond the scope of this paper.

Submitted 10 December, 2020; v1 submitted 20 April, 2020; originally announced April 2020.

arXiv:1911.10541 [pdf, ps, other]

PAC learning with stable and private predictions

Authors: Yuval Dagan, Vitaly Feldman

Abstract: We study binary classification algorithms for which the prediction on any point is not too sensitive to individual examples in the dataset. Specifically, we consider the notions of uniform stability (Bousquet and Elisseeff, 2001) and prediction privacy (Dwork and Feldman, 2018). Previous work on these notions shows how they can be achieved in the standard PAC model via simple aggregation of models trained on disjoint subsets of data. Unfortunately, this approach leads to a significant overhead in terms of sample complexity. Here we demonstrate several general approaches to stable and private prediction that either eliminate or significantly reduce the overhead. Specifically, we demonstrate that for any class $C$ of VC dimension $d$ there exists a $γ$-uniformly stable algorithm for learning $C$ with excess error $α$ using $\tilde O(d/(αγ) + d/α^2)$ samples. We also show that this bound is nearly tight. For $ε$-differentially private prediction we give two new algorithms: one using $\tilde O(d/(α^2ε))$ samples and another one using $\tilde O(d^2/(αε) + d/α^2)$ samples. The best previously known bounds for these problems are $O(d/(α^2γ))$ and $O(d/(α^3ε))$, respectively.

Submitted 23 September, 2020; v1 submitted 24 November, 2019; originally announced November 2019.

arXiv:1911.04014 [pdf, ps, other]

Interaction is necessary for distributed learning with privacy or communication constraints

Authors: Yuval Dagan, Vitaly Feldman

Abstract: Local differential privacy (LDP) is a model where users send privatized data to an untrusted central server whose goal is to solve some data analysis task. In the non-interactive version of this model the protocol consists of a single round in which a server sends requests to all users and then receives their responses. This version is deployed in industry due to its practical advantages and has attracted significant research interest. Our main result is an exponential lower bound on the number of samples necessary to solve the standard task of learning a large-margin linear separator in the non-interactive LDP model. Via a standard reduction this lower bound implies an exponential lower bound for stochastic convex optimization and specifically, for learning linear models with a convex, Lipschitz and smooth loss. These results answer the questions posed in [SmithTU17, DanielyF18]. Our lower bound relies on a new technique for constructing pairs of distributions with nearly matching moments but whose supports can be nearly separated by a large margin hyperplane. These lower bounds also hold in the model where communication from each user is limited and follow from a lower bound on learning using non-adaptive statistical queries.

Submitted 23 September, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

arXiv:1906.09247 [pdf, ps, other]

Learning from weakly dependent data under Dobrushin's condition

Authors: Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Siddhartha Jayanti

Abstract: Statistical learning theory has largely focused on learning and generalization given independent and identically distributed (i.i.d.) samples. Motivated by applications involving time-series data, there has been a growing literature on learning and generalization in settings where data is sampled from an ergodic process. This work has also developed complexity measures, which appropriately extend the notion of Rademacher complexity to bound the generalization error and learning rates of hypothesis classes in this setting. Rather than time-series data, our work is motivated by settings where data is sampled on a network or a spatial domain, and thus do not fit well within the framework of prior work. We provide learning and generalization bounds for data that are complexly dependent, yet their distribution satisfies the standard Dobrushin's condition. Indeed, we show that the standard complexity measures of Gaussian and Rademacher complexities and VC dimension are sufficient measures of complexity for the purposes of bounding the generalization error and learning rates of hypothesis classes in our setting. Moreover, our generalization bounds only degrade by constant factors compared to their i.i.d. analogs, and our learnability bounds degrade by log factors in the size of the training set.

Submitted 21 June, 2019; originally announced June 2019.

arXiv:1903.05315 [pdf, ps, other]

Optimality of Maximum Likelihood for Log-Concave Density Estimation and Bounded Convex Regression

Authors: Gil Kur, Yuval Dagan, Alexander Rakhlin

Abstract: In this paper, we study two problems: (1) estimation of a $d$-dimensional log-concave distribution and (2) bounded multivariate convex regression with random design with an underlying log-concave density or a compactly supported distribution with a continuous density. First, we show that for all $d \ge 4$ the maximum likelihood estimators of both problems achieve an optimal risk of $Θ_d(n^{-2/(d+1)})$ (up to a logarithmic factor) in terms of squared Hellinger distance and $L_2$ squared distance, respectively. Previously, the optimality of both these estimators was known only for $d\le 3$. We also prove that the $ε$-entropy numbers of the two aforementioned families are equal up to logarithmic factors. We complement these results by proving a sharp bound $Θ_d(n^{-2/(d+4)})$ on the minimax rate (up to logarithmic factors) with respect to the total variation distance. Finally, we prove that estimating a log-concave density - even a uniform distribution on a convex set - up to a fixed accuracy requires the number of samples \emph{at least} exponential in the dimension. We do that by improving the dimensional constant in the best known lower bound for the minimax rate from $2^{-d}\cdot n^{-2/(d+1)}$ to $c\cdot n^{-2/(d+1)}$ (when $d\geq 2$).

Submitted 20 February, 2020; v1 submitted 13 March, 2019; originally announced March 2019.

MSC Class: 62G07 62G08

arXiv:1902.03498 [pdf, ps, other]

Space lower bounds for linear prediction in the streaming model

Authors: Yuval Dagan, Gil Kur, Ohad Shamir

Abstract: We show that fundamental learning tasks, such as finding an approximate linear separator or linear regression, require memory at least \emph{quadratic} in the dimension, in a natural streaming setting. This implies that such problems cannot be solved (at least in this setting) by scalable memory-efficient streaming algorithms. Our results build on a memory lower bound for a simple linear-algebraic problem -- finding orthogonal vectors -- and utilize the estimates on the packing of the Grassmannian, the manifold of all linear subspaces of fixed dimension.

Submitted 11 June, 2019; v1 submitted 9 February, 2019; originally announced February 2019.

Comments: Added a minor correction in referencing the prior work

arXiv:1811.02177 [pdf, ps, other]

The entropy of lies: playing twenty questions with a liar

Authors: Yuval Dagan, Yuval Filmus, Daniel Kane, Shay Moran

Abstract: `Twenty questions' is a guessing game played by two players: Bob thinks of an integer between $1$ and $n$, and Alice's goal is to recover it using a minimal number of Yes/No questions. Shannon's entropy has a natural interpretation in this context. It characterizes the average number of questions used by an optimal strategy in the distributional variant of the game: let $μ$ be a distribution over $[n]$, then the average number of questions used by an optimal strategy that recovers $x\sim μ$ is between $H(μ)$ and $H(μ)+1$. We consider an extension of this game where at most $k$ questions can be answered falsely. We extend the classical result by showing that an optimal strategy uses roughly $H(μ) + k H_2(μ)$ questions, where $H_2(μ) = \sum_x μ(x)\log\log\frac{1}{μ(x)}$. This also generalizes a result by Rivest et al. for the uniform distribution. Moreover, we design near optimal strategies that only use comparison queries of the form `$x \leq c$?' for $c\in[n]$. The usage of comparison queries lends itself naturally to the context of sorting, where we derive sorting algorithms in the presence of adversarial noise.
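As a rough numerical illustration of the quantities named in this abstract, the sketch below evaluates $H(μ)$, $H_2(μ)$ and the leading-order estimate $H(μ) + k H_2(μ)$ for a given distribution. The function names are hypothetical, base-2 logarithms are an assumption (the abstract does not state the base), and the estimate ignores the lower-order terms hidden behind "roughly".

```python
import math


def entropy(mu):
    """Shannon entropy H(mu) in bits (base 2 is an assumption)."""
    return sum(p * math.log2(1 / p) for p in mu if p > 0)


def second_order_entropy(mu):
    """H_2(mu) = sum_x mu(x) * log log (1 / mu(x)).

    Assumes every positive probability is strictly below 1;
    a point mass makes the inner logarithm zero and the outer one undefined.
    """
    return sum(p * math.log2(math.log2(1 / p)) for p in mu if p > 0)


def approx_questions(mu, k):
    """Leading-order estimate H(mu) + k * H_2(mu) for at most k false answers."""
    return entropy(mu) + k * second_order_entropy(mu)


# Uniform distribution over 16 items, one permitted lie.
uniform16 = [1 / 16] * 16
print(entropy(uniform16), second_order_entropy(uniform16), approx_questions(uniform16, 1))
```

For the uniform distribution over $n$ items this reduces to $\log n + k \log\log n$, the shape of bound the abstract attributes to Rivest et al.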

Submitted 6 November, 2018; originally announced November 2018.

arXiv:1803.10415 [pdf, other]

A Better Resource Allocation Algorithm with Semi-Bandit Feedback

Authors: Yuval Dagan, Koby Crammer

Abstract: We study a sequential resource allocation problem between a fixed number of arms. On each iteration the algorithm distributes a resource among the arms in order to maximize the expected success rate. Allocating more of the resource to a given arm increases the probability that it succeeds, yet with a cut-off. We follow Lattimore et al. (2014) and assume that the probability increases linearly until it equals one, after which allocating more of the resource is wasteful. These cut-off values are fixed and unknown to the learner. We present an algorithm for this problem and prove a regret upper bound of $O(\log n)$ improving over the best known bound of $O(\log^2 n)$. Lower bounds we prove show that our upper bound is tight. Simulations demonstrate the superiority of our algorithm.

Submitted 28 March, 2018; originally announced March 2018.

arXiv:1803.01420 [pdf, ps, other]

Detecting Correlations with Little Memory and Communication

Authors: Yuval Dagan, Ohad Shamir

Abstract: We study the problem of identifying correlations in multivariate data, under information constraints: Either on the amount of memory that can be used by the algorithm, or the amount of communication when the data is distributed across several machines. We prove a tight trade-off between the memory/communication complexity and the sample complexity, implying (for example) that to detect pairwise correlations with optimal sample complexity, the number of required memory/communication bits is at least quadratic in the dimension. Our results substantially improve those of Shamir [2014], which studied a similar question in a much more restricted setting. To the best of our knowledge, these are the first provable sample/memory/communication trade-offs for a practical estimation problem, using standard distributions, and in the natural regime where the memory/communication budget is larger than the size of a single data point. To derive our theorems, we prove a new information-theoretic result, which may be relevant for studying other information-constrained learning problems.

Submitted 6 June, 2018; v1 submitted 4 March, 2018; originally announced March 2018.

Comments: Accepted for presentation at Conference on Learning Theory (COLT) 2018. Changes: Added a comparison to Raz [2016]; Corrected typos; Added references

arXiv:1611.06650 [pdf, ps, other]

Trading information complexity for error

Authors: Yuval Dagan, Yuval Filmus, Hamed Hatami, Yaqiao Li

Abstract: We consider the standard two-party communication model. The central problem studied in this article is how much one can save in information complexity by allowing an error of $ε$. For arbitrary functions, we obtain lower bounds and upper bounds indicating a gain that is of order $Ω(h(ε))$ and $O(h(\sqrt{ε}))$. Here $h$ denotes the binary entropy function. We analyze the case of the two-bit AND function in detail to show that for this function the gain is $Θ(h(ε))$. This answers a question of [M. Braverman, A. Garg, D. Pankratov, and O. Weinstein, From information to exact communication (extended abstract), STOC'13]. We obtain sharp bounds for the set disjointness function of order $n$. For the case of the distributional error, we introduce a new protocol that achieves a gain of $Θ(\sqrt{h(ε)})$ provided that $n$ is sufficiently large. We apply these results to answer another question of Braverman et al. regarding the randomized communication complexity of the set disjointness function. Answering a question of [Mark Braverman, Interactive information complexity, STOC'12], we apply our analysis of the set disjointness function to establish a gap between the two different notions of the prior-free information cost. This implies that amortized randomized communication complexity is not necessarily equal to the amortized distributional communication complexity with respect to the hardest distribution.

Submitted 21 November, 2016; originally announced November 2016.

arXiv:1611.01655 [pdf, ps, other]

Twenty (simple) questions

Authors: Yuval Dagan, Yuval Filmus, Ariel Gabizon, Shay Moran

Abstract: A basic combinatorial interpretation of Shannon's entropy function is via the "20 questions" game. This cooperative game is played by two players, Alice and Bob: Alice picks a distribution $π$ over the numbers $\{1,\ldots,n\}$, and announces it to Bob. She then chooses a number $x$ according to $π$, and Bob attempts to identify $x$ using as few Yes/No queries as possible, on average. An optimal strategy for the "20 questions" game is given by a Huffman code for $π$: Bob's questions reveal the codeword for $x$ bit by bit. This strategy finds $x$ using fewer than $H(π)+1$ questions on average. However, the questions asked by Bob could be arbitrary. In this paper, we investigate the following question: Are there restricted sets of questions that match the performance of Huffman codes, either exactly or approximately? Our first main result shows that for every distribution $π$, Bob has a strategy that uses only questions of the form "$x < c$?" and "$x = c$?", and uncovers $x$ using at most $H(π)+1$ questions on average, matching the performance of Huffman codes in this sense. We also give a natural set of $O(rn^{1/r})$ questions that achieve a performance of at most $H(π)+r$, and show that $Ω(rn^{1/r})$ questions are required to achieve such a guarantee. Our second main result gives a set $\mathcal{Q}$ of $1.25^{n+o(n)}$ questions such that for every distribution $π$, Bob can implement an optimal strategy for $π$ using only questions from $\mathcal{Q}$. We also show that $1.25^{n-o(n)}$ questions are needed, for infinitely many $n$. If we allow a small slack of $r$ over the optimal strategy, then roughly $(rn)^{Θ(1/r)}$ questions are necessary and sufficient.
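The Huffman baseline mentioned in this abstract can be checked numerically. The sketch below is a minimal illustration, not the paper's construction: it computes the expected number of questions when Bob reveals a Huffman codeword for $x$ bit by bit, and compares it with $H(π)$. The function names are hypothetical, and the restricted question sets studied in the paper are not modelled here.

```python
import heapq
import math


def entropy(pi):
    """Shannon entropy H(pi) in bits."""
    return sum(p * math.log2(1 / p) for p in pi if p > 0)


def huffman_expected_length(pi):
    """Expected codeword length of a Huffman code for pi.

    Each merge of the two lightest subtrees adds one bit to every
    symbol beneath them, so the expected length equals the sum of
    the merged weights over all merges.
    """
    heap = [p for p in pi if p > 0]
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total


pi = [0.5, 0.25, 0.125, 0.125]
h = entropy(pi)
avg_questions = huffman_expected_length(pi)
print(h, avg_questions)
```

For a dyadic distribution such as this one the Huffman cost coincides with $H(π)$; in general it lies in the interval $[H(π), H(π)+1)$.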

Submitted 25 April, 2017; v1 submitted 5 November, 2016; originally announced November 2016.

Comments: 33 pages; to appear in STOC 2017

Showing 1–25 of 25 results for author: Dagan, Y