Search | arXiv e-print repository

Exponential Stochastic Inequality

Authors: Peter D. Grünwald, Muriel F. Pérez-Ortiz, Zakaria Mhammedi

Abstract: We develop the concept of exponential stochastic inequality (ESI), a novel notation that simultaneously captures high-probability and in-expectation statements. It is especially well suited to succinctly state, prove, and reason about excess-risk and generalization bounds in statistical learning, specifically, but not restricted to, the PAC-Bayesian type. We show that the ESI satisfies transitivity and other properties which allow us to use it like standard, nonstochastic inequalities. We substantially extend the original definition from Koolen et al. (2016) and show that general ESIs satisfy a host of useful additional properties, including a novel Markov-like inequality. We show how ESIs relate to, and clarify, PAC-Bayesian bounds, subcentered subgamma random variables and *fast-rate conditions* such as the central and Bernstein conditions. We also show how the ideas can be extended to random scaling factors (learning rates).

Submitted 27 April, 2023; originally announced April 2023.

arXiv:2302.11401 [pdf, other]

Safe Sequential Testing and Effect Estimation in Stratified Count Data

Authors: Rosanne J. Turner, Peter D. Grünwald

Abstract: Sequential decision making significantly speeds up research and is more cost-effective compared to fixed-n methods. We present a method for sequential decision making for stratified count data that retains Type-I error guarantee or false discovery rate under optional stopping, using e-variables. We invert the method to construct stratified anytime-valid confidence sequences, where cross-talk between subpopulations in the data can be allowed during data collection to improve power. Finally, we combine information collected in separate subpopulations through pseudo-Bayesian averaging and switching to create effective estimates for the minimal, mean and maximal treatment effects in the subpopulations.

Submitted 22 February, 2023; originally announced February 2023.

Comments: Preprint, to be published in the Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023, Valencia, Spain. PMLR: Volume 206

arXiv:2202.04513 [pdf, ps, other]

doi 10.1007/s11229-021-03233-1

The no-free-lunch theorems of supervised learning

Authors: Tom F. Sterkenburg, Peter D. Grünwald

Abstract: The no-free-lunch theorems promote a skeptical conclusion that all possible machine learning algorithms equally lack justification. But how could this leave room for a learning theory, that shows that some algorithms are better than others? Drawing parallels to the philosophy of induction, we point out that the no-free-lunch results presuppose a conception of learning algorithms as purely data-driven. On this conception, every algorithm must have an inherent inductive bias, that wants justification. We argue that many standard learning algorithms should rather be understood as model-dependent: in each application they also require for input a model, representing a bias. Generic algorithms themselves, they can be given a model-relative justification.

Submitted 9 February, 2022; originally announced February 2022.

Journal ref: Synthese 199:9979-10015 (2021)

arXiv:1905.13494 [pdf, other]

doi 10.12688/f1000research.19375.1

Accumulation Bias in Meta-Analysis: The Need to Consider Time in Error Control

Authors: Judith ter Schure, Peter D. Grünwald

Abstract: Studies accumulate over time and meta-analyses are mainly retrospective. These two characteristics introduce dependencies between the analysis time, at which a series of studies is up for meta-analysis, and results within the series. Dependencies introduce bias --- Accumulation Bias --- and invalidate the sampling distribution assumed for p-value tests, thus inflating type-I errors. But dependencies are also inevitable, since for science to accumulate efficiently, new research needs to be informed by past results. Here, we investigate various ways in which time influences error control in meta-analysis testing. We introduce an Accumulation Bias Framework that allows us to model a wide variety of practically occurring dependencies, including study series accumulation, meta-analysis timing, and approaches to multiple testing in living systematic reviews. The strength of this framework is that it shows how all dependencies affect p-value-based tests in a similar manner. This leads to two main conclusions. First, Accumulation Bias is inevitable, and even if it can be approximated and accounted for, no valid p-value tests can be constructed. Second, tests based on likelihood ratios withstand Accumulation Bias: they provide bounds on error probabilities that remain valid despite the bias. We leave the reader with a choice between two proposals to consider time in error control: either treat individual (primary) studies and meta-analyses as two separate worlds --- each with their own timing --- or integrate individual studies in the meta-analysis world. Taking up likelihood ratios in either approach allows for valid tests that relate well to the accumulating nature of scientific knowledge. Likelihood ratios can be interpreted as betting profits, earned in previous studies and invested in new ones, while the meta-analyst is allowed to cash out at any time and advise against future studies.

Submitted 31 May, 2019; originally announced May 2019.

Comments: Soon to be published at F1000 Research

arXiv:1905.13367 [pdf, ps, other]

PAC-Bayes Un-Expected Bernstein Inequality

Authors: Zakaria Mhammedi, Peter D. Grunwald, Benjamin Guedj

Abstract: We present a new PAC-Bayesian generalization bound. Standard bounds contain a $\sqrt{L_n \cdot \mathrm{KL}/n}$ complexity term which dominates unless $L_n$, the empirical error of the learning algorithm's randomized predictions, vanishes. We manage to replace $L_n$ by a term which vanishes in many more situations, essentially whenever the employed learning algorithm is sufficiently stable on the dataset at hand. Our new bound consistently beats state-of-the-art bounds both on a toy example and on UCI datasets (with large enough $n$). Theoretically, unlike existing bounds, our new bound can be expected to converge to $0$ faster whenever a Bernstein/Tsybakov condition holds, thus connecting PAC-Bayesian generalization and *excess risk* bounds---for the latter it has long been known that faster convergence can be obtained under Bernstein conditions. Our main technical tool is a new concentration inequality which is like Bernstein's but with $X^2$ taken outside its expectation.

Submitted 3 November, 2019; v1 submitted 30 May, 2019; originally announced May 2019.

Comments: 24 pages, 6 figures. To Appear in NeurIPS2019

Journal ref: NeurIPS 2019

arXiv:1710.07732 [pdf, other]

A Tight Excess Risk Bound via a Unified PAC-Bayesian-Rademacher-Shtarkov-MDL Complexity

Authors: Peter D. Grünwald, Nishant A. Mehta

Abstract: We present a novel notion of complexity that interpolates between and generalizes some classic existing complexity notions in learning theory: for estimators like empirical risk minimization (ERM) with arbitrary bounded losses, it is upper bounded in terms of data-independent Rademacher complexity; for generalized Bayesian estimators, it is upper bounded by the data-dependent information complexity (also known as stochastic or PAC-Bayesian, $\mathrm{KL}(\text{posterior} \,\|\, \text{prior})$ complexity). For (penalized) ERM, the new complexity reduces to (generalized) normalized maximum likelihood (NML) complexity, i.e. a minimax log-loss individual-sequence regret. Our first main result bounds excess risk in terms of the new complexity. Our second main result links the new complexity via Rademacher complexity to $L_2(P)$ entropy, thereby generalizing earlier results of Opper, Haussler, Lugosi, and Cesa-Bianchi who did the log-loss case with $L_\infty$. Together, these results recover optimal bounds for VC- and large (polynomial entropy) classes, replacing localized Rademacher complexity by a simpler analysis which almost completely separates the two aspects that determine the achievable rates: 'easiness' (Bernstein) conditions and model complexity.

Submitted 20 October, 2017; originally announced October 2017.

Comments: 38 pages

arXiv:1708.08278 [pdf, other]

doi 10.3758/s13423-020-01803-x

Why optional stopping can be a problem for Bayesians

Authors: Rianne de Heide, Peter D. Grünwald

Abstract: Recently, optional stopping has been a subject of debate in the Bayesian psychology community. Rouder (2014) argues that optional stopping is no problem for Bayesians, and even recommends the use of optional stopping in practice, as do Wagenmakers et al. (2012). This article addresses the question whether optional stopping is problematic for Bayesian methods, and specifies under which circumstances and in which sense it is and is not. By slightly varying and extending Rouder's (2014) experiments, we illustrate that, as soon as the parameters of interest are equipped with default or pragmatic priors - which means, in most practical applications of Bayes factor hypothesis testing - resilience to optional stopping can break down. We distinguish between three types of default priors, each having their own specific issues with optional stopping, ranging from no-problem-at-all (Type 0 priors) to quite severe (Type II priors).

Submitted 25 March, 2021; v1 submitted 28 August, 2017; originally announced August 2017.

Comments: Replacement of Figures 7a-7d in the appendix. There was a mistake in the sampling plan. Thanks to Jorge Tendeiro for pointing this out. Replaced the main text with the final (published) version. Psychonomic Bulletin & Review 2020 Advance Publication

arXiv:1605.00252 [pdf, other]

Fast Rates for General Unbounded Loss Functions: from ERM to Generalized Bayes

Authors: Peter D. Grünwald, Nishant A. Mehta

Abstract: We present new excess risk bounds for general unbounded loss functions including log loss and squared loss, where the distribution of the losses may be heavy-tailed. The bounds hold for general estimators, but they are optimized when applied to $η$-generalized Bayesian, MDL, and empirical risk minimization estimators. In the case of log loss, the bounds imply convergence rates for generalized Bayesian inference under misspecification in terms of a generalization of the Hellinger metric as long as the learning rate $η$ is set correctly. For general loss functions, our bounds rely on two separate conditions: the $v$-GRIP (generalized reversed information projection) conditions, which control the lower tail of the excess loss; and the newly introduced witness condition, which controls the upper tail. The parameter $v$ in the $v$-GRIP conditions determines the achievable rate and is akin to the exponent in the Tsybakov margin condition and the Bernstein condition for bounded losses, which the $v$-GRIP conditions generalize; favorable $v$ in combination with small model complexity leads to $\tilde{O}(1/n)$ rates. The witness condition allows us to connect the excess risk to an "annealed" version thereof, by which we generalize several previous results connecting Hellinger and Rényi divergence to KL divergence.

Submitted 5 November, 2019; v1 submitted 1 May, 2016; originally announced May 2016.

Comments: accepted to JMLR pending minor final modifications

arXiv:1512.03223 [pdf, other]

doi 10.1016/j.ijar.2016.03.001

Robust Probability Updating

Authors: Thijs van Ommen, Wouter M. Koolen, Thijs E. Feenstra, Peter D. Grünwald

Abstract: This paper discusses an alternative to conditioning that may be used when the probability distribution is not fully specified. It does not require any assumptions (such as CAR: coarsening at random) on the unknown distribution. The well-known Monty Hall problem is the simplest scenario where neither naive conditioning nor the CAR assumption suffice to determine an updated probability distribution. This paper thus addresses a generalization of that problem to arbitrary distributions on finite outcome spaces, arbitrary sets of 'messages', and (almost) arbitrary loss functions, and provides existence and characterization theorems for robust probability updating strategies. We find that for logarithmic loss, optimality is characterized by an elegant condition, which we call RCAR (reverse coarsening at random). Under certain conditions, the same condition also characterizes optimality for a much larger class of loss functions, and we obtain an objective and general answer to how one should update probabilities in the light of new information.

Submitted 2 May, 2016; v1 submitted 10 December, 2015; originally announced December 2015.

Comments: 47 pages, 4 figures. This second version is the accepted manuscript: it incorporates reviewer comments and has a new title

Journal ref: International Journal of Approximate Reasoning 74 (2016) 30-57

arXiv:1507.02592 [pdf, other]

Fast rates in statistical and online learning

Authors: Tim van Erven, Peter D. Grünwald, Nishant A. Mehta, Mark D. Reid, Robert C. Williamson

Abstract: The speed with which a learning algorithm converges as it is presented with more data is a central problem in machine learning --- a fast rate of convergence means less data is needed for the same level of performance. The pursuit of fast rates in online and statistical learning has led to the discovery of many conditions in learning theory under which fast learning is possible. We show that most of these conditions are special cases of a single, unifying condition, that comes in two forms: the central condition for 'proper' learning algorithms that always output a hypothesis in the given model, and stochastic mixability for online algorithms that may make predictions outside of the model. We show that under surprisingly weak assumptions both conditions are, in a certain sense, equivalent. The central condition has a re-interpretation in terms of convexity of a set of pseudoprobabilities, linking it to density estimation under misspecification. For bounded losses, we show how the central condition enables a direct proof of fast rates and we prove its equivalence to the Bernstein condition, itself a generalization of the Tsybakov margin condition, both of which have played a central role in obtaining fast rates in statistical learning. Yet, while the Bernstein condition is two-sided, the central condition is one-sided, making it more suitable to deal with unbounded losses. In its stochastic mixability form, our condition generalizes both a stochastic exp-concavity condition identified by Juditsky, Rigollet and Tsybakov and Vovk's notion of mixability. Our unifying conditions thus provide a substantial step towards a characterization of fast rates in statistical learning, similar to how classical mixability characterizes constant regret in the sequential prediction with expert advice setting.

Submitted 1 September, 2015; v1 submitted 9 July, 2015; originally announced July 2015.

Comments: 69 pages, 3 figures

Journal ref: Journal of Machine Learning Research 16(54):1793-1861, 2015

arXiv:1407.7190 [pdf]

A Game-Theoretic Analysis of Updating Sets of Probabilities

Authors: Peter D. Grunwald, Joseph Y. Halpern

Abstract: We consider how an agent should update her uncertainty when it is represented by a set P of probability distributions and the agent observes that a random variable X takes on value x, given that the agent makes decisions using the minimax criterion, perhaps the best-studied and most commonly-used criterion in the literature. We adopt a game-theoretic framework, where the agent plays against a bookie, who chooses some distribution from P. We consider two reasonable games that differ in what the bookie knows when he makes his choice. Anomalies that have been observed before, like time inconsistency, can be understood as arising because different games are being played, against bookies with different information. We characterize the important special cases in which the optimal decision rules according to the minimax criterion amount to either conditioning or simply ignoring the information. Finally, we consider the relationship between conditioning and calibration when uncertainty is described by sets of probabilities.

Submitted 27 July, 2014; originally announced July 2014.

Comments: Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI2008)

Report number: UAI-P-2008-PG-240-247

arXiv:1407.7188 [pdf]

When Ignorance is Bliss

Authors: Peter D. Grunwald, Joseph Y. Halpern

Abstract: It is commonly-accepted wisdom that more information is better, and that information should never be ignored. Here we argue, using both a Bayesian and a non-Bayesian analysis, that in some situations you are better off ignoring information if your uncertainty is represented by a set of probability measures. These include situations in which the information is relevant for the prediction task at hand. In the non-Bayesian analysis, we show how ignoring information avoids dilation, the phenomenon that additional pieces of information sometimes lead to an increase in uncertainty. In the Bayesian analysis, we show that for small sample sizes and certain prediction tasks, the Bayesian posterior based on a noninformative prior yields worse predictions than simply ignoring the given information.

Submitted 27 July, 2014; originally announced July 2014.

Comments: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

Report number: UAI-P-2004-PG-226-234

arXiv:1407.7183 [pdf]

Updating Probabilities

Authors: Peter D. Grunwald, Joseph Y. Halpern

Abstract: As examples such as the Monty Hall puzzle show, applying conditioning to update a probability distribution on a "naive space", which does not take into account the protocol used, can often lead to counterintuitive results. Here we examine why. A criterion known as CAR (coarsening at random) in the statistical literature characterizes when "naive" conditioning in a naive space works. We show that the CAR condition holds rather infrequently. We then consider more generalized notions of update such as Jeffrey conditioning and minimizing relative entropy (MRE). We give a generalization of the CAR condition that characterizes when Jeffrey conditioning leads to appropriate answers, but show that there are no such conditions for MRE. This generalizes and interconnects previous results obtained in the literature on CAR and MRE.

Submitted 27 July, 2014; originally announced July 2014.

Comments: Appears in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI2002)

Report number: UAI-P-2002-PG-187-196

arXiv:1401.3906 [pdf]

doi 10.1613/jair.3374

Making Decisions Using Sets of Probabilities: Updating, Time Consistency, and Calibration

Authors: Peter D Grunwald, Joseph Y Halpern

Abstract: We consider how an agent should update her beliefs when her beliefs are represented by a set P of probability distributions, given that the agent makes decisions using the minimax criterion, perhaps the best-studied and most commonly-used criterion in the literature. We adopt a game-theoretic framework, where the agent plays against a bookie, who chooses some distribution from P. We consider two reasonable games that differ in what the bookie knows when he makes his choice. Anomalies that have been observed before, like time inconsistency, can be understood as arising because different games are being played, against bookies with different information. We characterize the important special cases in which the optimal decision rules according to the minimax criterion amount to either conditioning or simply ignoring the information. Finally, we consider the relationship between updating and calibration when uncertainty is described by sets of probabilities. Our results emphasize the key role of the rectangularity condition of Epstein and Schneider.

Submitted 16 January, 2014; originally announced January 2014.

Journal ref: Journal Of Artificial Intelligence Research, Volume 42, pages 393-426, 2011

arXiv:1301.7378 [pdf]

Minimum Encoding Approaches for Predictive Modeling

Authors: Peter D Grunwald, Petri Kontkanen, Petri Myllymaki, Tomi Silander, Henry Tirri

Abstract: We analyze differences between two information-theoretically motivated approaches to statistical inference and model selection: the Minimum Description Length (MDL) principle, and the Minimum Message Length (MML) principle. Based on this analysis, we present two revised versions of MML: a pointwise estimator which gives the MML-optimal single parameter model, and a volumewise estimator which gives the MML-optimal region in the parameter space. Our empirical results suggest that with small data sets, the MDL approach yields more accurate predictions than the MML estimators. The empirical results also demonstrate that the revised MML estimators introduced here perform better than the original MML estimator suggested by Wallace and Freeman.

Submitted 30 January, 2013; originally announced January 2013.

Comments: Appears in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI1998)

Report number: UAI-P-1998-PG-183-192

arXiv:1301.3860 [pdf]

Maximum Entropy and the Glasses You Are Looking Through

Authors: Peter D. Grunwald

Abstract: We give an interpretation of the Maximum Entropy (MaxEnt) Principle in game-theoretic terms. Based on this interpretation, we make a formal distinction between different ways of *applying* Maximum Entropy distributions. MaxEnt has frequently been criticized on the grounds that it leads to highly representation dependent results. Our distinction allows us to avoid this problem in many cases.

Submitted 16 January, 2013; originally announced January 2013.

Comments: Appears in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI2000)

Report number: UAI-P-2000-PG-238-246

arXiv:1301.0534 [pdf, ps, other]

Follow the Leader If You Can, Hedge If You Must

Authors: Steven de Rooij, Tim van Erven, Peter D. Grünwald, Wouter M. Koolen

Abstract: Follow-the-Leader (FTL) is an intuitive sequential prediction strategy that guarantees constant regret in the stochastic setting, but has terrible performance for worst-case data. Other hedging strategies have better worst-case guarantees but may perform much worse than FTL if the data are not maximally adversarial. We introduce the FlipFlop algorithm, which is the first method that provably combines the best of both worlds. As part of our construction, we develop AdaHedge, which is a new way of dynamically tuning the learning rate in Hedge without using the doubling trick. AdaHedge refines a method by Cesa-Bianchi, Mansour and Stoltz (2007), yielding slightly improved worst-case guarantees. By interleaving AdaHedge and FTL, the FlipFlop algorithm achieves regret within a constant factor of the FTL regret, without sacrificing AdaHedge's worst-case guarantees. AdaHedge and FlipFlop do not need to know the range of the losses in advance; moreover, unlike earlier methods, both have the intuitive property that the issued weights are invariant under rescaling and translation of the losses. The losses are also allowed to be negative, in which case they may be interpreted as gains.

Submitted 17 January, 2013; v1 submitted 3 January, 2013; originally announced January 2013.

Comments: under submission

Journal ref: Journal of Machine Learning Research 15(37):1281-1316, 2014

arXiv:1107.6004 [pdf]

doi 10.1109/TIT.2015.2458951

Explicit Bounds for Entropy Concentration under Linear Constraints

Authors: Kostas N. Oikonomou, Peter D. Grunwald

Abstract: Consider the set of all sequences of $n$ outcomes, each taking one of $m$ values, that satisfy a number of linear constraints. If $m$ is fixed while $n$ increases, most sequences that satisfy the constraints result in frequency vectors whose entropy approaches that of the maximum entropy vector satisfying the constraints. This well-known "entropy concentration" phenomenon underlies the maximum entropy method. Existing proofs of the concentration phenomenon are based on limits or asymptotics and unrealistically assume that constraints hold precisely, supporting maximum entropy inference more in principle than in practice. We present, for the first time, non-asymptotic, explicit lower bounds on $n$ for a number of variants of the concentration result to hold to any prescribed accuracies, with the constraints holding up to any specified tolerance, taking into account the fact that allocations of discrete units can satisfy constraints only approximately. Again unlike earlier results, we measure concentration not by deviation from the maximum entropy value, but by the $\ell_1$ and $\ell_2$ distances from the maximum entropy-achieving frequency vector. One of our results holds independently of the alphabet size $m$ and is based on a novel proof technique using the multi-dimensional Berry-Esseen theorem. We illustrate and compare our results using various detailed examples.

Submitted 30 September, 2015; v1 submitted 29 July, 2011; originally announced July 2011.

Comments: 1) An error affecting sec. 3 has been corrected: the parameters delta and theta cannot be chosen independently. Sec. 3 has been revised up to Theorem 3.15 in sec. 3.6. 2) Some minor updates in sec. 4. 3) Some proofs used in both sec. 3 and sec. 4 have been unified (This version to appear in IEEE Transactions on Information Theory, December 2015)

arXiv:0811.0683 [pdf, ps, other]

doi 10.1214/07-AOS532

An Algorithmic and a geometric characterization of coarsening at random

Authors: Richard D. Gill, Peter D. Grünwald

Abstract: We show that the class of conditional distributions satisfying the coarsening at random (CAR) property for discrete data has a simple and robust algorithmic description based on randomized uniform multicovers: combinatorial objects generalizing the notion of partition of a set. However, the complexity of a given CAR mechanism can be large: the maximal "height" of the needed multicovers can be exponential in the number of points in the sample space. The results stem from a geometric interpretation of the set of CAR distributions as a convex polytope and a characterization of its extreme points. The hierarchy of CAR models defined in this way could be useful in parsimonious statistical modeling of CAR mechanisms, though the results also raise doubts in applied work as to the meaningfulness of the CAR assumption in its full generality.

Submitted 5 November, 2008; originally announced November 2008.

Comments: Published at http://dx.doi.org/10.1214/07-AOS532 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: Note accidental duplicate submission arXiv:0811.0683 MSC Class: 62A01 (Primary); 62N01 (Secondary)

Journal ref: Annals of Statistics 2008, Vol. 36, No. 5, 2409-2422

arXiv:0809.2754 [pdf, ps, other]

Algorithmic information theory

Authors: Peter D. Grunwald, Paul M. B. Vitanyi

Abstract: We introduce algorithmic information theory, also known as the theory of Kolmogorov complexity. We explain the main concepts of this quantitative approach to defining `information'. We discuss the extent to which Kolmogorov's and Shannon's information theory have a common purpose, and where they are fundamentally different. We indicate how recent developments within the theory allow one to formally distinguish between `structural' (meaningful) and `random' information as measured by the Kolmogorov structure function, which leads to a mathematical formalization of Occam's razor in inductive inference. We end by discussing some of the philosophical implications of the theory.

Submitted 17 September, 2008; v1 submitted 16 September, 2008; originally announced September 2008.

Comments: 37 pages, 2 figures, pdf, in: Philosophy of Information, P. Adriaans and J. van Benthem, Eds., A volume in Handbook of the philosophy of science, D. Gabbay, P. Thagard, and J. Woods, Eds., Elsevier, 2008. In version 1 of September 16 the refs are missing. Corrected in version 2 of September 17

arXiv:0711.3235 [pdf, ps, other]

A Game-Theoretic Analysis of Updating Sets of Probabilities

Authors: Peter D. Grunwald, Joseph Y. Halpern

Abstract: We consider how an agent should update her uncertainty when it is represented by a set $\mathcal{P}$ of probability distributions and the agent observes that a random variable $X$ takes on value $x$, given that the agent makes decisions using the minimax criterion, perhaps the best-studied and most commonly-used criterion in the literature. We adopt a game-theoretic framework, where the agent plays against a bookie, who chooses some distribution from $\mathcal{P}$. We consider two reasonable games that differ in what the bookie knows when he makes his choice. Anomalies that have been observed before, like time inconsistency, can be understood as arising because different games are being played, against bookies with different information. We characterize the important special cases in which the optimal decision rules according to the minimax criterion amount to either conditioning or simply ignoring the information. Finally, we consider the relationship between conditioning and calibration when uncertainty is described by sets of probabilities.

Submitted 20 November, 2007; originally announced November 2007.

ACM Class: I.2.4

arXiv:math/0510276 [pdf, ps, other]

doi 10.1214/07-AOS532

An algorithmic and a geometric characterization of Coarsening At Random

Authors: Richard D. Gill, Peter D. Grunwald

Abstract: We show that the class of conditional distributions satisfying the coarsening at random (CAR) property for discrete data has a simple and robust algorithmic description based on randomized uniform multicovers: combinatorial objects generalizing the notion of partition of a set. However, the complexity of a given CAR mechanism can be large: the maximal "height" of the needed multicovers can be exponential in the number of points in the sample space. The results stem from a geometric interpretation of the set of CAR distributions as a convex polytope and a characterization of its extreme points. The hierarchy of CAR models defined in this way could be useful in parsimonious statistical modelling of CAR mechanisms, though the results also raise doubts in applied work as to the meaningfulness of the CAR assumption in its full generality.

Submitted 13 September, 2007; v1 submitted 13 October, 2005; originally announced October 2005.

Comments: 16 pages; accepted in this form for publication by Annals of Statistics

Report number: See also 0811.0683 (duplicate submission) MSC Class: 62A01 (Primary); 62N01; 60A99; 68T37 (Secondary)

Journal ref: The Annals of Statistics 2008, Vol. 36, No. 5, 2409-2422

arXiv:cs/0510080 [pdf, ps, other]

When Ignorance is Bliss

Authors: Peter D. Grunwald, Joseph Y. Halpern

Abstract: It is commonly-accepted wisdom that more information is better, and that information should never be ignored. Here we argue, using both a Bayesian and a non-Bayesian analysis, that in some situations you are better off ignoring information if your uncertainty is represented by a set of probability measures. These include situations in which the information is relevant for the prediction task at hand. In the non-Bayesian analysis, we show how ignoring information avoids dilation, the phenomenon that additional pieces of information sometimes lead to an increase in uncertainty. In the Bayesian analysis, we show that for small sample sizes and certain prediction tasks, the Bayesian posterior based on a noninformative prior yields worse predictions than simply ignoring the given information.

Submitted 25 October, 2005; originally announced October 2005.

Comments: In Proceedings of the Twentieth Conference on Uncertainty in AI, 2004, pp. 226-234

ACM Class: I.2.4

arXiv:math/0410076 [pdf, ps, other]

doi 10.1214/009053604000000553

Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory

Authors: Peter D. Grunwald, A. Philip Dawid

Abstract: We describe and develop a close relationship between two problems that have customarily been regarded as distinct: that of maximizing entropy, and that of minimizing worst-case expected loss. Using a formulation grounded in the equilibrium theory of zero-sum games between Decision Maker and Nature, these two problems are shown to be dual to each other, the solution to each providing that to the other. Although Topsøe described this connection for the Shannon entropy over 20 years ago, it does not appear to be widely known even in that important special case. We here generalize this theory to apply to arbitrary decision problems and loss functions. We indicate how an appropriate generalized definition of entropy can be associated with such a problem, and we show that, subject to certain regularity conditions, the above-mentioned duality continues to apply in this extended context. This simultaneously provides a possible rationale for maximizing entropy and a tool for finding robust Bayes acts. We also describe the essential identity between the problem of maximizing entropy and that of minimizing a related discrepancy or divergence between distributions. This leads to an extension, to arbitrary discrepancies, of a well-known minimax theorem for the case of Kullback-Leibler divergence (the ``redundancy-capacity theorem'' of information theory). For the important case of families of distributions having certain mean values specified, we develop simple sufficient conditions and methods for identifying the desired solutions.

Submitted 5 October, 2004; originally announced October 2004.

Comments: Published by the Institute of Mathematical Statistics (http://www.imstat.org) in the Annals of Statistics (http://www.imstat.org/aos/) at http://dx.doi.org/10.1214/009053604000000553

Report number: IMS-AOS-AOS231 MSC Class: 62C20 (Primary) 94A17 (Secondary)

Journal ref: Annals of Statistics 2004, Vol. 32, No. 4, 1367-1433

arXiv:cs/0306124 [pdf, ps, other]

Updating Probabilities

Authors: Peter D. Grunwald, Joseph Y. Halpern

Abstract: As examples such as the Monty Hall puzzle show, applying conditioning to update a probability distribution on a ``naive space'', which does not take into account the protocol used, can often lead to counterintuitive results. Here we examine why. A criterion known as CAR (``coarsening at random'') in the statistical literature characterizes when ``naive'' conditioning in a naive space works. We show that the CAR condition holds rather infrequently, and we provide a procedural characterization of it, by giving a randomized algorithm that generates all and only distributions for which CAR holds. This substantially extends previous characterizations of CAR. We also consider more generalized notions of update such as Jeffrey conditioning and minimizing relative entropy (MRE). We give a generalization of the CAR condition that characterizes when Jeffrey conditioning leads to appropriate answers, and show that there exist some very simple settings in which MRE essentially never gives the right results. This generalizes and interconnects previous results obtained in the literature on CAR and MRE.

Submitted 23 June, 2003; originally announced June 2003.

Comments: This is an expanded version of a paper that appeared in Proceedings of the Eighteenth Conference on Uncertainty in AI, 2002, pp. 187--196. To appear in Journal of AI Research

ACM Class: I.2.4

Showing 1–25 of 25 results for author: Grünwald, P D