Universal Reverse Information Projections and Optimal E-statistics

Authors: Tyron Lardy, Peter Grünwald, Peter Harremoës

Abstract: Information projections have found important applications in probability theory, statistics, and related areas. In the field of hypothesis testing in particular, the reverse information projection (RIPr) has recently been shown to lead to so-called growth-rate optimal (GRO) e-statistics for testing simple alternatives against composite null hypotheses. However, the RIPr as well as the GRO criterion are undefined whenever the infimum information divergence between the null and alternative is infinite. We show that in such scenarios there often still exists an element in the alternative that is 'closest' to the null: the universal reverse information projection. The universal reverse information projection and its non-universal counterpart coincide whenever information divergence is finite. Furthermore, the universal RIPr is shown to lead to optimal e-statistics in a sense that is a novel, but natural, extension of the GRO criterion. We also give conditions under which the universal RIPr is a strict sub-probability distribution, as well as conditions under which an approximation of the universal RIPr leads to approximate e-statistics. For this case we provide tight relations between the corresponding approximation rates.

Submitted 4 December, 2023; v1 submitted 28 June, 2023; originally announced June 2023.

Comments: A five-page abstract of this paper, containing a subset of the theorems but no proofs, was presented at ISIT 2023, Taipei

MSC Class: 62B10 (primary); 94A17 (secondary)

arXiv:2210.01948 [pdf, ps, other]

Game-theoretic statistics and safe anytime-valid inference

Authors: Aaditya Ramdas, Peter Grünwald, Vladimir Vovk, Glenn Shafer

Abstract: Safe anytime-valid inference (SAVI) provides measures of statistical evidence and certainty -- e-processes for testing and confidence sequences for estimation -- that remain valid at all stopping times, accommodating continuous monitoring and analysis of accumulating data and optional stopping or continuation for any reason. These measures crucially rely on test martingales, which are nonnegative martingales starting at one. Since a test martingale is the wealth process of a player in a betting game, SAVI centrally employs game-theoretic intuition, language and mathematics. We summarize the SAVI goals and philosophy, and report recent advances in testing composite hypotheses and estimating functionals in nonparametric settings.

Submitted 17 June, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

Comments: 25 pages. Under review. ArXiv does not compile/space some references properly

arXiv:2202.04513 [pdf, ps, other]

doi 10.1007/s11229-021-03233-1

The no-free-lunch theorems of supervised learning

Authors: Tom F. Sterkenburg, Peter D. Grünwald

Abstract: The no-free-lunch theorems promote a skeptical conclusion that all possible machine learning algorithms equally lack justification. But how could this leave room for a learning theory, that shows that some algorithms are better than others? Drawing parallels to the philosophy of induction, we point out that the no-free-lunch results presuppose a conception of learning algorithms as purely data-driven. On this conception, every algorithm must have an inherent inductive bias, that wants justification. We argue that many standard learning algorithms should rather be understood as model-dependent: in each application they also require for input a model, representing a bias. Generic algorithms themselves, they can be given a model-relative justification.

Submitted 9 February, 2022; originally announced February 2022.

Journal ref: Synthese 199:9979-10015 (2021)

arXiv:2201.06487 [pdf, ps, other]

Minimax risk classifiers with 0-1 loss

Authors: Santiago Mazuelas, Mauricio Romero, Peter Grünwald

Abstract: Supervised classification techniques use training samples to learn a classification rule with small expected 0-1 loss (error probability). Conventional methods enable tractable learning and provide out-of-sample generalization by using surrogate losses instead of the 0-1 loss and considering specific families of rules (hypothesis classes). This paper presents minimax risk classifiers (MRCs) that minimize the worst-case 0-1 loss with respect to uncertainty sets of distributions that can include the underlying distribution, with a tunable confidence. We show that MRCs can provide tight performance guarantees at learning and are strongly universally consistent using feature mappings given by characteristic kernels. The paper also proposes efficient optimization techniques for MRC learning and shows that the methods presented can provide accurate classification together with tight performance guarantees in practice.

Submitted 16 August, 2023; v1 submitted 17 January, 2022; originally announced January 2022.

arXiv:2106.09683 [pdf, other]

PAC-Bayes, MAC-Bayes and Conditional Mutual Information: Fast rate bounds that handle general VC classes

Authors: Peter Grünwald, Thomas Steinke, Lydia Zakynthinou

Abstract: We give a novel, unified derivation of conditional PAC-Bayesian and mutual information (MI) generalization bounds. We derive conditional MI bounds as an instance, with special choice of prior, of conditional MAC-Bayesian (Mean Approximately Correct) bounds, itself derived from conditional PAC-Bayesian bounds, where 'conditional' means that one can use priors conditioned on a joint training and ghost sample. This allows us to get nontrivial PAC-Bayes and MI-style bounds for general VC classes, something recently shown to be impossible with standard PAC-Bayesian/MI bounds. Second, it allows us to get faster rates of order $O\left((\mathrm{KL}/n)^{\gamma}\right)$ for $\gamma > 1/2$ if a Bernstein condition holds and for exp-concave losses (with $\gamma = 1$), which is impossible with both standard PAC-Bayes generalization and MI bounds. Our work extends the recent work by Steinke and Zakynthinou [2020] who handle MI with VC but neither PAC-Bayes nor fast rates, the recent work of Hellström and Durisi [2020] who extend the latter to the PAC-Bayes setting via a unifying exponential inequality, and Mhammedi et al. [2019] who initiated fast rate PAC-Bayes generalization error bounds but handle neither MI nor general VC classes.

Submitted 17 June, 2021; originally announced June 2021.

Comments: 24 pages, accepted for publication at COLT 2021

arXiv:2106.02693 [pdf, other]

Generic E-Variables for Exact Sequential k-Sample Tests that allow for Optional Stopping

Authors: Rosanne Turner, Alexander Ly, Peter Grünwald

Abstract: We develop E-variables for testing whether two or more data streams come from the same source or not, and more generally, whether the difference between the sources is larger than some minimal effect size. These E-variables lead to exact, nonasymptotic tests that remain safe, i.e. keep their type-I error guarantees, under flexible sampling scenarios such as optional stopping and continuation. In special cases our E-variables also have an optimal 'growth' property under the alternative. While the construction is generic, we illustrate it through the special case of k x 2 contingency tables, where we also allow for the incorporation of different restrictions on a composite alternative. Comparison to p-value analysis in simulations and a real-world example show that E-variables, through their flexibility, often allow for early stopping of data collection, thereby retaining similar power as classical methods, while also retaining the option of extending or combining data afterwards.

Submitted 22 June, 2022; v1 submitted 4 June, 2021; originally announced June 2021.

arXiv:2103.13686 [pdf, other]

doi 10.1007/s10618-022-00856-x

Robust subgroup discovery

Authors: Hugo Manuel Proença, Peter Grünwald, Thomas Bäck, Matthijs van Leeuwen

Abstract: We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, including traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, finding optimal subgroup lists is NP-hard. Therefore, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. In fact, the greedy gain is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions plus a multiple hypothesis testing penalty. Furthermore, we empirically show on 54 datasets that SSD++ outperforms previous subgroup discovery methods in terms of quality, generalisation on unseen data, and subgroup list size.

Submitted 30 June, 2022; v1 submitted 25 March, 2021; originally announced March 2021.

Comments: For associated code, see https://github.com/HMProenca/RuleList ; submitted to Data Mining and Knowledge Discovery Journal

Journal ref: Data Mining and Knowledge Discovery 36 (2022) 1885-1970

arXiv:2006.09186 [pdf, other]

doi 10.1007/978-3-030-67658-2_2

Discovering outstanding subgroup lists for numeric targets using MDL

Authors: Hugo M. Proença, Peter Grünwald, Thomas Bäck, Matthijs van Leeuwen

Abstract: The task of subgroup discovery (SD) is to find interpretable descriptions of subsets of a dataset that stand out with respect to a target attribute. To address the problem of mining large numbers of redundant subgroups, subgroup set discovery (SSD) has been proposed. State-of-the-art SSD methods have their limitations though, as they typically heavily rely on heuristics and/or user-chosen hyperparameters. We propose a dispersion-aware problem formulation for subgroup set discovery that is based on the minimum description length (MDL) principle and subgroup lists. We argue that the best subgroup list is the one that best summarizes the data given the overall distribution of the target. We restrict our focus to a single numeric target variable and show that our formalization coincides with an existing quality measure when finding a single subgroup, but that, in addition, it allows one to trade off subgroup quality with the complexity of the subgroup. We next propose SSD++, a heuristic algorithm for which we empirically demonstrate that it returns outstanding subgroup lists: non-redundant sets of compact subgroups that stand out by having strongly deviating means and small spread.

Submitted 16 June, 2020; originally announced June 2020.

Comments: Extended version of conference paper at ECML-PKDD

Journal ref: ECML PKDD 2020, LNAI 12457, pp. 19-35, 2021

arXiv:1910.09227 [pdf, other]

Safe-Bayesian Generalized Linear Regression

Authors: Rianne de Heide, Alisa Kirichenko, Nishant Mehta, Peter Grünwald

Abstract: We study generalized Bayesian inference under misspecification, i.e. when the model is 'wrong but useful'. Generalized Bayes equips the likelihood with a learning rate $η$. We show that for generalized linear models (GLMs), $η$-generalized Bayes concentrates around the best approximation of the truth within the model for specific $η\neq 1$, even under severely misspecified noise, as long as the tails of the true distribution are exponential. We derive MCMC samplers for generalized Bayesian lasso and logistic regression and give examples of both simulated and real-world data in which generalized Bayes substantially outperforms standard Bayes.

Submitted 29 May, 2021; v1 submitted 21 October, 2019; originally announced October 2019.

Comments: Final version. Accepted to AISTATS 2020

arXiv:1908.08484 [pdf, ps, other]

doi 10.1142/S2661335219300018

Minimum Description Length Revisited

Authors: Peter Grünwald, Teemu Roos

Abstract: This is an up-to-date introduction to and overview of the Minimum Description Length (MDL) Principle, a theory of inductive inference that can be applied to general problems in statistics, machine learning and pattern recognition. While MDL was originally based on data compression ideas, this introduction can be read without any knowledge thereof. It takes into account all major developments since 2007, the last time an extensive overview was written. These include new methods for model selection and averaging and hypothesis testing, as well as the first completely general definition of MDL estimators. Incorporating these developments, MDL can be seen as a powerful extension of both penalized likelihood and Bayesian approaches, in which penalization functions and prior distributions are replaced by more general luckiness functions, average-case methodology is replaced by a more robust worst-case approach, and in which methods classically viewed as highly distinct, such as AIC vs BIC and cross-validation vs Bayes can, to a large extent, be viewed from a unified perspective.

Submitted 18 December, 2019; v1 submitted 21 August, 2019; originally announced August 2019.

Comments: to appear in International Journal of Mathematics for Industry

arXiv:1906.07801 [pdf, other]

Safe Testing

Authors: Peter Grünwald, Rianne de Heide, Wouter Koolen

Abstract: We develop the theory of hypothesis testing based on the e-value, a notion of evidence that, unlike the p-value, allows for effortlessly combining results from several studies in the common scenario where the decision to perform a new study may depend on previous outcomes. Tests based on e-values are safe, i.e. they preserve Type-I error guarantees, under such optional continuation. We define growth-rate optimality (GRO) as an analogue of power in an optional continuation context, and we show how to construct GRO e-variables for general testing problems with composite null and alternative, emphasizing models with nuisance parameters. GRO e-values take the form of Bayes factors with special priors. We illustrate the theory using several classic examples including a one-sample safe t-test and the 2 x 2 contingency table. Sharing Fisherian, Neymanian and Jeffreys-Bayesian interpretations, e-values may provide a methodology acceptable to adherents of all three schools.

Submitted 10 March, 2023; v1 submitted 18 June, 2019; originally announced June 2019.

Comments: Accepted as discussion paper to the Journal of the Royal Statistical Society series B

arXiv:1905.13367 [pdf, ps, other]

PAC-Bayes Un-Expected Bernstein Inequality

Authors: Zakaria Mhammedi, Peter D. Grunwald, Benjamin Guedj

Abstract: We present a new PAC-Bayesian generalization bound. Standard bounds contain a $\sqrt{L_n \cdot \mathrm{KL}/n}$ complexity term which dominates unless $L_n$, the empirical error of the learning algorithm's randomized predictions, vanishes. We manage to replace $L_n$ by a term which vanishes in many more situations, essentially whenever the employed learning algorithm is sufficiently stable on the dataset at hand. Our new bound consistently beats state-of-the-art bounds both on a toy example and on UCI datasets (with large enough $n$). Theoretically, unlike existing bounds, our new bound can be expected to converge to $0$ faster whenever a Bernstein/Tsybakov condition holds, thus connecting PAC-Bayesian generalization and excess risk bounds; for the latter it has long been known that faster convergence can be obtained under Bernstein conditions. Our main technical tool is a new concentration inequality which is like Bernstein's but with $X^2$ taken outside its expectation.

Submitted 3 November, 2019; v1 submitted 30 May, 2019; originally announced May 2019.

Comments: 24 pages, 6 figures. To Appear in NeurIPS2019

Journal ref: NeurIPS 2019

arXiv:1807.09077 [pdf, ps, other]

doi 10.1214/20-BA1234

Optional Stopping with Bayes Factors: a categorization and extension of folklore results, with an application to invariant situations

Authors: Allard Hendriksen, Rianne de Heide, Peter Grünwald

Abstract: It is often claimed that Bayesian methods, in particular Bayes factor methods for hypothesis testing, can deal with optional stopping. We first give an overview, using elementary probability theory, of three different mathematical meanings that various authors give to this claim: (1) stopping rule independence, (2) posterior calibration and (3) (semi-) frequentist robustness to optional stopping. We then prove theorems to the effect that these claims do indeed hold in a general measure-theoretic setting. For claims of type (2) and (3), such results are new. By allowing for non-integrable measures based on improper priors, we obtain particularly strong results for the practically important case of models with nuisance parameters satisfying a group invariance (such as location or scale). We also discuss the practical relevance of (1)--(3), and conclude that whether Bayes factor methods actually perform well under optional stopping crucially depends on details of models, priors and the goal of the analysis.

Submitted 29 April, 2020; v1 submitted 24 July, 2018; originally announced July 2018.

Comments: 29 pages

arXiv:1710.07732 [pdf, other]

A Tight Excess Risk Bound via a Unified PAC-Bayesian-Rademacher-Shtarkov-MDL Complexity

Authors: Peter D. Grünwald, Nishant A. Mehta

Abstract: We present a novel notion of complexity that interpolates between and generalizes some classic existing complexity notions in learning theory: for estimators like empirical risk minimization (ERM) with arbitrary bounded losses, it is upper bounded in terms of data-independent Rademacher complexity; for generalized Bayesian estimators, it is upper bounded by the data-dependent information complexity (also known as stochastic or PAC-Bayesian, $\mathrm{KL}(\text{posterior} \operatorname{\|} \text{prior})$ complexity). For (penalized) ERM, the new complexity reduces to (generalized) normalized maximum likelihood (NML) complexity, i.e. a minimax log-loss individual-sequence regret. Our first main result bounds excess risk in terms of the new complexity. Our second main result links the new complexity via Rademacher complexity to $L_2(P)$ entropy, thereby generalizing earlier results of Opper, Haussler, Lugosi, and Cesa-Bianchi who did the log-loss case with $L_\infty$. Together, these results recover optimal bounds for VC- and large (polynomial entropy) classes, replacing localized Rademacher complexity by a simpler analysis which almost completely separates the two aspects that determine the achievable rates: 'easiness' (Bernstein) conditions and model complexity.

Submitted 20 October, 2017; originally announced October 2017.

Comments: 38 pages

arXiv:1605.06439 [pdf, ps, other]

Combining Adversarial Guarantees and Stochastic Fast Rates in Online Learning

Authors: Wouter M. Koolen, Peter Grünwald, Tim van Erven

Abstract: We consider online learning algorithms that guarantee worst-case regret rates in adversarial environments (so they can be deployed safely and will perform robustly), yet adapt optimally to favorable stochastic environments (so they will perform well in a variety of settings of practical importance). We quantify the friendliness of stochastic environments by means of the well-known Bernstein (a.k.a. generalized Tsybakov margin) condition. For two recent algorithms (Squint for the Hedge setting and MetaGrad for online convex optimization) we show that the particular form of their data-dependent individual-sequence regret guarantees implies that they adapt automatically to the Bernstein parameters of the stochastic environment. We prove that these algorithms attain fast rates in their respective settings both in expectation and with high probability.

Submitted 20 May, 2016; originally announced May 2016.

Journal ref: Advances in Neural Information Processing Systems 29 (NeurIPS), 4457-4465, 2016

arXiv:1605.00252 [pdf, other]

Fast Rates for General Unbounded Loss Functions: from ERM to Generalized Bayes

Authors: Peter D. Grünwald, Nishant A. Mehta

Abstract: We present new excess risk bounds for general unbounded loss functions including log loss and squared loss, where the distribution of the losses may be heavy-tailed. The bounds hold for general estimators, but they are optimized when applied to $η$-generalized Bayesian, MDL, and empirical risk minimization estimators. In the case of log loss, the bounds imply convergence rates for generalized Bayesian inference under misspecification in terms of a generalization of the Hellinger metric as long as the learning rate $η$ is set correctly. For general loss functions, our bounds rely on two separate conditions: the $v$-GRIP (generalized reversed information projection) conditions, which control the lower tail of the excess loss; and the newly introduced witness condition, which controls the upper tail. The parameter $v$ in the $v$-GRIP conditions determines the achievable rate and is akin to the exponent in the Tsybakov margin condition and the Bernstein condition for bounded losses, which the $v$-GRIP conditions generalize; favorable $v$ in combination with small model complexity leads to $\tilde{O}(1/n)$ rates. The witness condition allows us to connect the excess risk to an "annealed" version thereof, by which we generalize several previous results connecting Hellinger and Rényi divergence to KL divergence.

Submitted 5 November, 2019; v1 submitted 1 May, 2016; originally announced May 2016.

Comments: accepted to JMLR pending minor final modifications

arXiv:1604.01785 [pdf, other]

Safe Probability

Authors: Peter Grünwald

Abstract: We formalize the idea of probability distributions that lead to reliable predictions about some, but not all aspects of a domain. The resulting notion of 'safety' provides a fresh perspective on foundational issues in statistics, providing a middle ground between imprecise probability and multiple-prior models on the one hand and strictly Bayesian approaches on the other. It also allows us to formalize fiducial distributions in terms of the set of random variables that they can safely predict, thus taking some of the sting out of the fiducial idea. By restricting probabilistic inference to safe uses, one also automatically avoids paradoxes such as the Monty Hall problem. Safety comes in a variety of degrees, such as "validity" (the strongest notion), "calibration", "confidence safety" and "unbiasedness" (almost the weakest notion).

Submitted 6 April, 2016; originally announced April 2016.

Comments: Submitted to a journal

MSC Class: 62A01

arXiv:1512.03223 [pdf, other]

doi 10.1016/j.ijar.2016.03.001

Robust Probability Updating

Authors: Thijs van Ommen, Wouter M. Koolen, Thijs E. Feenstra, Peter D. Grünwald

Abstract: This paper discusses an alternative to conditioning that may be used when the probability distribution is not fully specified. It does not require any assumptions (such as CAR: coarsening at random) on the unknown distribution. The well-known Monty Hall problem is the simplest scenario where neither naive conditioning nor the CAR assumption suffice to determine an updated probability distribution. This paper thus addresses a generalization of that problem to arbitrary distributions on finite outcome spaces, arbitrary sets of 'messages', and (almost) arbitrary loss functions, and provides existence and characterization theorems for robust probability updating strategies. We find that for logarithmic loss, optimality is characterized by an elegant condition, which we call RCAR (reverse coarsening at random). Under certain conditions, the same condition also characterizes optimality for a much larger class of loss functions, and we obtain an objective and general answer to how one should update probabilities in the light of new information.

Submitted 2 May, 2016; v1 submitted 10 December, 2015; originally announced December 2015.

Comments: 47 pages, 4 figures. This second version is the accepted manuscript: it incorporates reviewer comments and has a new title

Journal ref: International Journal of Approximate Reasoning 74 (2016) 30-57

arXiv:1507.02592 [pdf, other]

Fast rates in statistical and online learning

Authors: Tim van Erven, Peter D. Grünwald, Nishant A. Mehta, Mark D. Reid, Robert C. Williamson

Abstract: The speed with which a learning algorithm converges as it is presented with more data is a central problem in machine learning --- a fast rate of convergence means less data is needed for the same level of performance. The pursuit of fast rates in online and statistical learning has led to the discovery of many conditions in learning theory under which fast learning is possible. We show that most of these conditions are special cases of a single, unifying condition, that comes in two forms: the central condition for 'proper' learning algorithms that always output a hypothesis in the given model, and stochastic mixability for online algorithms that may make predictions outside of the model. We show that under surprisingly weak assumptions both conditions are, in a certain sense, equivalent. The central condition has a re-interpretation in terms of convexity of a set of pseudoprobabilities, linking it to density estimation under misspecification. For bounded losses, we show how the central condition enables a direct proof of fast rates and we prove its equivalence to the Bernstein condition, itself a generalization of the Tsybakov margin condition, both of which have played a central role in obtaining fast rates in statistical learning. Yet, while the Bernstein condition is two-sided, the central condition is one-sided, making it more suitable to deal with unbounded losses. In its stochastic mixability form, our condition generalizes both a stochastic exp-concavity condition identified by Juditsky, Rigollet and Tsybakov and Vovk's notion of mixability. Our unifying conditions thus provide a substantial step towards a characterization of fast rates in statistical learning, similar to how classical mixability characterizes constant regret in the sequential prediction with expert advice setting.

Submitted 1 September, 2015; v1 submitted 9 July, 2015; originally announced July 2015.

Comments: 69 pages, 3 figures

Journal ref: Journal of Machine Learning Research 16(54):1793-1861, 2015

arXiv:1407.7190 [pdf]

A Game-Theoretic Analysis of Updating Sets of Probabilities

Authors: Peter D. Grunwald, Joseph Y. Halpern

Abstract: We consider how an agent should update her uncertainty when it is represented by a set P of probability distributions and the agent observes that a random variable X takes on value x, given that the agent makes decisions using the minimax criterion, perhaps the best-studied and most commonly-used criterion in the literature. We adopt a game-theoretic framework, where the agent plays against a bookie, who chooses some distribution from P. We consider two reasonable games that differ in what the bookie knows when he makes his choice. Anomalies that have been observed before, like time inconsistency, can be understood as arising because different games are being played, against bookies with different information. We characterize the important special cases in which the optimal decision rules according to the minimax criterion amount to either conditioning or simply ignoring the information. Finally, we consider the relationship between conditioning and calibration when uncertainty is described by sets of probabilities.

Submitted 27 July, 2014; originally announced July 2014.

Comments: Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI2008)

Report number: UAI-P-2008-PG-240-247

arXiv:1407.7188 [pdf]

When Ignorance is Bliss

Authors: Peter D. Grunwald, Joseph Y. Halpern

Abstract: It is commonly-accepted wisdom that more information is better, and that information should never be ignored. Here we argue, using both a Bayesian and a non-Bayesian analysis, that in some situations you are better off ignoring information if your uncertainty is represented by a set of probability measures. These include situations in which the information is relevant for the prediction task at hand. In the non-Bayesian analysis, we show how ignoring information avoids dilation, the phenomenon that additional pieces of information sometimes lead to an increase in uncertainty. In the Bayesian analysis, we show that for small sample sizes and certain prediction tasks, the Bayesian posterior based on a noninformative prior yields worse predictions than simply ignoring the given information.

Submitted 27 July, 2014; originally announced July 2014.

Comments: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

Report number: UAI-P-2004-PG-226-234

arXiv:1407.7183 [pdf]

Updating Probabilities

Authors: Peter D. Grunwald, Joseph Y. Halpern

Abstract: As examples such as the Monty Hall puzzle show, applying conditioning to update a probability distribution on a "naive space", which does not take into account the protocol used, can often lead to counterintuitive results. Here we examine why. A criterion known as CAR (coarsening at random) in the statistical literature characterizes when "naive" conditioning in a naive space works. We show that the CAR condition holds rather infrequently. We then consider more generalized notions of update such as Jeffrey conditioning and minimizing relative entropy (MRE). We give a generalization of the CAR condition that characterizes when Jeffrey conditioning leads to appropriate answers, but show that there are no such conditions for MRE. This generalizes and interconnects previous results obtained in the literature on CAR and MRE.

Submitted 27 July, 2014; originally announced July 2014.

Comments: Appears in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI2002)

Report number: UAI-P-2002-PG-187-196

arXiv:1401.3906 [pdf]

doi 10.1613/jair.3374

Making Decisions Using Sets of Probabilities: Updating, Time Consistency, and Calibration

Authors: Peter D Grunwald, Joseph Y Halpern

Abstract: We consider how an agent should update her beliefs when her beliefs are represented by a set P of probability distributions, given that the agent makes decisions using the minimax criterion, perhaps the best-studied and most commonly-used criterion in the literature. We adopt a game-theoretic framework, where the agent plays against a bookie, who chooses some distribution from P. We consider two reasonable games that differ in what the bookie knows when he makes his choice. Anomalies that have been observed before, like time inconsistency, can be understood as arising because different games are being played, against bookies with different information. We characterize the important special cases in which the optimal decision rules according to the minimax criterion amount to either conditioning or simply ignoring the information. Finally, we consider the relationship between updating and calibration when uncertainty is described by sets of probabilities. Our results emphasize the key role of the rectangularity condition of Epstein and Schneider.

Submitted 16 January, 2014; originally announced January 2014.

Journal ref: Journal Of Artificial Intelligence Research, Volume 42, pages 393-426, 2011

arXiv:1305.4324 [pdf, ps, other]

Horizon-Independent Optimal Prediction with Log-Loss in Exponential Families

Authors: Peter Bartlett, Peter Grunwald, Peter Harremoes, Fares Hedayati, Wojciech Kotlowski

Abstract: We study online learning under logarithmic loss with regular parametric models. Hedayati and Bartlett (2012b) showed that a Bayesian prediction strategy with Jeffreys prior and sequential normalized maximum likelihood (SNML) coincide and are optimal if and only if the latter is exchangeable, and if and only if the optimal strategy can be calculated without knowing the time horizon in advance. They put forward the question of which families have exchangeable SNML strategies. This paper fully answers this open problem for one-dimensional exponential families. The exchangeability can happen only for three classes of natural exponential family distributions, namely the Gaussian, Gamma, and the Tweedie exponential family of order 3/2. Keywords: SNML Exchangeability, Exponential Family, Online Learning, Logarithmic Loss, Bayesian Strategy, Jeffreys Prior, Fisher Information

Submitted 19 May, 2013; originally announced May 2013.

Comments: 23 pages

arXiv:1301.7378 [pdf]

Minimum Encoding Approaches for Predictive Modeling

Authors: Peter D Grunwald, Petri Kontkanen, Petri Myllymaki, Tomi Silander, Henry Tirri

Abstract: We analyze differences between two information-theoretically motivated approaches to statistical inference and model selection: the Minimum Description Length (MDL) principle, and the Minimum Message Length (MML) principle. Based on this analysis, we present two revised versions of MML: a pointwise estimator which gives the MML-optimal single parameter model, and a volumewise estimator which gives the MML-optimal region in the parameter space. Our empirical results suggest that with small data sets, the MDL approach yields more accurate predictions than the MML estimators. The empirical results also demonstrate that the revised MML estimators introduced here perform better than the original MML estimator suggested by Wallace and Freeman.

Submitted 30 January, 2013; originally announced January 2013.

Comments: Appears in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI1998)

Report number: UAI-P-1998-PG-183-192

arXiv:1301.3860 [pdf]

Maximum Entropy and the Glasses You Are Looking Through

Authors: Peter D. Grunwald

Abstract: We give an interpretation of the Maximum Entropy (MaxEnt) Principle in game-theoretic terms. Based on this interpretation, we make a formal distinction between different ways of applying Maximum Entropy distributions. MaxEnt has frequently been criticized on the grounds that it leads to highly representation dependent results. Our distinction allows us to avoid this problem in many cases.

Submitted 16 January, 2013; originally announced January 2013.

Comments: Appears in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI2000)

Report number: UAI-P-2000-PG-238-246

arXiv:1301.0534 [pdf, ps, other]

Follow the Leader If You Can, Hedge If You Must

Authors: Steven de Rooij, Tim van Erven, Peter D. Grünwald, Wouter M. Koolen

Abstract: Follow-the-Leader (FTL) is an intuitive sequential prediction strategy that guarantees constant regret in the stochastic setting, but has terrible performance for worst-case data. Other hedging strategies have better worst-case guarantees but may perform much worse than FTL if the data are not maximally adversarial. We introduce the FlipFlop algorithm, which is the first method that provably combines the best of both worlds. As part of our construction, we develop AdaHedge, which is a new way of dynamically tuning the learning rate in Hedge without using the doubling trick. AdaHedge refines a method by Cesa-Bianchi, Mansour and Stoltz (2007), yielding slightly improved worst-case guarantees. By interleaving AdaHedge and FTL, the FlipFlop algorithm achieves regret within a constant factor of the FTL regret, without sacrificing AdaHedge's worst-case guarantees. AdaHedge and FlipFlop do not need to know the range of the losses in advance; moreover, unlike earlier methods, both have the intuitive property that the issued weights are invariant under rescaling and translation of the losses. The losses are also allowed to be negative, in which case they may be interpreted as gains.

Submitted 17 January, 2013; v1 submitted 3 January, 2013; originally announced January 2013.

Comments: under submission

Journal ref: Journal of Machine Learning Research 15(37):1281-1316, 2014

arXiv:1205.2597

Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (2010)

Authors: Peter Grunwald, Peter Spirtes

Abstract: This is the Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, which was held on Catalina Island, CA, July 8 - 11 2010.

Submitted 28 August, 2014; v1 submitted 11 May, 2012; originally announced May 2012.

Report number: UAI2010

arXiv:1107.6004 [pdf]

doi 10.1109/TIT.2015.2458951

Explicit Bounds for Entropy Concentration under Linear Constraints

Authors: Kostas N. Oikonomou, Peter D. Grunwald

Abstract: Consider the set of all sequences of $n$ outcomes, each taking one of $m$ values, that satisfy a number of linear constraints. If $m$ is fixed while $n$ increases, most sequences that satisfy the constraints result in frequency vectors whose entropy approaches that of the maximum entropy vector satisfying the constraints. This well-known "entropy concentration" phenomenon underlies the maximum entropy method. Existing proofs of the concentration phenomenon are based on limits or asymptotics and unrealistically assume that constraints hold precisely, supporting maximum entropy inference more in principle than in practice. We present, for the first time, non-asymptotic, explicit lower bounds on $n$ for a number of variants of the concentration result to hold to any prescribed accuracies, with the constraints holding up to any specified tolerance, taking into account the fact that allocations of discrete units can satisfy constraints only approximately. Again unlike earlier results, we measure concentration not by deviation from the maximum entropy value, but by the $\ell_1$ and $\ell_2$ distances from the maximum entropy-achieving frequency vector. One of our results holds independently of the alphabet size $m$ and is based on a novel proof technique using the multi-dimensional Berry-Esseen theorem. We illustrate and compare our results using various detailed examples.

Submitted 30 September, 2015; v1 submitted 29 July, 2011; originally announced July 2011.

Comments: 1) An error affecting sec. 3 has been corrected: the parameters delta and theta cannot be chosen independently. Sec. 3 has been revised up to Theorem 3.15 in sec. 3.6. 2) Some minor updates in sec. 4. 3) Some proofs used in both sec. 3 and sec. 4 have been unified (This version to appear in IEEE Transactions on Information Theory, December 2015)

arXiv:1002.0757 [pdf, ps, other]

Prequential Plug-In Codes that Achieve Optimal Redundancy Rates even if the Model is Wrong

Authors: Peter Grünwald, Wojciech Kotłowski

Abstract: We analyse the prequential plug-in codes relative to one-parameter exponential families M. We show that if data are sampled i.i.d. from some distribution outside M, then the redundancy of any plug-in prequential code grows at rate larger than 1/2 ln(n) in the worst case. This means that plug-in codes, such as the Rissanen-Dawid ML code, may be inferior to other important universal codes such as the 2-part MDL, Shtarkov and Bayes codes, for which the redundancy is always 1/2 ln(n) + O(1). However, we also show that a slight modification of the ML plug-in code, "almost" in the model, does achieve the optimal redundancy even if the true distribution is outside M.

Submitted 3 February, 2010; originally announced February 2010.

arXiv:0903.5399 [pdf, ps, other]

Regret and Jeffreys Integrals in Exp. Families

Authors: Peter Grunwald, Peter Harremoes

Abstract: The problem of whether minimax redundancy, minimax regret and Jeffreys integrals are finite or infinite is discussed.

Submitted 31 March, 2009; originally announced March 2009.

arXiv:0809.2754 [pdf, ps, other]

Algorithmic information theory

Authors: Peter D. Grunwald, Paul M. B. Vitanyi

Abstract: We introduce algorithmic information theory, also known as the theory of Kolmogorov complexity. We explain the main concepts of this quantitative approach to defining `information'. We discuss the extent to which Kolmogorov's and Shannon's information theory have a common purpose, and where they are fundamentally different. We indicate how recent developments within the theory allow one to formally distinguish between `structural' (meaningful) and `random' information as measured by the Kolmogorov structure function, which leads to a mathematical formalization of Occam's razor in inductive inference. We end by discussing some of the philosophical implications of the theory.

Submitted 17 September, 2008; v1 submitted 16 September, 2008; originally announced September 2008.

Comments: 37 pages, 2 figures, pdf, in: Philosophy of Information, P. Adriaans and J. van Benthem, Eds., A volume in Handbook of the philosophy of science, D. Gabbay, P. Thagard, and J. Woods, Eds., Elsevier, 2008. In version 1 of September 16 the refs are missing. Corrected in version 2 of September 17

arXiv:0809.1017 [pdf, ps, other]

Entropy Concentration and the Empirical Coding Game

Authors: Peter Grunwald

Abstract: We give a characterization of Maximum Entropy/Minimum Relative Entropy inference by providing two `strong entropy concentration' theorems. These theorems unify and generalize Jaynes' `concentration phenomenon' and Van Campenhout and Cover's `conditional limit theorem'. The theorems characterize exactly in what sense a prior distribution Q conditioned on a given constraint, and the distribution P, minimizing the relative entropy D(P ||Q) over all distributions satisfying the constraint, are `close' to each other. We then apply our theorems to establish the relationship between entropy concentration and a game-theoretic characterization of Maximum Entropy Inference due to Topsoe and others.

Submitted 5 September, 2008; originally announced September 2008.

Comments: A somewhat modified version of this paper was published in Statistica Neerlandica 62(3), pages 374-392, 2008

arXiv:0807.1005 [pdf, ps, other]

Catching Up Faster by Switching Sooner: A Prequential Solution to the AIC-BIC Dilemma

Authors: Tim van Erven, Peter Grunwald, Steven de Rooij

Abstract: Bayesian model averaging, model selection and its approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We identify the "catch-up phenomenon" as a novel explanation for the slow convergence of Bayesian methods. Based on this analysis we define the switch distribution, a modification of the Bayesian marginal distribution. We show that, under broad conditions, model selection and prediction based on the switch distribution is both consistent and achieves optimal convergence rates, thereby resolving the AIC-BIC dilemma. The method is practical; we give an efficient implementation. The switch distribution has a data compression interpretation, and can thus be viewed as a "prequential" or MDL method; yet it is different from the MDL methods that are usually considered in the literature. We compare the switch distribution to Bayes factor model selection and leave-one-out cross-validation.

Submitted 7 July, 2008; originally announced July 2008.

Comments: A preliminary version of a part of this paper appeared at the NIPS 2007 conference

MSC Class: 62G99; 94A99

arXiv:0711.3235 [pdf, ps, other]

A Game-Theoretic Analysis of Updating Sets of Probabilities

Authors: Peter D. Grunwald, Joseph Y. Halpern

Abstract: We consider how an agent should update her uncertainty when it is represented by a set $P$ of probability distributions and the agent observes that a random variable $X$ takes on value $x$, given that the agent makes decisions using the minimax criterion, perhaps the best-studied and most commonly-used criterion in the literature. We adopt a game-theoretic framework, where the agent plays against a bookie, who chooses some distribution from $P$. We consider two reasonable games that differ in what the bookie knows when he makes his choice. Anomalies that have been observed before, like time inconsistency, can be understood as arising because different games are being played, against bookies with different information. We characterize the important special cases in which the optimal decision rules according to the minimax criterion amount to either conditioning or simply ignoring the information. Finally, we consider the relationship between conditioning and calibration when uncertainty is described by sets of probabilities.

Submitted 20 November, 2007; originally announced November 2007.

ACM Class: I.2.4

arXiv:math/0510276 [pdf, ps, other]

doi 10.1214/07-AOS532

An algorithmic and a geometric characterization of Coarsening At Random

Authors: Richard D. Gill, Peter D. Grunwald

Abstract: We show that the class of conditional distributions satisfying the coarsening at random (CAR) property for discrete data has a simple and robust algorithmic description based on randomized uniform multicovers: combinatorial objects generalizing the notion of partition of a set. However, the complexity of a given CAR mechanism can be large: the maximal "height" of the needed multicovers can be exponential in the number of points in the sample space. The results stem from a geometric interpretation of the set of CAR distributions as a convex polytope and a characterization of its extreme points. The hierarchy of CAR models defined in this way could be useful in parsimonious statistical modelling of CAR mechanisms, though the results also raise doubts in applied work as to the meaningfulness of the CAR assumption in its full generality.

Submitted 13 September, 2007; v1 submitted 13 October, 2005; originally announced October 2005.

Comments: 16 pages; accepted in this form for publication by Annals of Statistics

Report number: See also 0811.0683 (duplicate submission)

MSC Class: 62A01 (Primary); 62N01; 60A99; 68T37 (Secondary)

Journal ref: The Annals of Statistics 2008, Vol. 36, No. 5, 2409-2422

arXiv:cs/0510080 [pdf, ps, other]

When Ignorance is Bliss

Authors: Peter D. Grunwald, Joseph Y. Halpern

Abstract: It is commonly-accepted wisdom that more information is better, and that information should never be ignored. Here we argue, using both a Bayesian and a non-Bayesian analysis, that in some situations you are better off ignoring information if your uncertainty is represented by a set of probability measures. These include situations in which the information is relevant for the prediction task at hand. In the non-Bayesian analysis, we show how ignoring information avoids dilation, the phenomenon that additional pieces of information sometimes lead to an increase in uncertainty. In the Bayesian analysis, we show that for small sample sizes and certain prediction tasks, the Bayesian posterior based on a noninformative prior yields worse predictions than simply ignoring the given information.

Submitted 25 October, 2005; originally announced October 2005.

Comments: In Proceedings of the Twentieth Conference on Uncertainty in AI, 2004, pp. 226-234

ACM Class: I.2.4

arXiv:cs/0502004 [pdf, ps, other]

Asymptotic Log-loss of Prequential Maximum Likelihood Codes

Authors: Peter Grunwald, Steven de Rooij

Abstract: We analyze the Dawid-Rissanen prequential maximum likelihood codes relative to one-parameter exponential family models M. If data are i.i.d. according to an (essentially) arbitrary P, then the redundancy grows at rate c/2 ln n. We show that c=v1/v2, where v1 is the variance of P, and v2 is the variance of the distribution m* in M that is closest to P in KL divergence. This shows that prequential codes behave quite differently from other important universal codes such as the 2-part MDL, Shtarkov and Bayes codes, for which c=1. This behavior is undesirable in an MDL model selection setting.

Submitted 1 February, 2005; originally announced February 2005.

Comments: 22 pages, an abstract has been submitted to COLT 2005

ACM Class: E.4

arXiv:cs/0501028 [pdf, ps, other]

An Empirical Study of MDL Model Selection with Infinite Parametric Complexity

Authors: Steven de Rooij, Peter Grunwald

Abstract: Parametric complexity is a central concept in MDL model selection. In practice it often turns out to be infinite, even for quite simple models such as the Poisson and Geometric families. In such cases, MDL model selection as based on NML and Bayesian inference based on Jeffreys' prior cannot be used. Several ways to resolve this problem have been proposed. We conduct experiments to compare and evaluate their behaviour on small sample sizes. We find interestingly poor behaviour for the plug-in predictive code; a restricted NML model performs quite well but it is questionable if the results validate its theoretical motivation. The Bayesian model with the improper Jeffreys' prior is the most dependable.

Submitted 14 January, 2005; originally announced January 2005.

Comments: 23 pages, 11 graphs

ACM Class: E.3; G.4

arXiv:cs/0410002

Shannon Information and Kolmogorov Complexity

Authors: Peter Grunwald, Paul Vitanyi

Abstract: We compare the elementary theories of Shannon information and Kolmogorov complexity, the extent to which they have a common purpose, and where they are fundamentally different. We discuss and relate the basic notions of both theories: Shannon entropy versus Kolmogorov complexity, the relation of both to universal coding, Shannon mutual information versus Kolmogorov ('algorithmic') mutual information, probabilistic sufficient statistic versus algorithmic sufficient statistic (related to lossy compression in the Shannon theory versus meaningful information in the Kolmogorov theory), and rate distortion theory versus Kolmogorov's structure function. Part of the material has appeared in print before, scattered through various publications, but this is the first comprehensive systematic comparison. The last mentioned relations are new.

Submitted 1 October, 2004; originally announced October 2004.

Comments: Survey, LaTeX 54 pages, 3 figures, Submitted to IEEE Trans Information Theory

ACM Class: E.4, H.1.1

Journal ref: There are some errors in this paper draft; when in doubt see the textbook Li, Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, 1993, 1997, 2008, 2019

arXiv:math/0406221

Suboptimal behaviour of Bayes and MDL in classification under misspecification

Authors: Peter Grunwald, John Langford

Abstract: We show that forms of Bayesian and MDL inference that are often applied to classification problems can be *inconsistent*. This means there exists a learning problem such that for all amounts of data the generalization errors of the MDL classifier and the Bayes classifier relative to the Bayesian posterior both remain bounded away from the smallest achievable generalization error.

Submitted 10 June, 2004; originally announced June 2004.

Comments: This is a slightly longer version of our paper at the COLT (Computational Learning Theory) 2004 Conference, containing two extra pages of discussion of the main results

MSC Class: 62A01; 68T05; 68T10

arXiv:math/0406077

A tutorial introduction to the minimum description length principle

Authors: Peter Grunwald

Abstract: This tutorial provides an overview of and introduction to Rissanen's Minimum Description Length (MDL) Principle. The first chapter provides a conceptual, entirely non-technical introduction to the subject. It serves as a basis for the technical introduction given in the second chapter, in which all the ideas of the first chapter are made mathematically precise. The main ideas are discussed in great conceptual and technical detail. This tutorial is an extended version of the first two chapters of the collection "Advances in Minimum Description Length: Theory and Application" (edited by P. Grunwald, I.J. Myung and M. Pitt, to be published by the MIT Press, Spring 2005).

Submitted 4 June, 2004; originally announced June 2004.

Comments: 80 pages 5 figures Report with 2 chapters

MSC Class: 6201; 6801; 68T05; 68T10; 9401

arXiv:cs/0306124

Updating Probabilities

Authors: Peter D. Grunwald, Joseph Y. Halpern

Abstract: As examples such as the Monty Hall puzzle show, applying conditioning to update a probability distribution on a "naive space", which does not take into account the protocol used, can often lead to counterintuitive results. Here we examine why. A criterion known as CAR ("coarsening at random") in the statistical literature characterizes when "naive" conditioning in a naive space works. We show that the CAR condition holds rather infrequently, and we provide a procedural characterization of it, by giving a randomized algorithm that generates all and only distributions for which CAR holds. This substantially extends previous characterizations of CAR. We also consider more generalized notions of update such as Jeffrey conditioning and minimizing relative entropy (MRE). We give a generalization of the CAR condition that characterizes when Jeffrey conditioning leads to appropriate answers, and show that there exist some very simple settings in which MRE essentially never gives the right results. This generalizes and interconnects previous results obtained in the literature on CAR and MRE.

Submitted 23 June, 2003; originally announced June 2003.

Comments: This is an expanded version of a paper that appeared in Proceedings of the Eighteenth Conference on Uncertainty in AI, 2002, pp. 187-196. To appear, Journal of AI Research

ACM Class: I.2.4

Showing 1–43 of 43 results for author: Grünwald, P