Michael Evans
Department of Statistical Sciences, University of Toronto
Gun Ho Jang
Ontario Institute for Cancer Research
Abstract
Relative belief inferences are shown to arise as Bayes rules or limiting Bayes
rules. These inferences are invariant under reparameterizations and possess a
number of optimal properties. In particular, relative belief inferences are
based on a direct measure of statistical evidence.
Key words and phrases: Bayesian inference, evidential inference,
statistical evidence, relative belief, loss functions, Bayesian unbiasedness,
Bayes rules, admissibility, limits of Bayes rules.
1 Introduction
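Consider a sampling model for data $x$, given by a collection of densities
$\{f_\theta : \theta \in \Theta\}$ with respect to a support measure $\mu$
on sample space $\mathcal{X}$, and a proper prior, given by density $\pi$
with respect to support measure $\nu$ on $\Theta$. When the data $x$
is observed these ingredients lead to the posterior
distribution on $\Theta$ with density given by $\pi(\theta \mid x) = \pi(\theta) f_\theta(x)/m(x)$ with respect to support measure $\nu$, where
$m(x) = \int_\Theta \pi(\theta) f_\theta(x)\, \nu(d\theta)$ is the prior
predictive density of the data. In addition, there is a quantity of interest
$\psi = \Psi(\theta)$, where $\Psi$ is a map defined on $\Theta$, for which
inferences, such as an estimate of $\psi$ or an assessment of a hypothesis
$H_0 : \Psi(\theta) = \psi_0$, are required. Let $\pi_\Psi$ denote the
marginal prior density of $\psi$ and $m(\cdot \mid \psi)$ be the conditional prior predictive
of the data after integrating out the nuisance parameters via the conditional
distribution of $\theta$ given $\Psi(\theta) = \psi$. Bayesian inferences for $\psi$
are then based on these ingredients alone or by adding a loss function $L$.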
There are several different general approaches to deriving inferences based on
such ingredients. The two most commonly used are MAP-based inferences and
Bayesian decision theory. MAP-based inferences are based implicitly on
assuming that posterior probabilities can measure statistical evidence and do
not use a loss function explicitly. Bayesian decision theory seeks inferences
that are optimal with respect to risk which is defined as the expected loss
incurred by an inference under the joint distribution of Such optimal inferences are referred to as Bayes
rules and they generally exist. A concern with MAP-based inferences is that it
is not clear that posterior probabilities do measure evidence in addition to
measuring belief. A concern with decision-theory inferences is that, while the
model and prior are checkable against the data through model checking and
checking for prior-data conflict, it is not clear how to check the loss
function which can be viewed as being a somewhat arbitrary choice. In both
cases this renders such inferences of questionable validity for scientific applications.
Another approach to deriving Bayesian inferences is through relative belief.
Relative belief refers to how beliefs change from a priori to a posteriori.
This leads to a more natural approach to characterizing statistical evidence:
since it is the data that leads to change in beliefs from a priori to a
posteriori, it is this change that tells us whether evidence has been found in
favor of or against some specific value In essence, in this approach
it is the a posteriori beliefs relative to the a priori beliefs that determine
inferences and not just a posteriori beliefs alone. Also, a loss function
plays no role in determining the inferences. Using relative belief as the
basis for deriving inferences produces statistical methodology with a number
of attractive features.
A historical theme in statistical research has been to seek an acceptable
definition of statistical evidence and, once found, use this to derive
inferences. For example, this is the focus of much of the work of Alan
Birnbaum, see Birnbaum (1962), who sought such a definition within the context
of frequentist inference. Frequentist theory, as opposed to Bayesian theory,
uses the ingredients together
with the idea that inferences should be graded according to their behavior in
hypothetical repeated sampling experiments and, hopefully, this would lead to a
prescription of the inferences. Despite some impressive accomplishments, it is
fair to say that Birnbaum’s program did not succeed as there is still no such
generally acceptable definition of statistical evidence within the frequentist
context. The pure likelihood theory of Royall (1997) is also concerned with
basing inference on a definition of statistical evidence, and also uses just
the ingredients Pure likelihood
theory invokes the likelihood principle to assert that the likelihood function
itself is the appropriate characterization of statistical evidence and bases
all inferences on the likelihood with no appeal to repeated sampling.
Frequency theory and likelihood theory have some appealing characteristics,
but both leave gaps in their approach to statistical reasoning. In particular,
inferences for marginal parameters can be problematical.
These issues are discussed in more detail in Evans (2024).
While not concerned directly with statistical evidence, Bayesian decision
theory has some obvious virtues. In particular, there is an axiomatization due
to Savage (1971). These axioms suggest that to not follow this path in
carrying out a statistical analysis is to commit an error. One may not find
the specific axioms of Savage acceptable, but it is difficult to argue that
the subject of Statistics does not need such an axiomatic formulation as
otherwise almost any statistical analysis seems justifiable.
The addition of a prior is what leads to a clear definition of statistical
evidence and so, provided one accepts the usage of priors, relative belief
essentially solves Birnbaum’s problem. The purpose of this paper is to review
and extend results that show that relative belief inferences can be considered
as arising within the context of Bayesian decision theory. This, of course,
requires the use of a loss function and it will be seen that the loss
functions that are used are checkable against the data and so appropriate for
scientific applications. So these inferences satisfy two of the great themes
of statistical research over the years, namely, they are evidence based and
yet justifiable within the context of decision theory.
It can be shown, for example, see Bernardo and Smith (2000), that MAP
inferences arise as the limits of Bayes rules via a sequence of loss functions
(1)
where and is the ball of radius
centered at These inferences, however, are not invariant
under reparameterizations. It is shown here that relative belief inferences
also arise via a sequence of loss functions similar to (1) but based
on the prior and these inferences are invariant. In general, Bayes rules will
also not be invariant under reparameterizations. Robert (1996) proposed using
the intrinsic loss function based on a measure of distance between sampling
distributions as Bayes rules with respect to such losses are invariant.
Bernardo (2005) proposed using the intrinsic loss function based on the
Kullback-Leibler divergence between
and When
the intrinsic loss function is given by For a general marginal parameter
the intrinsic loss function is These loss functions are
intrinsic because they are based on the sampling model alone and they are
checkable via model checking.
Another possibility for an intrinsic loss function is to base the loss
function on the prior and this is in essence how the loss function arises in
the relative belief context. It is to be stressed, however, that the essential
ingredient of this approach is the clear characterization of what is meant by
statistical evidence and the loss function approach is not essential for its
justification. It is, however, a satisfying result that relative belief can be
placed into the decision-theoretic context with the loss being checkable via
checking for prior-data conflict. Furthermore, the loss function used has some
direct appeal.
In some contexts relative belief inferences are Bayes rules, but in a general
context they are seen to arise as the limits of Bayes rules. This approach has
some historical antecedents. For example, in Le Cam (1953) it is shown that
the MLE is asymptotically Bayes but this is for a fixed loss function, with
increasing amounts of data and a sequence of priors. In the context discussed
here there is a fixed amount of data, a fixed model and prior but there is a
sequence of loss functions all based on the single fixed prior.
While it is preferable in many applications to state the inferences solely
based on the evidence in the data, one can still consider inferences that
possess some kind of optimality with respect to loss. Any discrepancy can then
be justified based on particular characteristics of the application, e.g.,
evidence is obtained that a drug generally prevents the progression of a
disease but the expense and side effects are too great to warrant its usage.
So it is not being suggested here that decision-theoretic inferences are not
relevant as indeed the concept of utility or loss is a significant component
of many applications.
Section 2 is concerned with describing the general characteristics of the
three approaches to deriving Bayesian inferences. Section 3 shows how relative
belief estimation and prediction inferences can be seen to arise from decision
theory and Section 4 does this for credible regions and hypothesis assessment.
Throughout the paper the probability measures associated with a density are
denoted by the same letter but capitalized. All proofs of theorems and
corollaries are in the Appendix excepting the case where is
finite as these are quite straightforward and supply motivation for the more
complicated contexts. The overall goal of the paper is to show that relative
belief inferences can arise through decision-theoretic considerations even
though their primary motivation is through characterizing statistical
evidence. In particular, it is shown here that relative belief estimators, as
used in practice, are admissible. Some of the discussion here has appeared in
the book Evans (2015) and is included to provide a complete exposition of this relationship.
2 Bayesian inference
The various approaches to deriving Bayesian inferences are now described in
some detail.
2.1 MAP inferences
The highest posterior density (hpd), or MAP-based, approach to determining
inferences constructs credible regions of the form
(2)
where is the marginal posterior density with respect
to a support measure on and is
chosen so that It follows from (2) that, to
assess the hypothesis then we can use the tail
probability given by
Furthermore, the class of sets is naturally “centered” at the
posterior mode (when it exists uniquely) as converges to this
point as The use of the posterior mode as an estimator
is commonly referred to as MAP (maximum a posteriori) estimation. We
can then think of the size of the set say for
as a measure of how accurate the MAP estimator is in a given context.
Furthermore, when is an open subset of a Euclidean space, then
minimizes volume among all -credible regions.
It is well-known, however, that hpd inferences suffer from a defect. In
particular, in the continuous case MAP inferences are not invariant under
reparameterizations. For example, this means that if is the
MAP estimate of , then it is not necessarily true that is the MAP estimate of when
is a 1-1, smooth transformation. The noninvariance of a statistical procedure
seems very unnatural as it implies that the statistical analysis depends on
the parameterization and typically there does not seem to be a good reason for
this. Note too that estimates based upon taking posterior expectations will
also suffer from this lack of invariance. It is also the case that
MAP inferences are not based on a direct characterization of statistical
evidence. Both of these issues motivate the development of relative belief inferences.
2.2 Bayesian decision theory
An ingredient that is commonly added to is a loss function, namely, satisfying whenever and
only when The goal is to find a
procedure, say that in some sense minimizes the
loss based on the joint distribution of
Given the assumptions on the loss function the loss function can instead be
thought of as a map with iff and
the ingredients can be represented as
The goal of a decision analysis is then to find a decision function
that minimizes the prior
risk
where is the posterior risk. Such a is called a
Bayes rule and clearly a that minimizes for each
is a Bayes rule. Further discussion of Bayesian decision theory can be
found in Berger (1985).
As noted in Bernardo (2005) a decision formulation also leads to credible
regions for namely, a -lowest posterior loss credible
region is defined by
(3)
where Note that in (3) is interpreted as the
decision function that takes the value constantly in Clearly as
the set converges to the value of a Bayes
rule at For example, with quadratic loss the Bayes rule is given by the
posterior mean and a -lowest posterior loss region is the smallest
sphere centered at the mean containing at least of the posterior probability.
2.3 Relative belief inferences
Relative belief inferences, like MAP inferences, are based on the ingredients
Note that
underlying both approaches is the principle (axiom) of conditional probability
that says that initial beliefs about $\psi$, as expressed by the prior $\pi_\Psi$,
must be replaced by conditional beliefs as expressed by the
posterior $\pi_\Psi(\cdot \mid x)$. In this approach, however, a measure of
statistical evidence is used given by the relative belief ratio
$$RB_\Psi(\psi \mid x) = \frac{\pi_\Psi(\psi \mid x)}{\pi_\Psi(\psi)}. \tag{4}$$
The relative belief ratio produces the following conclusions: if $RB_\Psi(\psi \mid x) > 1$, then there is evidence in favor of $\psi$ being the true
value; if $RB_\Psi(\psi \mid x) < 1$, there is evidence against $\psi$ being
the true value; and if $RB_\Psi(\psi \mid x) = 1$, then there is no evidence
either way. These implications follow from a very simple principle of inference.
Principle of Evidence: for probability model if
is observed to be true where then there is
evidence in favor of being true if
evidence against being true if and no
evidence either way if
This principle seems obvious when is a discrete
probability measure. For the continuous case, where
let be a sequence of neighborhoods of converging
nicely to as (see Rudin (1974)), then under weak
conditions, e.g., is continuous and positive at
and this justifies the general interpretation of as a
measure of evidence. The relative belief ratio determines the inferences.
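As a concrete illustration of how the relative belief ratio drives the inferences, the following minimal sketch works through a finite parameter space. It assumes the ratio is the posterior probability divided by the prior probability, and that the estimate and plausible region are, respectively, the maximizer of the ratio and the set of values with ratio greater than 1. The function names, the three-point prior and the binomial data (7 successes in 10 trials) are hypothetical choices made only for this example.

```python
from typing import Dict, List, Tuple


def posterior_and_rb(prior: Dict[float, float],
                     likelihood: Dict[float, float]
                     ) -> Tuple[Dict[float, float], Dict[float, float]]:
    """Return the posterior and the relative belief ratio on a finite space."""
    # Prior predictive of the observed data (normalizing constant).
    m = sum(prior[psi] * likelihood[psi] for psi in prior)
    posterior = {psi: prior[psi] * likelihood[psi] / m for psi in prior}
    # Relative belief ratio: posterior belief relative to prior belief.
    rb = {psi: posterior[psi] / prior[psi] for psi in prior}
    return posterior, rb


def rb_inferences(prior: Dict[float, float],
                  likelihood: Dict[float, float]
                  ) -> Tuple[float, List[float], float]:
    """Relative belief estimate, plausible region and its posterior content."""
    posterior, rb = posterior_and_rb(prior, likelihood)
    estimate = max(rb, key=rb.get)                      # maximum evidence in favor
    plausible = [psi for psi in rb if rb[psi] > 1.0]    # evidence in favor
    content = sum(posterior[psi] for psi in plausible)  # belief in the region
    return estimate, plausible, content


# Hypothetical example: success probability restricted to three values,
# with 7 successes observed in 10 Bernoulli trials.
prior = {0.2: 0.1, 0.5: 0.8, 0.8: 0.1}
likelihood = {p: p ** 7 * (1 - p) ** 3 for p in prior}

posterior, rb = posterior_and_rb(prior, likelihood)
estimate, plausible, content = rb_inferences(prior, likelihood)
map_estimate = max(posterior, key=posterior.get)

print("relative belief ratios:", rb)
print("relative belief estimate:", estimate)
print("MAP estimate:", map_estimate)
print("plausible region:", plausible, "posterior content:", content)
```

In this example the prior concentrates on 0.5, so the posterior mode stays there, while the relative belief estimate moves to the value the data support most strongly.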
A natural estimate of is the relative belief estimate
as it has the maximum evidence in favor. To assess the accuracy of there is the plausible region the set of values having evidence in favor
of being the true value. The size of together with its
posterior content which measures the belief
that the true value is in provide the assessment of the
accuracy. So, if is “small” and then is to be considered as an accurate
estimate of but not otherwise. A relative belief -credible
region
where for can also be quoted provided The containment is necessary as otherwise
would contain a value for which there is evidence
against being the true value.
For assessing the hypothesis the value
indicates whether there is evidence in favor of or
against The strength of this evidence can be measured by the
posterior probability as this measures the
belief in what the evidence says. So, if and
then there is strong evidence that
is true while, when and there is strong evidence that is
false. Since can be small, even 0 in the continuous
case, it makes more sense to measure the strength of the evidence in such a
case by
If and
then the evidence is strong that is the true value as there is
small belief that the true value of has more evidence in its favor than
If and then the evidence is strong that is not the
true value as there is large belief that the true value of has more
evidence in its favor than Actually, there is no reason to quote a
single number to measure the strength and both and can be quoted when relevant.
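The strength calculation described above can be sketched for a finite parameter space. The sketch assumes the strength is the posterior probability that the true value has a relative belief ratio no greater than that of the hypothesized value, which is how the surrounding paragraph describes it. The function name and the numerical example (a three-point prior with 7 successes in 10 Bernoulli trials) are hypothetical.

```python
from typing import Dict, Tuple


def evidence_and_strength(prior: Dict[float, float],
                          likelihood: Dict[float, float],
                          psi0: float) -> Tuple[float, float]:
    """Return the relative belief ratio at psi0 and the strength of that evidence."""
    m = sum(prior[psi] * likelihood[psi] for psi in prior)
    posterior = {psi: prior[psi] * likelihood[psi] / m for psi in prior}
    rb = {psi: posterior[psi] / prior[psi] for psi in prior}
    # Posterior probability of values with no more evidence in favor than psi0.
    strength = sum(posterior[psi] for psi in prior if rb[psi] <= rb[psi0])
    return rb[psi0], strength


# Hypothetical example data.
prior = {0.2: 0.1, 0.5: 0.8, 0.8: 0.1}
likelihood = {p: p ** 7 * (1 - p) ** 3 for p in prior}

rb_half, strength_half = evidence_and_strength(prior, likelihood, 0.5)
rb_low, strength_low = evidence_and_strength(prior, likelihood, 0.2)

print("psi0 = 0.5: RB =", rb_half, "strength =", strength_half)
print("psi0 = 0.2: RB =", rb_low, "strength =", strength_low)
```

For the hypothesized value 0.5 the ratio exceeds 1 and the strength is fairly large, so the evidence in favor is reasonably strong. For 0.2 the ratio is below 1 and the strength is close to 0, which indicates strong evidence against.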
An important aspect of both and is what happens as the amount of data increases. To
ensure that these behave appropriately, namely, when is false (true) and it is necessary to take into account the
difference that matters By this we mean that there is a distance
measure on such that if then in terms of the application, these
values are considered equivalent. Such a always exists because
measurements are always taken to finite accuracy. For example, if is
real-valued, then there is a grid of values separated by and inferences are
determined using the relative belief ratios of the intervals In effect, is now When the computations
are carried out in this way then and do what is required. As a particular instance of this
see the results in Section 4 where such a discretization plays a key role.
It is easy to see that the class of relative belief credible regions
for is independent of the
marginal prior When a value is specified,
however, the set does depend on through
So the form of relative belief inferences about is
completely robust to the choice of but the quantification of the
uncertainty in the inferences is not. For example, when then is the MLE while, in general,
is the maximizer of the integrated likelihood
Similarly, relative belief regions are likelihood regions in the case of the
full parameter, and integrated likelihood regions generally. As such,
likelihood regions can be seen as essentially Bayesian in character with a
clear and precise characterization of evidence through the relative belief
ratio and now have probability assignments through the posterior. It is the
case, however, that a relative belief ratio while
proportional to an integrated likelihood, cannot be multiplied by an arbitrary
positive constant, as with a likelihood, without losing its interpretation in
measuring statistical evidence. It has been established in Al-Labadi and Evans
(2017) that relative belief inferences for are optimally robust to the
prior
As can be seen from (4), relative belief inferences are always
invariant under smooth reparameterizations and this is at least one reason why
they are preferable to MAP inferences. It is the case, however, that any rule
for measuring evidence which satisfies the principle of evidence also produces
valid estimates as these lie in and so will have the same
“accuracy” as For example, if instead of the relative belief
ratio the difference was used as the
measure of evidence with cut-off 0, then this satisfies the principle of
evidence but the estimate is no longer necessarily invariant under
reparameterizations. The Bayes factor with cut-off 1 is also a valid measure
of evidence but there are a number of reasons why the relative belief ratio is
to be preferred to the Bayes factor for general inferences, see Al-Labadi,
Alzaatreh and Evans (2024).
3 Estimation: discrete parameter space
The following theorem presents the basic definition of the loss function when
is finite and establishes an important optimality result. The indicator
function for the set is denoted
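Theorem 1. Suppose that $\pi_\Psi(\psi) > 0$ for every $\psi \in \Psi$,
where $\Psi$ is finite with the support measure on $\Psi$ equal
to counting measure. Then for the loss function
$$L_{RB}(\theta, \psi) = \frac{I_{\{\Psi(\theta)\}^c}(\psi)}{\pi_\Psi(\Psi(\theta))} \tag{5}$$
the relative belief estimator is a Bayes rule.
Proof: We have that the posterior risk of a decision function $\delta$ at $x$ is
$$\sum_{\psi \in \Psi} \frac{\pi_\Psi(\psi \mid x)}{\pi_\Psi(\psi)}\, I_{\{\psi\}^c}(\delta(x)) = \sum_{\psi \in \Psi} RB_\Psi(\psi \mid x) - RB_\Psi(\delta(x) \mid x). \tag{6}$$
Since $\Psi$ is finite, the first term in (6) is finite and a
Bayes rule at $x$ is given by the value $\delta(x)$ that maximizes
$RB_\Psi(\delta(x) \mid x)$ in the second
term. Therefore, the relative belief estimator is a Bayes rule.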
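The argument of Theorem 1 can be checked numerically on a small finite example. The sketch below assumes the loss is 0 for a correct decision and the reciprocal of the prior probability of the true value otherwise, as the later discussion of the loss describes, and compares it with ordinary 0-1 loss. All names and the example prior and data are hypothetical.

```python
from typing import Callable, Dict


def posterior_risk(posterior: Dict[float, float],
                   loss: Callable[[float, float], float],
                   d: float) -> float:
    """Expected loss of deciding d, averaged over the posterior."""
    return sum(posterior[psi] * loss(psi, d) for psi in posterior)


def bayes_rule(posterior: Dict[float, float],
               loss: Callable[[float, float], float]) -> float:
    """Decision minimizing the posterior risk over the finite parameter space."""
    return min(posterior, key=lambda d: posterior_risk(posterior, loss, d))


# Hypothetical example: three-point prior, 7 successes in 10 Bernoulli trials.
prior = {0.2: 0.1, 0.5: 0.8, 0.8: 0.1}
likelihood = {p: p ** 7 * (1 - p) ** 3 for p in prior}
m = sum(prior[p] * likelihood[p] for p in prior)
posterior = {p: prior[p] * likelihood[p] / m for p in prior}
rb = {p: posterior[p] / prior[p] for p in prior}


def loss_rb(psi: float, d: float) -> float:
    # Assumed prior-based loss: an error costs 1 / (prior probability of the truth).
    return 0.0 if psi == d else 1.0 / prior[psi]


def loss_01(psi: float, d: float) -> float:
    # Ordinary 0-1 loss, for comparison.
    return 0.0 if psi == d else 1.0


rule_rb = bayes_rule(posterior, loss_rb)
rule_01 = bayes_rule(posterior, loss_01)
print("Bayes rule under prior-based loss:", rule_rb)
print("Bayes rule under 0-1 loss:", rule_01)
```

Under the prior-based loss the posterior risk of a decision equals the sum of all relative belief ratios minus the ratio at that decision, so minimizing the risk is the same as maximizing the ratio. Under 0-1 loss the same computation returns the posterior mode.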
The loss function seems very natural. For beliefs about the true
value of are expressed by the prior and so values where
is very low and is indeed a false value, would be
quite misleading if the inferences pointed to such a value. So it is
appropriate for such values to bear large losses. In a sense the statistician
is acknowledging what such values are by the choice of prior. Of course, the
prior may be wrong in the sense that the bulk of its mass is placed in a
region where the true value of does not lie. This is why checking for
prior-data conflict, before inference is carried out, is always recommended.
Procedures for checking a prior are discussed in Evans and Moshonov (2006) and
Nott et al. (2020) and an approach to replacing a prior found to be at fault
is developed in Evans and Jang (2011). The loss motivates the other
losses for relative belief discussed here so this comment applies to those
losses as well.
The prior risk of satisfies
(7)
the sum of the conditional prior error probabilities over all values.
If instead the loss function is taken to be as in (1), then virtually the same
proof as Theorem 1 establishes that is a Bayes rule with respect
to this loss and the prior risk equals
(8)
the prior probability of making an error. Both and are
two-valued loss functions but, when an incorrect decision is made, the loss is
constant in for while it equals the reciprocal of the
prior probability of for . So penalizes an
incorrect decision much more severely when the true value of is
in the tails of the prior. Note that when
is uniform. It is seen too that (7) is an upper bound on (8)
so controlling losses based on automatically controls the losses
based on
As already noted, is proportional to the integrated
likelihood of So, under the conditions of Theorem 1, the maximum
integrated likelihood estimator is a Bayes rule. Furthermore, the Bayes rule
is the same for every choice of and only depends on the full
prior through the conditional prior placed on the
nuisance parameters. When then is the MLE of
and so the MLE of is a Bayes rule for every prior
Note that when then iff so when and
otherwise. This is the classical context for hypothesis testing and can be viewed as acceptance of the hypothesis and as rejection of
Theorem 1 establishes that relative belief provides a Bayes rule for
the hypothesis testing problem.
The loss function (5) does not provide meaningful results when
is infinite as (7) shows that will be
infinite. So we modify (5) via a parameter and define the
loss function
(9)
Note that is bounded by This loss function is like
(5) but does not allow for arbitrarily large losses. Without loss of
generality we can restrict to a sequence of values converging to
0.
Theorem 2. Suppose that for every
that is countable with
equal to counting measure and that is the unique maximizer of
for all For the loss function (9) Bayes
rule then as
for every
The proof of Theorem 2 also establishes the following
result.
Corollary 3. For all sufficiently small the value of
a Bayes rule at is given by
The following is an immediate consequence of Theorem 1 and Corollary 3 as
is a Bayes rule.
Corollary 4. is an admissible estimator with
respect to the loss when is finite and the loss
with sufficiently small, when is
countable.
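The behavior described in Theorem 2 and Corollary 3 can be explored numerically. The sketch below makes several assumptions that are not taken from the text: the bounded loss is taken to be 0 for a correct decision and 1 / max(eta, prior probability of the true value) otherwise; the parameter is a positive integer with prior proportional to 2^(-psi); the data is a single Poisson(psi) count equal to 6; and the countable space is truncated at K = 60 purely for computation.

```python
import math

K = 60                       # truncation point (numerical convenience only)
support = range(1, K + 1)

# Assumed prior, proportional to 2^(-psi) on the truncated support.
raw = {k: 2.0 ** (-k) for k in support}
total = sum(raw.values())
prior = {k: raw[k] / total for k in support}

# Poisson(psi) likelihood at the observed count x; the factor 1/x! cancels.
x = 6
lik = {k: k ** x * math.exp(-k) for k in support}
m = sum(prior[k] * lik[k] for k in support)
posterior = {k: prior[k] * lik[k] / m for k in support}


def bayes_rule_bounded(eta: float) -> int:
    """Bayes rule for the assumed bounded loss with bound 1 / eta.

    The posterior risk of d is a constant minus posterior[d] / max(eta, prior[d]),
    so minimizing the risk is the same as maximizing that ratio.
    """
    return max(support, key=lambda d: posterior[d] / max(eta, prior[d]))


rb_estimate = max(support, key=lambda k: posterior[k] / prior[k])
map_estimate = max(support, key=lambda k: posterior[k])

for eta in (1.0, 0.1, 0.01, 1e-6):
    print("eta =", eta, "Bayes rule =", bayes_rule_bounded(eta))
print("relative belief estimate:", rb_estimate, "MAP estimate:", map_estimate)
```

In this sketch, when eta is at least as large as every prior probability the rule coincides with the posterior mode, and once eta falls below the prior probability of the relative belief estimate the rule equals that estimate and stays there.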
In a general estimation problem is risk unbiased with respect to a
loss function if for all This says
that on average is closer to the true value than any other value
when we interpret as a measure of distance between
and A definition of Bayesian unbiasedness
for with respect to is that
as this retains the idea of being closer on average to the true value than a
false value. Consider now a family of loss functions of the form
(10)
where is a nonnegative function satisfying and note that this includes
and when is finite and
Theorem 5. If is finite or countable, then
is Bayesian unbiased under the loss function (10).
Suppose after observing it is desired to predict a future (or concealed)
value where a density
with respect to support measure on and it
is assumed that the true value of in the model for gives the true
value of The prior predictive density of is given by
while the posterior predictive
density is The relative belief ratio for a future value
is thus and the relative belief prediction,
namely, the value maximizing is denoted
When is finite then, with basically the same argument as in
Theorem 1, is a Bayes rule under the loss function Also, it can be proved that
is a limit of Bayes rules when is countable.
Consider now a common application where is finite.
Example 1. Classification
For a classification problem there are categories prescribed by some function where for each Estimating is then equivalent to classifying the
data as having come from one of the distributions in the classes specified by
The standard Bayesian solution to this problem is to
use as the classifier. From (8) we have that
minimizes the prior probability of misclassification while
from (7) minimizes the sum of the probabilities of
misclassification. The essence of the difference is that treats
the errors of misclassification equally while weights the
errors by their prior probabilities.
The following shows that minimizing the sum of the error probabilities is
often more appropriate than minimizing the weighted sum. Suppose and
Bernoulli or Bernoulli with and representing the known
proportions of individuals either labelled coming from population 0 or 1. For
example, consider as the probability of a positive diagnostic test
for a disease in the nondiseased population while is this
probability for the diseased population. Suppose that is
very small, indicating that the test is successful in identifying the disease
while not yielding many false positives, and that is very small, so
the disease is rare. The question then is to assign a randomly chosen
individual to a population based on the results of their test.
The posterior is given by and
Therefore,
This implies that will always classify a person to the
nondiseased population when is small enough, e.g., when and By contrast, in this
situation, always classifies an individual with a positive test to
the diseased population and to the nondiseased population for a negative test.
Since is the Bernoulli distribution, when
and is small enough,
This illustrates clearly the difference between these two procedures as
does better than on the diseased population when
is small and is large as would be the case for a good
diagnostic. Of course minimizes the overall error rate but at the
price of ignoring the most important class in this problem. Note that this
example can be extended to the situation where we need to estimate the
based on samples from the respective populations but this will not
materially affect the overall conclusions.
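The diagnostic-test comparison can be made concrete with a short computation. In this sketch `eps` is the prior probability of the diseased class, and `p0` and `p1` are the probabilities of a positive test in the nondiseased and diseased populations. These names and the numerical values are hypothetical choices for illustration, not values from the example.

```python
from typing import Dict


def classify(x: int, eps: float, p0: float, p1: float, rule: str) -> int:
    """Classify test result x (1 = positive) to class 0 (nondiseased) or 1 (diseased)."""
    prior = {0: 1.0 - eps, 1: eps}
    lik = {0: p0 if x == 1 else 1.0 - p0,
           1: p1 if x == 1 else 1.0 - p1}
    m = sum(prior[c] * lik[c] for c in prior)
    posterior = {c: prior[c] * lik[c] / m for c in prior}
    if rule == "map":
        score = posterior                                    # maximize posterior
    elif rule == "rb":
        score = {c: posterior[c] / prior[c] for c in prior}  # maximize RB ratio
    else:
        raise ValueError("rule must be 'map' or 'rb'")
    return max(score, key=score.get)


def error_rates(eps: float, p0: float, p1: float, rule: str) -> Dict[int, float]:
    """Probability of misclassification conditional on each true class."""
    rates = {}
    for c, p_pos in ((0, p0), (1, p1)):
        rates[c] = sum(prob for x, prob in ((1, p_pos), (0, 1.0 - p_pos))
                       if classify(x, eps, p0, p1, rule) != c)
    return rates


eps, p0, p1 = 0.01, 0.05, 0.95   # rare disease, accurate test (hypothetical values)
for rule in ("map", "rb"):
    err = error_rates(eps, p0, p1, rule)
    total_error = (1.0 - eps) * err[0] + eps * err[1]
    print(rule, "class errors:", err,
          "sum:", err[0] + err[1], "overall error rate:", total_error)
```

With these values the MAP classifier assigns every individual to the nondiseased class, which gives the smaller overall error rate but misclassifies every diseased individual. The relative belief classifier follows the test result and has the smaller sum of class-conditional error probabilities.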
Consider now a situation where is such that Bernoulli where and are
known but is unknown with prior This is a generalization of
the previous discussion where was assumed to be known. Then based
on a sample from the joint distribution
the goal is to predict the value for a newly observed
The prior of is and, if beta so the
prior predictive of is Bernoulli The
posterior predictive density of equals, where
It follows that, suppressing the dependence on the data,
(13)
(16)
Note that and are identical whenever
From these formulas it is apparent that a substantial difference will arise
between and when one of or is much bigger
than the other. As in Example 1 these correspond to situations where we
believe that or is very small. Suppose we take
and let be relatively large, as this corresponds to knowing
a priori that is very small. Then (16)
implies that and so whenever A
similar conclusion arises when we take and
To see what kind of improvement is possible consider a simulation study. Let
be a density, be a density, and the
prior on be beta Table 1 presents the Bayes risks for
and for various choices of when When
they are equivalent but we see that as rises the performance
of deteriorates while improves. Large values of
correspond to having information that is small. When
about of the prior probability is to the left of with
about of the prior probability is to the left of and
with about of the prior probability is to the left of
We see that the misclassification rates for the small group
stay about the same for as increases while they deteriorate
markedly for as the MAP procedure basically ignores the small group.
Table 1: Conditional prior probabilities of misclassification for and for various values of in Example 3 when , , and =10.
We also investigated other choices for and There is very little
change as increases. When moves towards the error rates go up
and go down as moves away from 0, as one would expect. It is the case,
however, that always dominates
4 Estimation: continuous parameter space
When has a continuous prior distribution the argument in Theorem 2 does
not work as There are several possible
ways to proceed but one approach is to use a discretization of the problem
that uses Theorem 2. For this we will assume that the spaces involved are
locally Euclidean, mappings are sufficiently smooth and take the support
measures to be the analogs of Euclidean volume on the respective spaces. While
the argument provided applies quite generally, it is simplified here by taking
all spaces to be open subsets of Euclidean spaces and the support measures to
be Euclidean volume on these sets.
For each suppose there is a discretization of into a countable number of
subsets with the following properties: and diam as So, if then For
example, the could be equal volume rectangles in
Further, we assume that as for every
This will hold whenever is continuous everywhere and
converges nicely to as
Let denote a point in such that
whenever and put So is a discretized
version of We will call this a regular discretization
of The discretized prior on is and the
discretized posterior is
The loss function for the discretized problem is defined as in Theorem 2 by
(17)
and let denote a Bayes rule for this
problem.
Theorem 6. Suppose that is positive and
continuous and we have a regular discretization of Further suppose
that is the unique maximizer of and for
any
Then, there exists as such
that a Bayes rule under the loss
converges to as for all
Theorem 6 says that is a limit of Bayes rules. So,
when we have the result that the MLE is a limit of Bayes
rules and more generally the MLE from an integrated likelihood is a limit of
Bayes rules. The regularity conditions stated in Theorem 6 hold in many common
statistical problems.
Now let be the relative belief estimate from the
discretized problem, i.e., maximizes as a function of The
following is immediate from the proof of Theorem 6, Theorem 5 and Corollary 4.
Corollary 7. is admissible and
Bayesian unbiased for the discretized problem and as for every
By similar arguments an analog of Theorem 6 for can be
established. Actually, in this case, a simpler development can be followed in
certain situations using the loss function . For this note that the posterior risk of in the
discretized problem is given by for some Now
suppose is a cube centered at of edge length
Suppose further that for each there exists
such that, when then Since is constant we have that a Bayes rule must then satisfy . This proves that is a limit of
Bayes rules. By contrast, for the loss the posterior risk of
is given by
and the first term is generally unbounded unless is compact.
Consider an important example.
Example 2. Regression
Suppose that where is fixed of
rank and We will
assume that is known to simplify the discussion but this is not
necessary. Let be a prior density for For every having
observed then the MLE
of
It is interesting to contrast this result with what might be considered more
standard Bayesian estimates such as MAP or the posterior mean. For example,
suppose that Then the posterior distribution
of is where
and note Writing the spectral decomposition
of as we have that
Since and for
each this implies that shrinks the MLE towards the prior
mean When the columns of are orthonormal, then where and so the shrinkage is
substantial unless is much larger than This shrinkage
is often cited as a positive attribute of these estimates. Consider, however,
the situation where the true value of is some distance from the mean.
In that case it seems wrong to move towards the prior mean and so it
is not clear that shrinking the MLE is necessarily a good thing, particularly
as this requires giving up invariance.
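The shrinkage effect is easy to see in the orthonormal case. The sketch below assumes a normal prior on the coefficient vector with mean `beta0` and covariance `tau2` times the identity, together with a known error variance `sigma2`. These names and the small design matrix are hypothetical. Under these assumptions the posterior mean is a weighted average of the least-squares estimate and the prior mean, with weight `tau2 / (tau2 + sigma2)` on the least-squares estimate.

```python
from typing import List


def least_squares_orthonormal(X: List[List[float]], y: List[float]) -> List[float]:
    """Least-squares estimate b = X'y, valid when the columns of X are orthonormal."""
    k = len(X[0])
    return [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(k)]


def posterior_mean_orthonormal(b: List[float], beta0: List[float],
                               sigma2: float, tau2: float) -> List[float]:
    """Posterior mean under the assumed N(beta0, tau2 * I) prior and X'X = I."""
    w = tau2 / (tau2 + sigma2)   # weight placed on the data-based estimate
    return [w * bj + (1.0 - w) * b0j for bj, b0j in zip(b, beta0)]


# Hypothetical design with orthonormal columns and a response vector.
X = [[0.5, 0.5], [0.5, -0.5], [0.5, 0.5], [0.5, -0.5]]
y = [3.0, 1.0, 4.0, 2.0]
beta0 = [0.0, 0.0]

b = least_squares_orthonormal(X, y)
shrunk = posterior_mean_orthonormal(b, beta0, sigma2=1.0, tau2=1.0)
nearly_mle = posterior_mean_orthonormal(b, beta0, sigma2=1.0, tau2=100.0)

print("least-squares estimate (also the MLE):", b)
print("posterior mean, tau2 = 1:", shrunk)
print("posterior mean, tau2 = 100:", nearly_mle)
```

With equal prior and error variances the posterior mean sits halfway between the prior mean and the MLE, and it approaches the MLE only when the prior variance is much larger than the error variance.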
Suppose it is required to estimate the mean response at for the predictors. The prior distribution of
is and the posterior
distribution is Note that
since for each Therefore,
maximizing the ratio of the posterior to prior densities leads to
(18)
Then implies Note that when is much smaller
than in other words the posterior is much more
concentrated than the prior, then and are very
similar. In general is not equal to the plug-in
MLE of although it is the MLE from the integrated likelihood,
as and when
has orthonormal columns
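When both the prior and the posterior of a scalar quantity are normal, the maximizer of the posterior-to-prior density ratio has a closed form, because the log ratio is a concave quadratic whenever the posterior variance is smaller than the prior variance. The sketch below works under that normality assumption. The function names and numerical values are hypothetical and are not claimed to reproduce (18) term by term.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def log_rb(psi, mu_prior, var_prior, mu_post, var_post):
    """Log of the posterior density over the prior density at psi."""
    return (log_normal_pdf(psi, mu_post, var_post)
            - log_normal_pdf(psi, mu_prior, var_prior))

def rb_estimate_normal(mu_prior, var_prior, mu_post, var_post):
    """Maximizer of the normal posterior / normal prior density ratio.

    Setting the derivative of the log ratio to zero gives
    psi = mu_post + var_post * (mu_post - mu_prior) / (var_prior - var_post).
    """
    if not var_post < var_prior:
        raise ValueError("the ratio has a maximum only if var_post < var_prior")
    return mu_post + var_post * (mu_post - mu_prior) / (var_prior - var_post)

# Illustrative values (assumed).
mu_prior, var_prior = 0.0, 4.0
mu_post, var_post = 2.0, 1.0

psi_rb = rb_estimate_normal(mu_prior, var_prior, mu_post, var_post)

# Brute-force check of the closed form on a fine grid.
grid = [i / 1000.0 for i in range(0, 5001)]
psi_grid = max(grid, key=lambda p: log_rb(p, mu_prior, var_prior, mu_post, var_post))

# A much more concentrated posterior gives an estimate close to the posterior mean.
psi_rb_concentrated = rb_estimate_normal(mu_prior, var_prior, mu_post, 0.01)
```

The closed form shows that the estimate lies on the far side of the posterior mean from the prior mean, and that the two nearly coincide once the posterior variance is small relative to the prior variance.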
Suppose it is required to predict a response at the predictor value When the prior distribution of is
and the
posterior distribution is where
To obtain it is necessary to maximize the ratio of the posterior
to the prior densities of and this leads to
(19)
Note that and so and is further from the prior mean than Also, we see that, when is small
then and are very similar. Finally, comparing
(18) and (19) we have that
and so at is more dispersed than the estimate
of the mean at and this makes good sense as we have to take into account
the additional variation due to prediction. By contrast
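The comparison between estimating the mean response and predicting a new response can be illustrated in the same normal setting. The sketch assumes a design with orthonormal columns, a N(beta0, tau2 * I) prior and a known error variance `sigma2`, so that the predictive variances are the corresponding variances of the mean response plus `sigma2`. The vector `w`, the value of `b` and all other names are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

sigma2, tau2 = 1.0, 4.0      # assumed error and prior variances
beta0 = [0.0, 0.0]           # assumed prior mean
b = [1.5, -0.5]              # least-squares estimate, taken as given here
w = [1.0, 2.0]               # predictor value of interest

# Orthonormal columns: the posterior mean is a convex combination and the
# posterior covariance is (sigma2 * r) times the identity.
r = tau2 / (tau2 + sigma2)
beta_post = [r * bj + (1.0 - r) * b0j for bj, b0j in zip(b, beta0)]

m_prior, m_post = dot(w, beta0), dot(w, beta_post)
v_prior = tau2 * dot(w, w)           # prior variance of the mean response w'beta
v_post = sigma2 * r * dot(w, w)      # posterior variance of w'beta

# Adding sigma2 to both variances leaves their difference unchanged,
# so both maximizers share the same factor `gap`.
gap = (m_post - m_prior) / (v_prior - v_post)
mean_est = m_post + v_post * gap               # estimate of the mean response
pred_est = m_post + (sigma2 + v_post) * gap    # prediction of a new response
```

In this illustration `mean_est` coincides with the plug-in value `dot(w, b)`, and `pred_est` lies further from the prior mean than `mean_est` by `sigma2 * gap`, which reflects the extra predictive variation.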
5 Credible regions and hypothesis assessment
First recall that a -relative belief credible region for is given by where There is some arbitrariness in the
choice of the greater than or equal sign to define the credible region as it
also could have been defined as where In this latter case
is the -th quantile of the posterior distribution
of the relative belief ratio. This definition has some advantages as using
this implies that the plausible region satisfies where Also, the strength of the
evidence concerning the hypothesis satisfies
where
The point here is that relative belief credible regions are closely related
to both the plausible region and the strength calculation. As such, any
decision-theoretic interpretation for relative belief credible regions also
applies to the plausible region and the strength of the evidence. Throughout
this section we will, however, retain
the definition for provided in Section 2.3.
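For a finite parameter set these three objects are easy to compute side by side. The sketch below assumes the usual forms from the relative belief literature: the credible region is the smallest level set of the relative belief ratio with posterior content at least gamma, the plausible region collects the values with ratio greater than 1, and the strength is the posterior probability of a ratio no larger than that of the hypothesized value. The function names and the toy distributions are hypothetical.

```python
def relative_belief_ratios(prior, post):
    """Posterior over prior probability for each value of the parameter."""
    return {psi: post[psi] / prior[psi] for psi in prior}

def credible_region(prior, post, gamma):
    """Smallest level set {psi : RB(psi) >= c} with posterior content >= gamma."""
    rb = relative_belief_ratios(prior, post)
    for c in sorted(set(rb.values()), reverse=True):
        region = {psi for psi in rb if rb[psi] >= c}
        if sum(post[psi] for psi in region) >= gamma:
            return region, c
    raise ValueError("no level set reaches posterior content gamma")

def plausible_region(prior, post):
    """Values whose probability has increased from prior to posterior."""
    rb = relative_belief_ratios(prior, post)
    return {psi for psi in rb if rb[psi] > 1.0}

def strength(prior, post, psi0):
    """Posterior probability that the ratio is no larger than its value at psi0."""
    rb = relative_belief_ratios(prior, post)
    return sum(post[psi] for psi in rb if rb[psi] <= rb[psi0])

# Toy prior and posterior on four values (assumed for illustration).
prior = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
post = {"a": 0.2, "b": 0.5, "c": 0.25, "d": 0.05}

region, cutoff = credible_region(prior, post, gamma=0.7)
plausible = plausible_region(prior, post)
strength_c = strength(prior, post, "c")
```

Because all three quantities are functions of the same ordering induced by the relative belief ratio, an interpretation attached to the credible region carries over to the other two.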
Now consider the lowest posterior loss -credible regions that arise
from the prior-based loss functions considered here.
Theorem 8. Suppose that for every
where is finite with equal
to counting measure. Then is a -lowest posterior
loss credible region for the loss function
Proof: From (3) and (6) the -lowest
posterior loss credible region is
and
As is independent of it
is clearly equivalent to define this region via namely,
Now consider the case where is countable and we use loss function
Following the proof of Theorem 8 we see that a -lowest
posterior loss region takes the form
where
Theorem 9. Suppose that for every
that is countable with equal to counting
measure. For the loss function then whenever is
such that and
whenever and
While Theorem 9 does not establish exact convergence, it is likely that
convergence does hold under quite general circumstances due to the discreteness.
Theorem 9 does show that limit points of the class of sets always contain and their posterior probability
content differs from by at most where
is the next largest value for which we have exact content.
Now consider the continuous case with a regular discretization. For namely, is a subset of a discretized version
of define the undiscretized version of to
be Now let be the -relative belief region for the
discretized problem and let be its undiscretized
version. Note that in a continuous context we will consider two sets as equal
if they differ only by a set of measure 0 with respect to The
following result says that a -relative belief credible region for the
discretized problem, after undiscretizing, converges to the -relative
belief region for the original problem.
Theorem 10. Suppose that is positive and
continuous, there is a regular discretization of and has a continuous posterior distribution. Then
While Theorem 10 is of interest in its own right, it can also be used
to prove that relative belief regions are limits of lowest posterior loss regions.
Let be the -lowest posterior loss
region obtained for the discretized problem using loss function (17)
and be the undiscretized version.
Theorem 11. Suppose that is positive and
continuous, we have a regular discretization of and has a continuous posterior distribution. Then
In Evans, Guttman, and Swartz (2006) and Evans and Shakhatreh (2008)
additional properties of relative belief regions are developed. For example,
it is proved that a -relative belief region for
satisfying minimizes
among all (measurable) subsets of satisfying So a -relative belief region is smallest
among all -credible regions for where size is measured using
the prior measure. This property has several consequences. For example, the
prior probability that a region contains a false
value from the prior is given by where a false value is a value of
generated independently of It can be proved that a -relative belief region
minimizes this probability among all -credible regions for and
is always unbiased in the sense that the probability of covering a false value
is bounded above by Furthermore, a -relative belief region
maximizes the relative belief ratio and
the Bayes factor among all regions with
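The minimal prior content property can be verified by brute force on a small finite parameter set. The sketch below builds a relative belief region by adding values in decreasing order of the relative belief ratio and then enumerates every subset with at least the same posterior content. The distributions and the value of `gamma` are assumptions chosen so that no two ratios are tied; ties would require adding whole level sets at once.

```python
from itertools import combinations

# Toy prior and posterior on six values (assumed for illustration).
prior = [0.30, 0.25, 0.20, 0.15, 0.07, 0.03]
post = [0.10, 0.35, 0.30, 0.06, 0.14, 0.05]
gamma = 0.8

rb = [q / p for q, p in zip(post, prior)]

def content(measure, region):
    """Total mass that `measure` assigns to the indices in `region`."""
    return sum(measure[i] for i in region)

# Relative belief region: add values in decreasing RB order until the
# posterior content reaches gamma.
order = sorted(range(len(rb)), key=lambda i: rb[i], reverse=True)
region = []
for i in order:
    region.append(i)
    if content(post, region) >= gamma:
        break

# Every subset whose posterior content is at least that of the region.
target = content(post, region)
tol = 1e-12
competitors = [
    subset
    for size in range(len(prior) + 1)
    for subset in combinations(range(len(prior)), size)
    if content(post, subset) >= target - tol
]

smallest_prior_content = min(content(prior, s) for s in competitors)
rb_prior_content = content(prior, region)
```

In this example no competing subset has smaller prior content than the relative belief region, in line with the optimality property stated above.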
While the results in this section have been concerned with obtaining credible
regions for parameters, similar results can be proved for the construction of
prediction regions.
6 Conclusions
Relative belief inferences are closely related to likelihood inferences. This,
together with their invariance and optimality properties, makes them prime
candidates as appropriate inferences in Bayesian contexts. This paper has
shown that relative belief inferences arise naturally in a decision-theoretic
formulation using loss functions based on the prior.
Appendix
Proof of Theorem 2 and Corollary 3: We have that
(20)
The first term in (20) is bounded above by and does not
depend on so the value of a Bayes rule at is obtained by
finding that maximizes the second term. Note that
(21)
There are at most finitely many values of satisfying and so assumes a maximum on this set,
say at and when If then This proves that, for all the maximizer of (21) is given by and the results are established.
Proof of Theorem 5: The prior risk of is given by
and
Therefore, is Bayesian unbiased if and only if
(22)
This inequality holds when because is the density of
with respect to and which implies that the maximum of this
density is greater than or equal to 1.
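The last step rests on a simple averaging argument: a density with respect to a probability measure integrates to 1 under that measure, so its maximum cannot be below 1. The sketch below checks this for the relative belief ratio in the discrete case, using randomly generated distributions as stand-ins for a prior and a posterior. The helper name, the seed and the sample sizes are arbitrary assumptions.

```python
import random

def random_distribution(rng, size):
    """Random probability vector with strictly positive entries."""
    weights = [rng.random() + 1e-6 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
max_ratios = []
for _ in range(1000):
    prior = random_distribution(rng, 5)
    post = random_distribution(rng, 5)
    rb = [q / p for q, p in zip(post, prior)]

    # The prior-weighted average of the ratio is the total posterior mass, 1.
    average = sum(p * r for p, r in zip(prior, rb))
    assert abs(average - 1.0) < 1e-9

    max_ratios.append(max(rb))

# Since the average is 1, the maximum ratio is never below 1.
smallest_max = min(max_ratios)
```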
Proof of Theorem 6 and Corollary 7: Just as in Theorem 2, a
Bayes rule maximizes for Furthermore, as in Theorem 2, such a rule exists. Now
define so that and note that as We have that, as
(23)
Let Let be such that diam for all Then for
and any satisfying we have
By (24), (25) and (26) this implies that
and the
convergence is established.
Now and so by
(24), (25) and (26) this implies that and the convergence of to is established.
Proof of Theorem 9: For let and Note that as
Suppose is such that Then
for all and so This implies that and since this implies that
Now suppose is such that Then there
exists such that for all we have Since when then Then choosing for
implies that
Proof of Theorem 10: Let and Recall that
for every If there exists
such that for all then and this implies that Now and so (after possibly deleting a set of -measure 0
from If
then for infinitely many
which implies that and
therefore This proves (up to a set of -measure 0) so that
for any
Let so
and
(27)
Since then and as
Now consider the second term in (27). Since
has a continuous posterior distribution, is continuous in Let
and note that for all small enough, and which implies that and
therefore As
or
then
For all small, then is bounded
above by
and this upper bound converges to as Since
is arbitrary this implies that the second term in (27) goes
to 0 as and this proves the result.
Proof of Theorem 11: Suppose, without loss of generality
that Let and satisfy Put
and note that By Theorem 10 and as so and as This implies that
there is a such that for all then Therefore, by Theorem 9, we have that for all
References
[1] Al-Labadi, L., Alzaatreh, A. and Evans, M. (2024). How to measure evidence and its strength: Bayes factors or relative belief ratios? arXiv:2301.08994.
[2] Al-Labadi, L. and Evans, M. (2017). Optimal robustness results for some Bayesian procedures and the relationship to prior-data conflict. Bayesian Analysis, 12(3), 702–728.
[3] Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer.
[4] Bernardo, J. M. (2005). Intrinsic credible regions: an objective Bayesian approach to interval estimation. Test, 14(2), 317–384. With comments and a rejoinder by the author.
[5] Bernardo, J. M. and Smith, A. F. M. (2000). Bayesian Theory. Wiley Series in Probability and Statistics. John Wiley & Sons Ltd., New York. Paperback.
[6] Birnbaum, A. (1962). On the foundations of statistical inference (with discussion). Journal of the American Statistical Association, 57(298), 269–326.
[7] Evans, M. (2015). Measuring Statistical Evidence Using Relative Belief. Chapman & Hall/CRC Monographs on Statistics & Applied Probability.
[8] Evans, M. (2024). The concept of statistical evidence: historical roots and current developments. arXiv:2406.05843.
[9] Evans, M. and Guo, Y. (2021). Measuring and controlling bias for some Bayesian inferences and the relation to frequentist criteria. Entropy, 23(2), 190. doi:10.3390/e23020190.
[10] Evans, M. J., Guttman, I. and Swartz, T. (2006). Optimality and computations for relative surprise inferences. Canad. J. Statist., 34(1), 113–129.
[11] Evans, M. and Jang, G-H. (2011). Weak informativity and the information in one prior relative to another. Statistical Science, 26(3), 423–439.
[12] Evans, M. and Moshonov, H. (2006). Checking for prior-data conflict. Bayesian Analysis, 1(4), 893–914.
[13] Evans, M. and Shakhatreh, M. (2008). Optimal properties of some Bayesian inferences. Electron. J. Stat., 2, 1268–1280.
[14] Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes' estimates. Univ. California Publ. Statist., 1, 277–329.
[15] Nott, D., Wang, X., Evans, M. and Englert, B-G. (2020). Checking for prior-data conflict using prior to posterior divergences. Statistical Science, 35(2), 234–253.
[16] Robert, C. P. (1996). Intrinsic losses. Theory and Decision, 40, 191–214.
[17] Royall, R. M. (1997). Statistical Evidence: A likelihood paradigm. Chapman & Hall.
[18] Rudin, W. (1974). Real and Complex Analysis. McGraw Hill, New York.
[19] Savage, L. J. (1971). The Foundations of Statistics. Dover Publications.