Anytime-Valid Tests of Group Invariance through Conformal Prediction

Tyron Lardy¹, Muriel Felipe Pérez-Ortiz²

Declarations of interest: none (both authors).

Abstract

We develop anytime-valid tests of invariance under the action of compact groups. The resulting test statistics are optimal in a logarithmic-growth sense. We apply our method to extend recent anytime-valid tests of independence and to construct tests of normality.

Keywords: Anytime-Validity, Hypothesis Test, Group Invariance, Conformal Martingale

Journal: Statistics & Probability Letters

¹ Mathematical Institute, Leiden University, Einsteinweg 55, Leiden, The Netherlands

² Department of Mathematics and Computer Science, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, The Netherlands

1 Introduction

Suppose that we observe data $X_{1},X_{2},\dots$ sequentially, and that they take values in a probability space $\mathcal{X}$ . We are interested in testing the null hypothesis of invariance of the data under the action of a sequence of groups $(G_{n})_{n\in\mathbb{N}}$ , that is,

\mathcal{H}_{0}:gX^{n}\stackrel{{\scriptstyle\mathcal{D}}}{{=}}X^{n}\quad\text{for all }g\in G_{n}\text{ and all }n\in\mathbb{N},

(1)

where $\stackrel{{\scriptstyle\mathcal{D}}}{{=}}$ signifies equality in distribution and, for each $n\in\mathbb{N}$ , $G_{n}$ is a group of transformations of the data—a compact topological group that acts continuously on $\mathcal{X}^{n}$ . In order to avoid pathological situations additional structure is needed (see Section 2). Note that the observations are not assumed to be independent nor identically distributed; they are only assumed to be sampled from a distribution on infinite sequences. Prominent examples of (1) include testing for exchangeability (Vovk et al., 2005; Ramdas et al., 2022), in which case $G_{n}$ is the group of permutations on $n$ elements; and testing for sphericity, in which case $G_{n}$ is the orthogonal group $\mathrm{O}(n)$ . The latter can be used to test the Gaussian-error assumption in linear regression, as we will see. We construct anytime-valid tests for $\mathcal{H}_{0}$ that monitor test martingales and randomized versions thereof.

A sequence of statistics of the data is a test martingale if it is nonnegative, starts at one, and is a martingale under every element of $\mathcal{H}_{0}$ . Formally, if $\mathbb{G}=(\mathcal{G}_{n})_{n\in\mathbb{N}}$ is a filtration of $\sigma$ -algebras such that $\mathcal{G}_{n}\subseteq\sigma(X^{n})$ , then a sequence of statistics $(M_{n})_{n\in\mathbb{N}}$ is a test martingale for $\mathcal{H}_{0}$ with respect to $\mathbb{G}$ if $\mathbf{E}_{Q}\left[M_{n}\mid\mathcal{G}_{n-1}\right]=M_{n-1}$ for all $Q\in\mathcal{H}_{0}$ and $M_{0}=1$ . Test martingales are the central objects for anytime-valid inference (Ramdas et al., 2023). A sequential test $\phi_{n}=\mathbf{1}\left\{M_{n}\geq\frac{1}{\alpha}\right\}$ can be built using the threshold $1/\alpha$ and the resulting test $\phi_{n}$ is anytime valid in the sense that it enjoys the time-uniform type-I error guarantee that $Q(\exists n\in\mathbb{N}:\phi_{n}=1)\leq\alpha$ for any $Q\in\mathcal{H}_{0}$ . Furthermore, it will also be useful to consider randomized test martingales, for which we append an independent random number $\theta_{n}\sim\text{Uniform}([0,1])$ to each $X_{n}$ .

We build test martingales using tools from conformal prediction, which is perhaps best known as a framework for uncertainty quantification for point predictors (see Vovk et al. (2005) and Section 3). The test martingales constructed using these methods are known as conformal martingales. Most crucially for our present purposes, Vovk et al. (2005) show that conformal martingales can be used to test whether data are generated by a specific class of generating mechanisms, called online compression models. The key insight of this note is that group-invariant models can be regarded as online compression models. Therefore, conformal martingales can be built using the conformal-prediction machinery to test the hypothesis of invariance in (1).

We further show that the resulting test martingales are optimal in a specific logarithmic sense. The rationale behind this criterion is that, under the alternative distribution, a “good” test martingale should no longer be a martingale and it should grow large to gather evidence against the null. Formally, given a specific alternative $P$ of interest and a stopping time $\tau$ for the experiment, the relevant test statistic is $M_{\tau}$, the value of a test martingale $(M_{n})_{n\in\mathbb{N}}$ at $\tau$. For example, $\tau$ could be the first hitting time for the threshold $1/\alpha$ as before. A common measure to judge the expected growth of $(M_{n})_{n\in\mathbb{N}}$ at $\tau$ is $\mathbf{E}_{P}\left[\log M_{\tau}\right]$, and test martingales that maximize this growth rate are referred to as log-optimal (see Koolen and Grünwald, 2022; Ramdas et al., 2023).

In Section 2, we give additional structure to the problem of sequential group invariance, give the necessary background on online compression models, and we show that sequential group-invariant models define online compression models. Section 3 shows our construction of test martingales using the online-compression structure and conformal prediction. Section 4 shows that our construction is log-optimal against certain alternatives. Section 5 applies the results to test for independence and to test rotational invariance, and Section 6 discusses the limitations of invariance testing in the context of online compression models and modifications of the constructions that are used.

All proofs can be found in Appendix A of the supplementary material.

2 Sequential group actions and online compression models

The hypothesis in (1) is only meaningful if the statements regarding group invariance for each $n$ are consistent with each other; without any further restrictions, invariance of the data at one time may contradict the invariance of the data at a later time. We assume that the groups $(G_{n})_{n\in\mathbb{N}}$ act sequentially on the data to avoid such situations.

Definition 1 (Sequential group action)

The action of the sequence of groups $(G_{n})_{n\in\mathbb{N}}$ on $(\mathcal{X}^{n})_{n\in\mathbb{N}}$ is sequential if the following conditions hold.

(i)

The sequence $(G_{n})_{n\in\mathbb{N}}$ is ordered by inclusion: for each $n$ , there is an inclusion map $\imath_{n+1}:G_{n}\to G_{n+1}$ such that $\imath_{n+1}$ is a continuous group isomorphism between $G_{n}$ and its image, and the image of $G_{n}$ under $\imath_{n+1}$ is closed in $G_{n+1}$ .
(ii)

For all $g_{n}\in G_{n}$ and all $x^{n+1}\in\mathcal{X}^{n+1}$, $\mathrm{proj}_{\mathcal{X}^{n}}(\imath_{n+1}(g_{n})x^{n+1})=g_{n}(\mathrm{proj}_{\mathcal{X}^{n}}(x^{n+1}))$, where $\mathrm{proj}_{\mathcal{X}^{n}}$ is the canonical projection map $\mathrm{proj}_{\mathcal{X}^{n}}:\mathcal{X}^{n+1}\to\mathcal{X}^{n}$.
(iii)

Let $n\geq 1$ , $g_{n}\in G_{n}$ , and $g_{n+1}\in G_{n+1}$ . For $x^{n+1}=(x_{1},\dots,x_{n+1})\in\mathcal{X}^{n+1}$ , denote $(x^{n+1})_{n+1}=x_{n+1}$ . Then, $g_{n+1}=\imath_{n+1}(g_{n})$ if and only if, for all $x^{n+1}\in\mathcal{X}^{n+1},$ $(g_{n+1}x^{n+1})_{n+1}=x_{n+1}.$

Here, item (i) gives an ordering of the sequence of groups by inclusion, (ii) ensures that this inclusion does not change the action of the groups on past data, and (iii) implies that the groups do not act on “future” data. As a result, invariance of $X^{n-1}$ under $G_{n-1}$ is implied by invariance of $X^{n}$ under $G_{n}$ and the individual statements of invariance in (1) for each $n$ do not contradict each other. The simplest example of a sequential action is when each $G_{n}=G_{n-1}\times H_{n}$ for some group $H_{n}$ and $G_{n}$ acts on $\mathcal{X}^{n}$ componentwise, i.e., $g_{n}X^{n}=(h_{1}X_{1},\dots,h_{n}X_{n})$ . This setting is, among others, discussed in detail by Koning (2023). A more complicated example of a sequential action (testing for sphericity) is given in Section 5.2.
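Conditions (ii) and (iii) of Definition 1 can be checked by brute force for the permutation groups, with the inclusion that lets a permutation of $n$ elements fix the $(n+1)$-th coordinate. The following Python sketch is an illustration only: the helper names `act`, `embed` and `check_sequential` are hypothetical, the index convention for the action is one of several possible choices, and condition (iii) is checked on a single test point with distinct entries rather than on all of $\mathcal{X}^{n+1}$.

```python
from itertools import permutations


def act(perm, x):
    """Apply a permutation (tuple of indices) to a tuple x: (g x)_i = x[perm[i]]."""
    return tuple(x[j] for j in perm)


def embed(perm):
    """Inclusion of the permutations of n elements into those of n + 1 elements:
    the embedded permutation fixes the last coordinate."""
    return perm + (len(perm),)


def check_sequential(n, x):
    """Check conditions (ii) and (iii) of Definition 1 on one point x of length n + 1.

    x is assumed to have distinct entries, so that fixing the value of the last
    coordinate is the same as fixing its index.
    """
    small = list(permutations(range(n)))
    image = {embed(g) for g in small}
    # (ii): acting with the embedded element and then projecting onto the first
    # n coordinates equals projecting first and then acting.
    cond_ii = all(act(embed(g), x)[:n] == act(g, x[:n]) for g in small)
    # (iii): an element of the larger group leaves the last coordinate
    # untouched exactly when it lies in the image of the inclusion.
    cond_iii = all(
        (act(g, x)[n] == x[n]) == (g in image)
        for g in permutations(range(n + 1))
    )
    return cond_ii, cond_iii


print(check_sequential(3, (0.3, 1.2, -0.7, 2.5)))
```

Both checks enumerate the full group, so the sketch is only practical for small $n$.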

Under the assumption that the group action is sequential, we show that the null hypothesis of invariance is an online compression model. Online compression models are models for computing online summaries, or compressed representations, of the observed data. When the data are generated by an online compression model, the techniques developed for conformal prediction can be used to construct a sequence of i.i.d. uniform statistics, which in turn give rise to a test martingale, as we discuss in Section 3. Vovk et al. (2005) define online compression models in abstract terms; we use a simplified definition here.

Definition 2 (Online compression model)

An online compression model on $\mathcal{X}$ is a 3-tuple of sequences $M=((\sigma_{n})_{n\in\mathbb{N}},(F_{n})_{n\in\mathbb{N}},(Q_{n})_{n\in\mathbb{N}})$, where:

1.

$(\sigma_{n})_{n\in\mathbb{N}}$ is a sequence of statistics $\sigma_{n}=\sigma_{n}(X^{n})$ ; we call $\sigma_{n}$ a summary of $X^{n}$ ,
2.

$(F_{n})_{n\in\mathbb{N}}$ is a sequence of functions such that $F_{n}(\sigma_{n-1},X_{n})=\sigma_{n}$ ,
3.

$(Q_{n})_{n\in\mathbb{N}}$ is a sequence of conditional distributions for $(\sigma_{n-1},X_{n})$ given $\sigma_{n}$ .

To show how sequential group invariance defines an online compression model, we first recall some group theory. First, the orbit $G_{n}X^{n}$ of $X^{n}$ under the action of $G_{n}$ is the set of all values that are reached by the action of $G_{n}$ on $X^{n}$, i.e., $G_{n}X^{n}=\{gX^{n}:g\in G_{n}\}$. In order to identify each orbit, we pick a single element of $\mathcal{X}^{n}$ in each orbit (an orbit representative) and consider the map $\gamma_{n}:\mathcal{X}^{n}\to\mathcal{X}^{n}$ that takes each $X^{n}$ to its orbit representative. We call $\gamma_{n}$ an orbit selector, and we assume that it is measurable. Such measurable orbit selectors are known to exist under weak regularity conditions on $\mathcal{X}^{n}$ and $G_{n}$ (see Bondar, 1976, Theorem 2) that hold in all the examples of this work. Furthermore, because $G_{n}$ is a compact group, there exists a unique $G_{n}$-invariant probability distribution $\mu_{n}$, called the Haar measure (Bourbaki and Berberian, 2004, Chapter VII). The Haar measure plays the role of a uniform distribution on groups. Finally, it is a well-known fact that $X^{n}\mid\gamma_{n}(X^{n})\stackrel{{\scriptstyle\mathcal{D}}}{{=}}U\gamma_{n}(X^{n})\mid\gamma_{n}(X^{n})$, where $U\sim\mu_{n}$ independently of $X$ (Eaton, 1989, Theorem 4.4).

Together with the following proposition, these properties show that the sequential group invariance structure considered here defines an online compression model with $\sigma_{n}=\gamma_{n}(X^{n})$ .

Proposition 1

There exists a sequence $(F_{n})_{n\in\mathbb{N}}$ of measurable functions $F_{n}:\mathcal{X}^{n-1}\times\mathcal{X}\to\mathcal{X}^{n}$ such that $F_{n}(\gamma_{n-1}(X^{n-1}),X_{n})=\gamma_{n}(X^{n})$ , and $F_{n}(\ \cdot\ ,X_{n})$ is a one-to-one function of $\gamma_{n-1}(X^{n-1})$ .

Corollary 1

The tuple $((\gamma_{n}(X^{n}))_{n\in\mathbb{N}},(F_{n})_{n\in\mathbb{N}},(\tilde{\mu}_{n})_{n\in\mathbb{N}})$, where $\tilde{\mu}_{n}$ is the uniform distribution on $G_{n}X^{n}$ induced by the Haar measure $\mu_{n}$ on $G_{n}$, defines an online compression model on $\mathcal{X}$.
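For exchangeability, the online compression structure is easy to make concrete. The sketch below assumes real-valued data and takes the sorted data as orbit selector; this is one convenient choice of orbit representative, not the only one. The update function then inserts the new observation into the previous summary, as in Proposition 1. The function names `gamma`, `F` and `summaries` are hypothetical.

```python
import bisect


def gamma(xs):
    """Orbit selector for the permutation group: the data sorted in increasing order."""
    return tuple(sorted(xs))


def F(sigma_prev, x_new):
    """Update function F_n: insert the new observation into the previous summary."""
    sigma = list(sigma_prev)
    bisect.insort(sigma, x_new)
    return tuple(sigma)


def summaries(xs):
    """Compute the summaries sigma_1, ..., sigma_n online, one observation at a time."""
    sigma = ()
    out = []
    for x in xs:
        sigma = F(sigma, x)
        out.append(sigma)
    return out


data = [3.0, 1.0, 2.0, 1.5]
# The online summary agrees with the orbit selector applied to all data so far.
print(all(s == gamma(data[: i + 1]) for i, s in enumerate(summaries(data))))
```

The summary at time $n$ only requires the previous summary and the new observation, which is the defining feature of an online compression model.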

3 Testing group invariance with conformal martingales

The goal of this section is the construction of test martingales for the null hypothesis of distributional symmetry in (1) any time that a sequence of groups $(G_{n})_{n\in\mathbb{N}}$ acts sequentially on the data $(X_{n})_{n\in\mathbb{N}}$ . To this end, the invariant structure of the null hypothesis $\mathcal{H}_{0}$ is used in tandem with conformal prediction to build a sequence of independent random variables $(R_{n})_{n\in\mathbb{N}}$ with the following three properties:

1.

The sequence $(R_{n})_{n\in\mathbb{N}}$ is adapted to the data sequence with external randomization $(X_{n},\theta_{n})_{n\in\mathbb{N}}$, that is, for each $n\in\mathbb{N}$, $R_{n}=R_{n}(X^{n},\theta_{n})$.
2.

Under any element of the null hypothesis $\mathcal{H}_{0}$ from (1), $(R_{n})_{n\in\mathbb{N}}$ is a sequence of independent and identically distributed $\mathrm{Uniform}([0,1])$ random variables.
3.

The distribution of $(R_{n})_{n\in\mathbb{N}}$ is not uniform when departures from symmetry are present in the data.

The construction of these random variables is the subject of Section 3.1, which introduces the additional definitions that are needed. In order to guide intuition, Example 1 shows a first example for testing exchangeability. Given their uniform distributions, the statistics $R_{1},R_{2},\dots$ have previously been called p-values (e.g., Vovk et al., 2005; Fedorova et al., 2012). We opt against that terminology here, because typically only small p-values are interpreted as evidence against the null hypothesis. In this context, however, any deviation from uniformity is interpreted as evidence against the null hypothesis.

Once the sequence $(R_{n})_{n\in\mathbb{N}}$ has been built, test martingales against distributional invariance can be constructed by testing the uniformity of $(R_{n})_{n\in\mathbb{N}}$. Indeed, any time that $(f_{n})_{n\in\mathbb{N}}$ is a sequence of nonnegative functions $f_{i}:[0,1]\to\mathbb{R}$ such that $\int f_{i}(r)\mathrm{d}r=1$, the process $(M_{n})_{n\in\mathbb{N}}$ given by

M_{n}:=\prod_{i\leq n}f_{i}(R_{i})

(2)

is a test martingale for $\mathcal{H}_{0}$ with respect to $\mathbb{F}$, where $\mathbb{F}=(\sigma(R^{n}))_{n\in\mathbb{N}}$. This follows from the fact that $\mathbf{E}_{Q}\left[M_{n}\mid\sigma(R^{n-1})\right]=M_{n-1}\cdot\int f_{n}(r)\mathrm{d}r=M_{n-1}$, where we leverage independence and uniformity. The functions $(f_{n})_{n\in\mathbb{N}}$ are known as calibrators (Vovk and Wang, 2021). They can be taken to be any sequence of predictable estimators of the distribution of $R_{1},R_{2},\dots$ (Fedorova et al., 2012), so that the test martingale is expected to grow if the true distribution of the orbit ranks is not uniform, i.e., the null hypothesis is violated. The optimality of this procedure is discussed in Section 4.
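As an illustration of (2) together with the threshold test at level $\alpha$, the following Python sketch multiplies a fixed calibrator along a sequence of ranks and stops the first time the product reaches $1/\alpha$. The power calibrator $f(r)=\kappa r^{\kappa-1}$ and the names `calibrator` and `run_test` are assumptions made for the example; any density on $[0,1]$ could be used instead. Ranks are assumed to lie in $(0,1]$, since the calibrator is unbounded at zero for $\kappa<1$.

```python
def calibrator(r, kappa=0.5):
    """Density kappa * r**(kappa - 1) on (0, 1]; it integrates to one.

    For kappa < 1 it is decreasing, so it bets on small ranks.
    """
    return kappa * r ** (kappa - 1.0)


def run_test(ranks, alpha=0.05, kappa=0.5):
    """Run the sequential test that rejects when M_n >= 1 / alpha.

    Returns (rejection time, martingale value), with rejection time None
    if the threshold is never reached on the given ranks.
    """
    m = 1.0  # M_0 = 1
    for n, r in enumerate(ranks, start=1):
        m *= calibrator(r, kappa)  # M_n = M_{n-1} * f(R_n)
        if m >= 1.0 / alpha:
            return n, m
    return None, m


# Two very small ranks already push the martingale past 1 / alpha = 20.
print(run_test([0.01, 0.01]))
# Ranks equal to one make the martingale shrink by a factor kappa each step.
print(run_test([1.0] * 5))
```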

Example 1 (Sequential Ranks)

Consider the case of testing exchangeability, that is, the case when $\mathcal{X}=\mathbb{R}$ and each group $G_{n}$ is $G_{n}=n!$, the group of permutations on $n$ elements. Consider, for each $n$, the random variables $\tilde{R}_{n}=\sum_{i\leq n}\mathbf{1}\left\{X_{i}\leq X_{n}\right\}$, the rank of $X_{n}$ among $X_{1},\dots,X_{n}$. It is a classic observation that, when ties occur with probability zero, each $\tilde{R}_{n}$ is uniformly distributed on $\{1,\dots,n\}$, and that $(\tilde{R}_{n})_{n\in\mathbb{N}}$ is a sequence of independent random variables (Rényi, 1962). The random variables $\tilde{R}_{1},\tilde{R}_{2},\dots$ are called sequential ranks (Malov, 1996). After rescaling and adding external randomization, a sequence of random variables $(R_{n})_{n\in\mathbb{N}}$ can be built from $(\tilde{R}_{n})_{n\in\mathbb{N}}$ such that $(R_{n})_{n\in\mathbb{N}}$ satisfies items 1, 2 and 3 at the start of this section. Furthermore, if we denote the uniform measure on $n!$ by $\mu_{n}$, then $\tilde{R}_{n}$ can also be obtained from $n^{-1}\tilde{R}_{n}=\mu_{n}\{g:(gX^{n})_{n}\leq X_{n}\}$. While this rewriting may seem esoteric at this point, it turns out to be the correct point of view for generalization, as there exists an analogue of the uniform probability distribution on every compact group: its Haar measure.
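The rescaling and randomization mentioned in Example 1 can be sketched as follows. The code assumes real-valued data and computes, for each $n$, the number of observations strictly below $X_{n}$ plus $\theta_{n}$ times the number of observations tied with $X_{n}$, divided by $n$. Without ties this equals $(\tilde{R}_{n}-1+\theta_{n})/n$. The function names are hypothetical, and the random number generator is passed in explicitly so that the external randomization is visible.

```python
import random


def smoothed_sequential_rank(xs, theta):
    """Smoothed rank of the last entry of xs among all entries, scaled to [0, 1]."""
    n = len(xs)
    last = xs[-1]
    below = sum(1 for x in xs if x < last)
    ties = sum(1 for x in xs if x == last)  # includes the last entry itself
    return (below + theta * ties) / n


def rank_sequence(data, rng):
    """Smoothed sequential ranks R_1, ..., R_n, drawing a fresh theta_n at each step."""
    return [
        smoothed_sequential_rank(data[:n], rng.random())
        for n in range(1, len(data) + 1)
    ]


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(5)]
print(rank_sequence(data, rng))
```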

3.1 Conformal prediction under invariance

In general, the statistics $R_{n}$ will be designed to measure how strange the observations $X^{n}$ are in contrast to what would be expected under distributional invariance. To this end, the values of $X^{n}$ are compared to those in the orbit of $X^{n}$ under the action of $G_{n}$ . In order to measure the “strangeness” of the observations in their orbit, we use an adaptation of the conformity measures introduced by Vovk et al. (2005).

Definition 3 (Conformity measure of invariance)

We say that $\alpha^{n}=\alpha^{n}(X^{n})$ is a conformity measure of invariance at time $n$ if the following hold:

(i)

$\alpha^{n}=(\alpha_{1},\dots,\alpha_{n})$ , where $\alpha_{i}=A_{n}(X_{i},\gamma_{n}(X^{n}))$ for a function $A_{n}:\mathcal{X}\times\mathcal{X}^{n}\to\mathbb{R}$ .
(ii)

If $\alpha^{n}(X^{n})=\alpha^{n}(X^{\prime n})$ for $X^{n},X^{\prime n}\in\mathcal{X}^{n}$ , then $\alpha^{n}(gX^{n})=\alpha^{n}(gX^{\prime n})$ for all $g\in G_{n}$ .

Item (ii) in Definition 3 is an addition to the definition by Vovk et al. (2005). It ensures that the action of $G_{n}$ on $\mathcal{X}^{n}$ induces an action on the conformity measures, that is, it implies that the action of $G_{n}$ on $\alpha^{n}$ defined by $g\alpha^{n}:=\alpha^{n}(gX^{n})$ is well-defined.

The distribution of the conformity measures under the null hypothesis can be obtained by leveraging the distributional invariance. Indeed, as discussed in Section 2, the distribution of $X^{n}$ conditional on $\gamma_{n}(X^{n})$ is characterized by the Haar measure. Similar to what happened in Example 1, this distribution can be used to rank the observed value of the conformity score $\alpha_{n}$ among all its possible values on the orbit of the data. This idea gives rise to the (smoothed) orbit ranks $(R_{n})_{n\in\mathbb{N}}$ in the next definition.

Definition 4 (Smoothed Orbit Ranks)

Fix $n\in\mathbb{N}$ and let $\alpha^{n}$ be a conformity measure of invariance at time $n$ . We call $R_{n}$ , defined by

R_{n}=\mu_{n}(\{g\in G_{n}:(g\alpha^{n})_{n}<\alpha_{n}\})+\theta_{n}\mu_{n}(\{g\in G_{n}:(g\alpha^{n})_{n}=\alpha_{n}\}),

(3)

a (smoothed) orbit rank, where $\mu_{n}$ denotes the Haar probability measure on $G_{n}$ and the sequence $\theta_{1},\theta_{2},\dots\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mathrm{Uniform}([0,1])$ is independent of $\alpha^{n}$.

Note that, if the distribution of $\alpha^{n}$ conditional on $\gamma_{n}(X^{n})$ is continuous, then the smoothing plays no role in (3), and $R_{n}\perp\theta_{n}$. Furthermore, if $A_{n}$ in item (i) of Definition 3 is chosen properly, a small orbit rank $R_{n}$ indicates that the observed value of $\alpha_{n}$ is strange (nonconforming) compared to the values it would have attained on different elements in the orbit of the data. Alternatively, one can think of $R_{n}$ as the CDF of the distribution of $\alpha_{n}$ conditional on $\gamma_{n}(X^{n})$ evaluated at the data (with added randomization). It follows, as shown in Theorem 1, that each $R_{n}$ is uniformly distributed on $[0,1]$. Vovk et al. (2005, Theorem 11.2) show that, if the data are generated by an online compression model, then $R_{1},R_{2},\dots$ are also independent. Since Corollary 1 shows that a sequential group invariance structure defines an online compression model, it follows that the smoothed orbit ranks form an i.i.d. uniform sequence under the null hypothesis. This is stated in the next theorem, for which we provide a direct proof for completeness in Appendix A of the supplementary material.
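For a finite group, the Haar measure is the uniform distribution on the group, so (3) can be evaluated by enumeration. The sketch below does this for the permutation group and real-valued data, with the sorted data as orbit selector. The function name `orbit_rank` and the default conformity function (the identity in its first argument) are assumptions made for illustration. The enumeration has $n!$ terms and is meant only as a check of the definition for small $n$.

```python
from itertools import permutations


def orbit_rank(xs, theta, conformity=lambda x, orbit_rep: x):
    """Smoothed orbit rank (3) for the permutation group, by full enumeration.

    conformity(x, orbit_rep) plays the role of A_n(x, gamma_n(X^n)).
    """
    n = len(xs)
    rep = tuple(sorted(xs))            # orbit selector, invariant under permutations
    alpha_n = conformity(xs[-1], rep)  # observed conformity score of X_n
    less = equal = total = 0
    for perm in permutations(range(n)):
        # Last conformity score after transforming the data with this group element.
        score = conformity(xs[perm[-1]], rep)
        less += score < alpha_n
        equal += score == alpha_n
        total += 1
    # Haar measure of a set = (number of group elements in the set) / |G_n|.
    return (less + theta * equal) / total


# With the identity conformity measure this reproduces the smoothed sequential rank:
# one of three values lies below 2.0 and one equals it, so R_3 = (1 + theta) / 3.
print(orbit_rank((1.0, 3.0, 2.0), theta=0.5))
```

Other conformity functions can be passed in, for instance the distance of an observation to the mean of the orbit representative.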

Theorem 1

Suppose that the action of $(G_{n})_{n\in\mathbb{N}}$ on $(\mathcal{X}^{n})_{n\in\mathbb{N}}$ is sequential and that $(X_{n})_{n\in\mathbb{N}}$ is generated by an element of $\mathcal{H}_{0}$. Then $R^{n}\perp\gamma_{n}(X^{n})$ for each $n$ and the distribution of $(R_{n})_{n\in\mathbb{N}}$ is $\mathcal{U}^{\infty}$, the uniform distribution on $[0,1]^{\infty}$.

4 Optimality

In this section, we show that any martingale based on the smoothed orbit ranks as in (2) can be thought of as a likelihood ratio process, and that it is log-optimal against the implicit alternative for which it is built. Indeed, let $P$ be a distribution such that, conditionally on $R^{n-1}$, $R_{n}$ has density $f_{n}$ under $\tilde{P}$ for all $n$. Here, we use $\tilde{P}$ to denote the distribution $P$ with added external randomization, i.e., $\tilde{P}:=P\times\mathcal{U}^{\infty}$, where $\mathcal{U}^{\infty}$ is the uniform distribution on $[0,1]^{\infty}$. Analogously, for each $Q\in\mathcal{H}_{0}$, define $\tilde{Q}:=Q\times\mathcal{U}^{\infty}$. Observe that $M_{n}=\prod_{i\leq n}f_{i}(R_{i})$ equals the likelihood ratio between $\tilde{P}_{R^{n}}$ and $\tilde{Q}_{R^{n}}$, where the latter is the uniform distribution for any $Q\in\mathcal{H}_{0}$. Surprisingly, $M_{n}$ is also the likelihood ratio of the full data between a distribution $P$ such that $R_{n}\perp\gamma_{n}(X^{n})$ and an appropriately chosen distribution $Q^{*}\in\mathcal{H}_{0}$, as shown in the following proposition.

Proposition 2

Suppose that $\alpha_{n}(X^{n})=A_{n}(X_{n},\gamma_{n}(X^{n}))$, where $A_{n}(\ \cdot\ ,\gamma_{n}(X^{n}))$ is a one-to-one function for each $n\in\mathbb{N}$. Furthermore, suppose that $P$ is any distribution under which $R_{n}\perp\gamma_{n}(X^{n})$ for each $n$. If $M_{n}=\prod_{i\leq n}f_{i}(R_{i})$ and each $f_{i}$ is the conditional density of $R_{i}$ given $R^{i-1}$, then, for $Q\in\mathcal{H}_{0}$,

\tilde{Q}\left(M_{n}=\frac{\mathrm{d}P}{\mathrm{d}Q^{*}}(X^{n})\right)=1,

(4)

where $Q^{*}$ denotes the distribution under which the marginal distribution of $\gamma_{n}(X^{n})$ coincides with that under $P$, and such that $X^{n}\mid\gamma_{n}(X^{n})\stackrel{{\scriptstyle\mathcal{D}}}{{=}}U\gamma_{n}(X^{n})\mid\gamma_{n}(X^{n})$, where $U\sim\mu_{n}$ independently of $\gamma_{n}(X^{n})$.

The next theorem uses the representation in (4) to show the log-optimality of $(M_{n})_{n\in\mathbb{N}}$ . Its proof is heavily inspired by Koolen and Grünwald (2022, Theorem 12).

Theorem 2

Assume that $\alpha_{n}(X^{n})=A_{n}(X_{n},\gamma_{n}(X^{n}))$, where $A_{n}(\ \cdot\ ,\gamma_{n}(X^{n}))$ is one-to-one for all $n\in\mathbb{N}$, and let $P$ be any distribution such that $X^{n}\mid\gamma_{n}(X^{n})$ has full support. Denote by $f_{n}$ the density of $R_{n}\mid R^{n-1}$ under $\tilde{P}$. Let $\tau$ be any stopping time and $(E_{n})_{n\in\mathbb{N}}$ any test martingale for $\mathcal{H}_{0}$, both with respect to $\mathbb{F}$, the filtration generated by the smoothed ranks. Then, it holds that

\mathbf{E}_{\tilde{P}}\left[\ln M_{\tau}\right]=\mathbf{E}_{\tilde{P}}\left[\ln\prod_{i=1}^{\tau}f_{i}(R_{i})\right]\geq\mathbf{E}_{\tilde{P}}\left[\ln E_{\tau}\right].

(5)

Moreover, if $\tilde{P}$ is such that $R^{n}\perp\gamma_{n}(X^{n})$ for all $n$, then for any test martingale $E^{\prime}$ for $\mathcal{H}_{0}$ w.r.t. the full-data filtration $(\sigma(X^{n}))_{n\in\mathbb{N}}$, it also holds that

\mathbf{E}_{\tilde{P}}\left[\ln M_{\tau}\right]\geq\mathbf{E}_{\tilde{P}}\left[\ln E^{\prime}_{\tau}\right].

(6)

The additional assumption of independence between $R^{n}$ and $\gamma_{n}(X^{n})$ is necessary for (6) to hold: if $\tilde{P}$ is a distribution under which $R_{1},\dots,R_{n}\not\perp\gamma_{n}(X^{n})$, then the conformal martingale is not in general a likelihood ratio as in (4). For the deterministic stopping time $\tau=n$, the log-optimal statistic is $S_{n}=\prod_{i=1}^{n}f_{n}(R_{1},\dots,R_{n}\mid\gamma_{n}(X^{n}))$, as it can be written as a likelihood ratio (see also Grünwald et al., 2024; Koning, 2023). However, the sequence $(S_{n})_{n\in\mathbb{N}}$ does not necessarily give rise to an anytime-valid test. Using tests based on the sequential ranks circumvents this issue for such alternatives.
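The content of (5) can be illustrated numerically in the simplest case of i.i.d. ranks and a single step: among all calibrators, the true density of the ranks yields the largest expected logarithmic growth. The sketch below assumes a hypothetical alternative under which the ranks have density $f(r)=2r$ and compares the power calibrators $g_{\kappa}(r)=\kappa r^{\kappa-1}$ for a few values of $\kappa$. The expected growth is approximated with the midpoint rule; in this example it equals $\log\kappa-(\kappa-1)/2$ in closed form, which is maximal at $\kappa=2$, that is, at $g_{\kappa}=f$.

```python
import math


def expected_log_growth(f, g, grid=20000):
    """Midpoint-rule approximation of the integral of f(r) * log g(r) over [0, 1],
    the expected log-growth per observation when ranks have density f and we bet with g."""
    h = 1.0 / grid
    return h * sum(
        f((k + 0.5) * h) * math.log(g((k + 0.5) * h)) for k in range(grid)
    )


def true_density(r):
    """Hypothetical alternative: large ranks are more likely than small ones."""
    return 2.0 * r


def power_calibrator(kappa):
    """Calibrator g(r) = kappa * r**(kappa - 1), a density on [0, 1] for kappa > 0."""
    return lambda r: kappa * r ** (kappa - 1.0)


growth = {
    kappa: expected_log_growth(true_density, power_calibrator(kappa))
    for kappa in (1.0, 1.5, 2.0, 3.0)
}
for kappa, value in growth.items():
    print(kappa, round(value, 4))
```

The uniform calibrator ($\kappa=1$) has zero growth, and calibrators that are misspecified in either direction grow more slowly than the true density.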

The optimality of $M_{n}$ in Theorem 2 is contingent on oracle knowledge of the true densities $f_{1},f_{2},\dots$, which are unknown in practice. To counter this, past data can be used sequentially to estimate the true density. This idea has previously been applied for testing exchangeability (Vovk et al., 2005; Fedorova et al., 2012). More precisely, for each $n$, let $\hat{f}_{n}$ be an estimator of $f_{n}$ based on $R^{n-1}$, and consider the martingale defined by $\prod_{i=1}^{n}\hat{f}_{i}(R_{i})$. While this is suboptimal with respect to an oracle that knows the true density, there is limited loss asymptotically if $\hat{f}_{i}$ is a good estimator of $f_{i}$. In order to judge if an estimator is good for the task at hand, consider the difference in expected growth per outcome for fixed $n$, i.e.,

\frac{1}{n}\sum_{i=1}^{n}\mathbf{E}_{\tilde{P}}[\log f(R_{i})-\log\hat{f}_{i}(R_{i})]=\frac{1}{n}\sum_{i=1}^{n}\mathbf{E}_{\tilde{P}}[\mathrm{KL}(f\|\hat{f}_{i})],

(7)

where $\mathrm{KL}(f\|g)=\int_{0}^{1}f(r)\log\frac{f(r)}{g(r)}\mathrm{d}r$ denotes the Kullback-Leibler divergence whenever $f$ is absolutely continuous with respect to $g$, and the expectation on the right-hand side of (7) is over past data (on which $\hat{f}_{i}$ depends). Estimators under which (7) tends to zero are known to exist under weak conditions on the true density $f$ (Haussler and Opper, 1997; Cesa-Bianchi and Lugosi, 2001; Grünwald and Mehta, 2019). Under more stringent assumptions (for example, if the density $f$ belongs to an exponential family), sequential Bayesian-update-type algorithms are known to guarantee that (7) tends to 0 as $n\to\infty$ (Kotłowski and Grünwald, 2011).
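A simple predictable estimator $\hat{f}_{n}$ is a histogram with add-one smoothing, updated only after each rank has been used for betting. The following sketch is one possible choice, made here for illustration and not taken from the references above; the class name `HistogramCalibrator`, the number of bins and the pseudo-counts are all assumptions. The pseudo-counts keep the estimated density strictly positive, so the martingale never hits zero.

```python
class HistogramCalibrator:
    """Predictable histogram density estimate on [0, 1] with add-one smoothing."""

    def __init__(self, bins=10):
        self.bins = bins
        self.counts = [1.0] * bins  # one pseudo-count per bin

    def _bin(self, r):
        # The right endpoint r = 1 is assigned to the last bin.
        return min(int(r * self.bins), self.bins - 1)

    def density(self, r):
        """Current density estimate at r; it integrates to one over [0, 1]."""
        return self.bins * self.counts[self._bin(r)] / sum(self.counts)

    def update(self, r):
        """Add an observed rank to the histogram."""
        self.counts[self._bin(r)] += 1.0


def plug_in_martingale(ranks, bins=10):
    """Path of prod_i hat f_i(R_i), where hat f_i depends on R_1, ..., R_{i-1} only."""
    est = HistogramCalibrator(bins)
    m, path = 1.0, []
    for r in ranks:
        m *= est.density(r)  # bet with the estimate built from past ranks
        est.update(r)        # only then learn from the new rank
        path.append(m)
    return path


# Ranks concentrated near one: the martingale starts flat and then grows.
print(plug_in_martingale([0.95, 0.95, 0.92, 0.97]))
```

The order of the two calls in the loop is what makes the estimator predictable: evaluating the density after the update would break the martingale property.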

5 Applications and Extension

In this section, we discuss applications and an extension of the theory developed above.

5.1 Modification for Independence Testing

We now propose a minor modification of the conformal martingales from the previous section that can be used to test for independence. Formally, fix $K\in\mathbb{N}$ and suppose that at each time point $n\in\mathbb{N}$, a $K$-dimensional vector $X_{n}=(X_{1,n},\dots,X_{K,n})\in\mathcal{X}^{K}$ is observed. We are interested in testing the null hypothesis that states that (a) for each $k=1,\dots,K$ and each $n$, the vectors $(X_{k,1},\dots,X_{k,n})$ are $G_{n}$-invariant, and (b) $(X_{k,1},\dots,X_{k,n})\perp(X_{k^{\prime},1},\dots,X_{k^{\prime},n})$ for all $k\neq k^{\prime}\in\{1,\dots,K\}$. Under this hypothesis, the data are invariant under the sequential action of $(\tilde{G}_{n})_{n\in\mathbb{N}}$ given by $\tilde{G}_{n}=G_{n}^{K}$, acting on $\mathcal{X}^{K\times n}$ rowwise. That is, the first copy of the group acts on $(X_{1,1},\dots,X_{1,n})$, the second on $(X_{2,1},\dots,X_{2,n})$, etc. This action is sequential any time that the action of $(G_{n})_{n\in\mathbb{N}}$ is sequential on each of the $K$ data streams.

Based on the discussion above, a first idea to test for invariance under $\tilde{G}_{n}$ is to create $K$ test martingales and combine them through multiplication. More specifically, we can treat each of the sequences $(X_{k,n})_{n\in\mathbb{N}}$ , $k\in\{1,\dots,K\}$ as a separate data stream and compute the corresponding statistics in (3), leading to $K$ sequences of uniformly distributed random variables $(R_{k,n})_{n\in\mathbb{N}}$ . If, for all $n\in\mathbb{N}$ and $k\in\{1,\dots,K\}$ , $f_{k,n}$ is a density on $[0,1]$ then, by independence, the sequence $(M_{n}^{\prime})_{n\in\mathbb{N}}$ defined by $M_{n}^{\prime}=\prod_{i=1}^{n}\prod_{k=1}^{K}f_{k,i}(R_{k,i})$ is a martingale under the null hypothesis. However, this martingale would not be able to detect alternatives under which the marginals are group invariant, but not independent. This stems from the fact that it only uses that the marginals are uniform under the null, while in fact a stronger claim is true: for each $n$ , the joint distribution of $R_{k,n}$ , $k\in\{1,\dots,K\}$ , is uniform on $[0,1]^{K}$ . As a result of this observation, one can choose any sequence of joint density (estimators) $f_{1},f_{2},\dots$ on $[0,1]^{K}$ and create a test martingale by considering $M_{n}=\prod_{i=1}^{n}f_{i}(R_{1,i},\dots,R_{K,i})$ .
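For $K=2$, a joint density estimator on $[0,1]^{2}$ can be sketched with a two-dimensional histogram. The code below is a minimal illustration under assumed choices (number of bins, add-one smoothing, the function name `joint_histogram_martingale`) and is not claimed to reproduce any published procedure. Each factor is the current histogram density at the new pair of ranks, computed before the pair is added to the counts.

```python
def joint_histogram_martingale(rank_pairs, bins=4):
    """Test martingale from pairs (R_{1,i}, R_{2,i}) using a predictable
    two-dimensional histogram density with add-one smoothing."""
    counts = [[1.0] * bins for _ in range(bins)]  # one pseudo-count per cell
    total = float(bins * bins)
    m, path = 1.0, []
    for r1, r2 in rank_pairs:
        a = min(int(r1 * bins), bins - 1)
        b = min(int(r2 * bins), bins - 1)
        # Density estimate at the new pair: cell probability divided by cell area.
        m *= bins * bins * counts[a][b] / total
        counts[a][b] += 1.0
        total += 1.0
        path.append(m)
    return path


# Perfectly dependent streams: both ranks are uniform marginally, but the pairs
# lie on the diagonal, so the joint histogram picks up the dependence.
pairs = [(u, u) for u in (0.1, 0.6, 0.15, 0.7, 0.9, 0.2, 0.65, 0.95, 0.05, 0.8)]
print(joint_histogram_martingale(pairs)[-1])
```

A product of two marginal histograms would be unable to exploit this kind of dependence, because each marginal sequence of ranks is uniform.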

In the case that $K=2$ and $G_{n}=n!$, this is the procedure that was recently employed by Henzi and Law (2023). They discuss a specific choice of $f_{n}$, a histogram density estimator, that is able to detect departures from independence consistently under the stronger assumption that the data are i.i.d. One of their key insights is that independence of the data streams not only implies joint uniformity of the sequential ranks in their setting, but the two are actually equivalent. This equivalence breaks down if one does not assume that $X_{k,1},X_{k,2},\dots$ are i.i.d. for all $k$. Finding conditions under which the independence of the streams and the joint uniformity of the rank distributions are equivalent, so that a histogram density estimator might reliably detect departures from independence in the more general setting, is left for future work.

5.2 The orthogonal group and linear models

Consider testing whether the data we observe are drawn from a spherically symmetric distribution, i.e., $\mathcal{X}=\mathbb{R}$ and $G_{n}=\mathrm{O}(n)$, where $\mathrm{O}(n)$ is the orthogonal group in dimension $n$. Testing for spherical symmetry is equivalent to testing whether the data are generated by a zero-mean Gaussian distribution. This follows from the fact that any distribution on $\mathbb{R}^{\infty}$ for which the marginal of the first $n$ coordinates is spherically symmetric for any $n$ can be written as a mixture of i.i.d. zero-mean Gaussian distributions (Bernardo and Smith, 2009, Proposition 4.4). It follows that any process that is a supermartingale under all zero-mean Gaussian distributions is also a supermartingale under spherical symmetry and vice versa. This implies that, for the purpose of testing with supermartingales, the two hypotheses are equivalent. We show how this fits in our setting, and defer the application to regression to the supplementary material.

We now check that testing spherical symmetry fits in our setting, i.e., that Definition 1 is fulfilled. Consider the inclusion of $\mathrm{O}(n)$ in $\mathrm{O}(n+1)$ given by

\imath_{n+1}(O_{n})=\begin{pmatrix}O_{n}&0\\ 0&1\end{pmatrix}

for each $O_{n}\in\mathrm{O}(n)$. Using the canonical projections in $\mathbb{R}^{n}$, Definition 1 is readily checked. Since the data are real, $\alpha^{n}$ can be chosen to be the identity for all $n$, i.e., $\alpha^{n}(X^{n})=X^{n}$. An orbit selector is given by $\gamma_{n}(X^{n})=\|X^{n}\|e_{1}$, where $e_{1}$ is the unit vector $e_{1}=(1,0,\dots,0)$. For simplicity, we assume that the distribution of $X^{n}$ has a density with respect to the Lebesgue measure for each $n$, so that $R_{n}=\mu_{n}(\{O_{n}\in\mathrm{O}(n):(O_{n}X^{n})_{n}<X_{n}\})$ and no external randomization is needed. Rather than thinking of $\mu_{n}$ as a measure on $\mathrm{O}(n)$, one can think of it as the uniform measure on the sphere $S^{n-1}(\|X^{n}\|)$ of radius $\|X^{n}\|$. This way, $R_{n}$ can be recognized to be the relative surface area of the hyper-spherical cap with co-latitude angle $\varphi_{n}=\pi-\cos^{-1}(X_{n}/\|X^{n}\|)$. Li (2010) shows that an explicit expression for this area is given by

R_{n}=\begin{cases}1-\frac{1}{2}I_{\sin^{2}(\pi-\varphi_{n})}\left(\frac{n-1}{2},\frac{1}{2}\right)&\text{if }\varphi_{n}>\frac{\pi}{2},\\ \frac{1}{2}I_{\sin^{2}(\varphi_{n})}\left(\frac{n-1}{2},\frac{1}{2}\right)&\text{else},\end{cases}

(8)

where $I_{x}(a,b)$ denotes the regularized incomplete beta function, $I_{x}(a,b)=\frac{B(x,a,b)}{B(1,a,b)}$, with $B(x,a,b)=\int_{0}^{x}t^{a-1}(1-t)^{b-1}\mathrm{d}t$ for $0\leq x\leq 1$.

Note that $\varphi_{n}>\frac{\pi}{2}$ if and only if $X_{n}>0$ and that $\sin^{2}(\varphi_{n})=1-\frac{X_{n}^{2}}{\|X^{n}\|^{2}}$, so that (8) equals the CDF of the t-distribution with $n-1$ degrees of freedom evaluated at $t=\sqrt{n-1}X_{n}/\|X^{n-1}\|$. If $X^{n}\sim\mathcal{N}(0,\sigma^{2}I_{n})$, then $t$ is the ratio of a standard normal random variable and the square root of an independent chi-squared random variable divided by its $n-1$ degrees of freedom. Therefore, $t$ has a t-distribution with $n-1$ degrees of freedom, so that we essentially perform a type of sequential t-test.
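The t-distribution representation gives a direct way to compute $R_{n}$ for $n\geq 2$. The sketch below uses only the Python standard library and therefore evaluates the t CDF by Simpson integration of the density, which is adequate for moderate values of $|t|$; a library routine for the t CDF or for the regularized incomplete beta function in (8) would be the natural replacement. The function names are hypothetical, and $\|X^{n-1}\|>0$ is assumed.

```python
import math


def student_t_cdf(t, df, steps=2000):
    """CDF of the t-distribution with df degrees of freedom, computed by
    Simpson integration of the density on [0, |t|] (steps must be even)."""
    log_c = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = abs(t) / steps
    s = pdf(0.0) + pdf(abs(t))
    for k in range(1, steps):
        s += (4 if k % 2 else 2) * pdf(k * h)
    half = s * h / 3  # probability mass between 0 and |t|
    return 0.5 + half if t >= 0 else 0.5 - half


def spherical_orbit_rank(xs):
    """Orbit rank R_n under the orthogonal group, for n = len(xs) >= 2:
    the t CDF with n - 1 degrees of freedom at sqrt(n - 1) * X_n / ||X^{n-1}||."""
    n = len(xs)
    norm_prev = math.sqrt(sum(x * x for x in xs[:-1]))  # assumed to be nonzero
    t = math.sqrt(n - 1) * xs[-1] / norm_prev
    return student_t_cdf(t, n - 1)


# In two dimensions the rank is the fraction of the circle through (1, 1)
# whose second coordinate lies below 1, which is three quarters of the circle.
print(spherical_orbit_rank([1.0, 1.0]))
```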

This example can be extended to testing for centered spherical symmetry, i.e., whether $X^{n}=\mu\mathbf{1}_{n}+\epsilon^{n}$, where $\mathbf{1}_{n}$ is the all-ones $n$-vector, $\mu\in\mathbb{R}$ and $\epsilon^{n}$ is spherically symmetric for every $n\in\mathbb{N}$. By similar reasoning as above, this is equivalent to testing whether the data are i.i.d. Gaussian with arbitrary mean and variance. Moreover, by considering different isotropy groups, one can also cover the case where the mean $\mu$ is not fixed, but depends on covariates. The techniques needed in that case are similar; we show them in Appendix B of the supplementary material.

6 Discussion

We have discussed how the theory of conformal prediction can be applied to test for symmetry of infinite sequences of data. Here we discuss two topics. First, the relationship to noninvariant conformal martingales. Second, whether smoothing is necessary when defining orbit ranks.

6.1 Noninvariant conformal martingales

Not all online compression models correspond to a compact-group invariant null hypothesis. An interesting example of this phenomenon is when the data are i.i.d. and exponentially distributed. This distribution is invariant under reflections in any $45^{\circ}$ line (not necessarily through the origin), but these reflections do not define a compact group and therefore do not fit the setting discussed in this article. Nevertheless, the sum of data points is a sufficient statistic for the data, so this model can still be seen as an online compression model with the sum being the summary. More work is needed to find out whether conformal martingales are log-optimal against certain alternatives in such settings.

6.2 The need for smoothing

In situations where, conditionally on the orbit selector $\gamma_{n}(X^{n})$, the conformity measure $\alpha^{n}(X^{n})$ has a continuous distribution, the smoothing plays no role in (4). This is the case for the rotations discussed in Section 5.2. In certain other scenarios, smoothing can be avoided as well. Indeed, one can always define nonsmoothed orbit ranks, as opposed to the smoothed ranks $R_{n}$ from Definition 4, by $\tilde{R}_{n}:=\mu_{n}(\{g\in G_{n}:(g\alpha^{n})_{n}<\alpha_{n}\})$, that is, by dropping the randomized tie term. Notice that this nonsmoothed version satisfies $\tilde{R}_{n}\leq R_{n}$. For a particular choice of increasing densities $f_{1},f_{2},\dots$ on $[0,1]$ (in the sense that $u\mapsto f_{i}(u)$ is increasing), the process $\tilde{M}_{n}:=\prod_{i=1}^{n}f_{i}(\tilde{R}_{i})$ is bounded from above by the conformal martingale $M_{n}=\prod_{i=1}^{n}f_{i}(R_{i})$. Such a choice of increasing $f_{i}$ is natural when high values of $R_{i}$ (or $\tilde{R}_{i}$) are associated with departures from the null hypothesis. Then, any sequential test that rejects when $\tilde{M}_{n}$ exceeds an upper threshold inherits the anytime-valid type-I error guarantees of $M_{n}$, exactly because $\tilde{M}_{n}\leq M_{n}$. This was previously noted by Vovk et al. (2003). However, the process $\tilde{M}_{n}$ may not be a martingale itself. Instead, a test martingale can sometimes be associated directly with $\tilde{R}_{n}$. For instance, in the setting of Example 1 (testing exchangeability), the distribution of $\tilde{R}_{n}$ under the null hypothesis is known: when ties occur with probability zero, $n\tilde{R}_{n}$ is uniformly distributed on $\{0,\dots,n-1\}$. Therefore, we can construct likelihood ratio processes for the sequence of nonsmoothed ranks. Moreover, there are parametric alternatives under which the exact distributions of the nonsmoothed ranks can be computed. This is the case for Lehmann alternatives where, under the null, each $X_{i}$ is assumed to be sampled from some continuous distribution with c.d.f. $F_{i}(x)=F_{0}(x)$ for some fixed $F_{0}$; under the alternative, $F_{i}(x)=1-(1-F_{0}(x))^{\theta_{i}}$ for some $\theta_{i}$. From Theorem 7.a.1 of Savage (1956) the distribution of $\tilde{R}_{i}$ can be derived, so that the likelihood ratio process of $\tilde{R}_{i}$ can be used for testing, thus avoiding external randomization.
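For the exchangeability setting of Example 1, the smoothed orbit ranks and the conformal martingale $M_{n}=\prod_{i=1}^{n}f_{i}(R_{i})$ discussed above can be sketched as follows. This is a minimal illustration under assumptions made here: $\alpha^{n}$ is the identity, $G_{n}$ is the permutation group (so that $\mu_{n}$ reduces to counting among the first $n$ observations), and the increasing density is taken to be $f_{i}(u)=2u$ for every $i$. The function names are hypothetical.

```python
import random

def smoothed_ranks(xs, rng):
    """Smoothed sequential ranks R_1, ..., R_n under the permutation group."""
    ranks = []
    for n in range(1, len(xs) + 1):
        last = xs[n - 1]
        below = sum(1 for v in xs[:n] if v < last)   # n * mu_n{(g x)_n < x_n}
        ties = sum(1 for v in xs[:n] if v == last)   # n * mu_n{(g x)_n = x_n}, counts x_n itself
        theta = rng.random()                          # external randomization
        ranks.append((below + theta * ties) / n)
    return ranks

def conformal_martingale(ranks, density=lambda u: 2.0 * u):
    """Running product M_n = prod_i f(R_i) for a density f on [0, 1]."""
    value, path = 1.0, []
    for r in ranks:
        value *= density(r)
        path.append(value)
    return path

rng = random.Random(0)
data = [0.1 * k + rng.gauss(0.0, 0.5) for k in range(200)]  # upward drift: not exchangeable
path = conformal_martingale(smoothed_ranks(data, rng))
print(max(path) >= 20)  # under the null, Ville's inequality bounds this event by 1/20
```

The density $2u$ rewards large ranks, which matches a drift alternative; other increasing densities would serve equally well as long as they integrate to one on $[0,1]$.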

7 Acknowledgements

We thank the attendees of the Seminar on Anytime-Valid Inference “E-readers” at Centrum Wiskunde & Informatica in Amsterdam for their input and valuable insights. In particular, we are grateful to Peter Grünwald for feedback on a first version of this article, and Nick Koning for fruitful discussions.

References

Bernardo and Smith (2009) Bernardo, J.M., Smith, A.F., 2009. Bayesian theory. volume 405. John Wiley & Sons.
Bondar (1976) Bondar, J.V., 1976. Borel cross-sections and maximal invariants. The Annals of Statistics, 866–877.
Bourbaki and Berberian (2004) Bourbaki, N., Berberian, S., 2004. Integration II: Chapters 7–9. Springer Berlin Heidelberg.
Cesa-Bianchi and Lugosi (2001) Cesa-Bianchi, N., Lugosi, G., 2001. Worst-case bounds for the logarithmic loss of predictors. Machine Learning 43, 247–264.
Eaton (1989) Eaton, M.L., 1989. Group invariance applications in statistics. Regional Conference Series in Probability and Statistics 1, i–133. URL: http://www.jstor.org/stable/4153172.
Fedorova et al. (2012) Fedorova, V., Gammerman, A., Nouretdinov, I., Vovk, V., 2012. Plug-in martingales for testing exchangeability on-line, in: Proceedings of the 29th International Conference on Machine Learning, Omnipress, New York, NY, USA. pp. 1639–1646.
Grünwald and Mehta (2019) Grünwald, P.D., Mehta, N.A., 2019. A tight excess risk bound via a unified PAC-Bayesian–Rademacher–Shtarkov–MDL complexity, in: Algorithmic Learning Theory, PMLR. pp. 433–465.
Grünwald et al. (2024) Grünwald, P., de Heide, R., Koolen, W., 2024. Safe Testing. JRSS B: Statistical Methodology URL: https://doi.org/10.1093/jrsssb/qkae011.
Haussler and Opper (1997) Haussler, D., Opper, M., 1997. Mutual information, metric entropy and cumulative relative entropy risk. The Annals of Statistics 25, 2451–2492. URL: http://www.jstor.org/stable/2959041.
Henzi and Law (2023) Henzi, A., Law, M., 2023. A rank-based sequential test of independence. ArXiv preprint arXiv:2305.13818.
Koning (2023) Koning, N.W., 2023. Online permutation tests: $e$-values and likelihood ratios for testing group invariance. arXiv preprint arXiv:2310.01153.
Koolen and Grünwald (2022) Koolen, W.M., Grünwald, P., 2022. Log-optimal anytime-valid e-values. Int. Journal of Approximate Reasoning 141, 69–82.
Kotłowski and Grünwald (2011) Kotłowski, W., Grünwald, P., 2011. Maximum likelihood vs. sequential normalized maximum likelihood in on-line density estimation, in: Proceedings of the 24th Annual Conference on Learning Theory, JMLR Workshop and Conference Proceedings. pp. 457–476.
Li (2010) Li, S., 2010. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics & Statistics 4, 66–70.
Malov (1996) Malov, S., 1996. Sequential ranks and order statistics. Journal of Mathematical Sciences 81, 2434–2441.
Ramdas et al. (2023) Ramdas, A., Grünwald, P., Vovk, V., Shafer, G., 2023. Game-theoretic statistics and safe anytime-valid inference. Statistical Science 38, 576–601.
Ramdas et al. (2022) Ramdas, A., Ruf, J., Larsson, M., Koolen, W.M., 2022. Testing exchangeability: Fork-convexity, supermartingales and e-processes. International Journal of Approximate Reasoning 141, 83–109.
Rényi (1962) Rényi, A., 1962. On the extreme elements of observations. MTA III. Oszt. Közl 12, 105–121.
Savage (1956) Savage, I.R., 1956. Contributions to the Theory of Rank Order Statistics-the Two-Sample Case. The Annals of Mathematical Statistics 27, 590–615. doi:10.1214/aoms/1177728170.
Shiryaev (2016) Shiryaev, A.N., 2016. Probability-1. volume 95. Springer.
Smith (1981) Smith, A.F., 1981. On random sequences with centred spherical symmetry. Journal of the Royal Statistical Society: Series B (Methodological) 43, 208–209.
Vovk (2002) Vovk, V., 2002. On-line confidence machines are well-calibrated, in: The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. Proceedings., IEEE. pp. 187–196.
Vovk (2004) Vovk, V., 2004. A universal well-calibrated algorithm for on-line classification. The Journal of Machine Learning Research 5, 575–604.
Vovk (2023) Vovk, V., 2023. The power of forgetting in statistical hypothesis testing, in: Conformal and Probabilistic Prediction with Applications, PMLR. pp. 347–366.
Vovk et al. (2005) Vovk, V., Gammerman, A., Shafer, G., 2005. Algorithmic learning in a random world. volume 29. Springer.
Vovk et al. (2003) Vovk, V., Nouretdinov, I., Gammerman, A., 2003. Testing exchangeability on-line, in: Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 768–775.
Vovk and Wang (2021) Vovk, V., Wang, R., 2021. E-values: Calibration, combination and applications. The Annals of Statistics 49, 1736–1754.

Appendix

Appendix for “Anytime-Valid Tests of Group Invariance through Conformal Prediction” by Tyron Lardy and Muriel Felipe Pérez-Ortiz.

Appendix A Proofs

A.1 Proof of Theorem 1

Proof 1 (Theorem 1)

The proof can be divided into two main steps: (1) to show that, conditionally on $\gamma_{n}(X^{n})$, $R_{n}$ is uniformly distributed for each $n$, and (2) to show that $R_{1},R_{2},\dots$ are also independent. The second step is completely analogous to the proof of Theorem 3 by Vovk (2002). For each $n$, define the $\sigma$-algebra $\mathcal{G}_{n}=\sigma(\gamma_{n}(X^{n}),X_{n+1},X_{n+2},\dots)$. Notice that all $G_{n}$-invariant functions of $X^{n}$ are $\mathcal{G}_{n}$-measurable, because $\gamma_{n}$ is a maximally invariant function of $X^{n}$: any other $G_{n}$-invariant function of $X^{n}$ is a function of $\gamma_{n}(X^{n})$. Let $g^{\prime}\in G_{n}$ be such that $\gamma_{n}(X^{n})=g^{\prime}X^{n}$; then we have that $\{g\in G_{n}:(g\alpha^{n})_{n}<\alpha_{n}\}=\{g\in G_{n}:(\alpha^{n}(g\gamma_{n}(X^{n})))_{n}<\alpha_{n}\}g^{\prime}$. By the invariance of $\mu_{n}$, which is the Haar probability measure, it follows that

\mu_{n}(\{g\in G_{n}:(g\alpha^{n})_{n}<\alpha_{n}\})=\mu_{n}(\{g\in G_{n}:(\alpha^{n}(g\gamma_{n}(X^{n})))_{n}<\alpha_{n}\}).

An analogous identity can be derived for the second term in (3). Under the null hypothesis, we have $\alpha_{n}\mid\mathcal{G}_{n}\stackrel{{\scriptstyle\mathcal{D}}}{{=}}(\alpha^{n}(U\gamma_{n}(X^{n})))_{n}\mid\mathcal{G}_{n}$, where $U\sim\mu_{n}$ is uniform on $G_{n}$.

We will denote $F(b):=\mu_{n}(\{g\in G_{n}:(g\alpha^{n})_{n}<b\})$ and define $G(\delta)=\sup\{b\in\mathbb{R}:F(b)\leq\delta\}$. If $\alpha_{n}\mid\mathcal{G}_{n}$ is continuous, then $F$ is the CDF of that distribution; otherwise it is the CDF minus the probability of equality. In any case, $F$ is nondecreasing and left-continuous. For any $\delta\in(0,1)$, we have that $F(G(\delta))=\delta^{\prime}$ for some $\delta^{\prime}\leq\delta$, with equality if $F$ is continuous at $G(\delta)$. Then we can write

\mathbb{P}(R_{n}\leq\delta\mid\mathcal{G}_{n})=\mathbb{P}(R_{n}\leq\delta^{\prime}\mid\mathcal{G}_{n})+\mathbb{P}(\delta^{\prime}<R_{n}\leq\delta\mid\mathcal{G}_{n}).

(9)

For any $\theta\in(0,1]$ , we have that $R_{n}=F(\alpha_{n})+\theta(F(\alpha_{n}^{+})-F(\alpha_{n}))\leq\delta^{\prime}$ if and only if either $F(\alpha_{n})<\delta^{\prime}$ or $F(\alpha_{n}^{+})-F(\alpha_{n})=0$ , which happens precisely when $\alpha_{n}<G(\delta)$ . We therefore see

\mathbb{P}(R_{n}\leq\delta^{\prime}\mid\mathcal{G}_{n})=\mathbb{P}(\alpha_{n}<G(\delta^{\prime})\mid\mathcal{G}_{n})=F(G(\delta^{\prime}))=\delta^{\prime}.

If $F$ is continuous in $G(\delta)$ , then this shows that $\mathbb{P}(R_{n}\leq\delta\mid\mathcal{G}_{n})=\delta$ , since $\delta^{\prime}=\delta$ in that case. If $F$ is not continuous in $G(\delta)$ , then we have that

\displaystyle\mathbb{P}(\delta^{\prime}<R_{n}\leq\delta\mid\mathcal{G}_{n})

\displaystyle=\mathbb{P}(\delta^{\prime}<F(\alpha_{n})+\theta(F(\alpha^{+}_{n})-F(\alpha_{n}))\leq\delta\mid\mathcal{G}_{n}).

Notice that $\delta^{\prime}<F(\alpha_{n})+\theta({F}(\alpha^{+}_{n})-F(\alpha_{n}))\leq\delta$ if and only if $\alpha_{n}=G(\delta)$ and $\theta\leq(\delta-\delta^{\prime})/(F(\alpha^{+}_{n})-F(\alpha_{n}))$, so that we can write

	$\displaystyle\mathbb{P}(\delta^{\prime}<R_{n}\leq\delta\mid\mathcal{G}_{n})$	$\displaystyle=\mathbb{P}(\alpha_{n}=G(\delta)\mid\mathcal{G}_{n})\mathbb{P}\left(\theta\leq\frac{\delta-\delta^{\prime}}{F(G(\delta^{\prime})^{+})-F(G(\delta^{\prime}))}\mid\mathcal{G}_{n}\right)$
		$\displaystyle=({F}(G(\delta^{\prime})^{+})-{F}(G(\delta^{\prime})))\frac{\delta-\delta^{\prime}}{({F}(G(\delta^{\prime})^{+})-{F}(G(\delta^{\prime})))}$
		$\displaystyle=\delta-\delta^{\prime}.$

Putting everything together, we see that $\mathbb{P}(R_{n}\leq\delta\mid\mathcal{G}_{n})=\delta$ . This shows the first part, that $R_{n}$ has a conditional uniform distribution on $[0,1]$ .

For the second part of the proof, we show that the sequence $R_{1},R_{2},\dots$ is also an independent sequence. We have that $R_{n}$ is $\mathcal{G}_{n-1}$ -measurable because it is invariant under transformations of the form $X^{n}\mapsto(gX^{n-1},X_{n})$ for $g\in G_{n-1}$ (see also Vovk, 2004, Lemma 2). We proceed (implicitly) by induction:

	$\displaystyle\mathbb{P}(R_{n}\leq\delta_{n},\dots,R_{1}\leq\delta_{1}\mid\mathcal{G}_{n})$	$\displaystyle=\mathbf{E}\left[\mathbf{1}\left\{R_{n}\leq\delta_{n},\dots,R_{1}\leq\delta_{1}\right\}\mid\mathcal{G}_{n}\right]$
		$\displaystyle=\mathbf{E}\left[\mathbf{E}\left[\mathbf{1}\left\{R_{n}\leq\delta_{n},\dots,R_{1}\leq\delta_{1}\right\}\mid\mathcal{G}_{n-1}\right]\mid\mathcal{G}_{n}\right]$
		$\displaystyle=\mathbf{E}\left[\mathbf{1}\left\{R_{n}\leq\delta_{n}\right\}\mathbf{E}\left[\mathbf{1}\left\{R_{n-1}\leq\delta_{n-1},\dots,R_{1}\leq\delta_{1}\right\}\mid\mathcal{G}_{n-1}\right]\mid\mathcal{G}_{n}\right]$
		$\displaystyle=\mathbf{E}\left[\mathbf{1}\left\{R_{n}\leq\delta_{n}\right\}\mid\mathcal{G}_{n}\right]\delta_{n-1}\cdots\delta_{1}$
		$\displaystyle=\delta_{n}\cdots\delta_{1}.$

It follows by the law of total expectation that

\mathbb{P}(R_{n}\leq\delta_{n},\dots,R_{1}\leq\delta_{1})=\delta_{n}\cdots\delta_{1},

which shows that $R_{1},R_{2},\dots,R_{n}$ are independent and uniformly distributed on $[0,1]$ for any $n\in\mathbb{N}$ . This implies that the distribution of $R_{1},R_{2},\dots$ coincides with $U^{\infty}$ by Kolmogorov’s extension theorem (see e.g. Shiryaev, 2016, Theorem II.3.3). This shows the claim of the theorem.
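The conclusion of Theorem 1 can be checked empirically in a simple case. The sketch below assumes i.i.d. Bernoulli data (so that ties are frequent and the smoothing matters), the permutation group and the identity conformity measure. It estimates $\mathbb{P}(R_{4}\leq 0.5,R_{5}\leq 0.3)$, which should be close to $0.5\times 0.3$ if the smoothed ranks are independent and uniform. The names, sample sizes and thresholds are illustrative choices.

```python
import random

def smoothed_rank(xs, theta):
    """Smoothed rank of the last entry of xs among all entries (permutation group)."""
    last = xs[-1]
    below = sum(1 for v in xs if v < last)
    ties = sum(1 for v in xs if v == last)  # includes the last entry itself
    return (below + theta * ties) / len(xs)

def joint_frequency(reps=20000, seed=0):
    """Relative frequency of the event {R_4 <= 0.5 and R_5 <= 0.3} over simulated sequences."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [1 if rng.random() < 0.3 else 0 for _ in range(5)]  # i.i.d. Bernoulli(0.3)
        r4 = smoothed_rank(xs[:4], rng.random())
        r5 = smoothed_rank(xs, rng.random())
        hits += int(r4 <= 0.5 and r5 <= 0.3)
    return hits / reps

print(joint_frequency())  # should be close to 0.5 * 0.3 = 0.15
```

Without the randomization term the ranks of binary data would take only a few values, and the product structure above would fail.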

A.2 Proof of Proposition 1

Proof 2 (Proposition 1)

For $X^{n-1}\in\mathcal{X}^{n-1}$ and $X_{n}\in\mathcal{X}$ , let $F_{n}(X^{n-1},X_{n})=\gamma_{n}((X^{n-1},X_{n}))$ , where, by a slight abuse of notation, we refer by $(X^{n-1},X_{n})$ to the concatenation of $X^{n-1}$ and $X_{n}$ . We will show that $F_{n}$ has the claimed properties. First, we will show that the vectors $(\gamma_{n-1}(X^{n-1}),X_{n})$ and $X^{n}$ are in the same orbit, so that also $\gamma_{n}((\gamma_{n-1}(X^{n-1}),X_{n}))=\gamma_{n}(X^{n})$ . To this end, let $g^{\prime}\in G_{n-1}$ denote the group element such that $g^{\prime}X^{n-1}=\gamma_{n-1}(X^{n-1})$ . Then it holds that

	$\displaystyle\{g(\gamma_{n-1}(X^{n-1}),X_{n}):g\in G_{n}\}$	$\displaystyle=\{g(g^{\prime}X^{n-1},X_{n}):g\in G_{n}\}$
		$\displaystyle=\{g\imath_{n}(g^{\prime})X^{n}:g\in G_{n}\}$
		$\displaystyle=\{gX^{n}:g\in G_{n}\},$

where we called $X^{n}$ the concatenation of $X^{n-1}$ and $X_{n}$ . This shows the first claim. For the second claim, that $F_{n}(\ \cdot\ ,X_{n})$ is one-to-one for each fixed $X_{n}$ , we show that we can reconstruct $\gamma_{n-1}(X^{n-1})$ from $X_{n}$ and $\gamma_{n}(X^{n})$ .

Pick any $g_{X_{n}}\in G_{n}$ such that $(g_{X_{n}}\gamma_{n}(X^{n}))_{n}=X_{n}$ . We furthermore know that there exists some $g\in G_{n}$ such that $gX^{n}=\gamma_{n}(X^{n})$ . Note that $g_{X_{n}}g$ does nothing to the final coordinate of $X^{n}$ , so by Assumption 1 there is a $g_{n-1}^{*}\in G_{n-1}$ such that $g_{X_{n}}gX^{n}=\imath(g_{n-1}^{*})X^{n}$ . Then we see

	$\displaystyle\{\imath(g_{n-1})g_{X_{n}}\gamma_{n}(X^{n}):g_{n-1}\in G_{n-1}\}$	$\displaystyle=\{\imath(g_{n-1})g_{X_{n}}gX^{n}:g_{n-1}\in G_{n-1}\}$
		$\displaystyle=\{\imath(g_{n-1})\imath(g_{n-1}^{*})X^{n}:g_{n-1}\in G_{n-1}\}$
		$\displaystyle=\{\imath(g_{n-1})X^{n}:g_{n-1}\in G_{n-1}\}.$

We find that $G_{n-1}\mathrm{proj}_{n-1}(g_{X_{n}}\gamma_{n}(X^{n}))=G_{n-1}X^{n-1}$ and therefore $\gamma_{n-1}(\mathrm{proj}_{n-1}(g_{X_{n}}\gamma_{n}(X^{n})))=\gamma_{n-1}(X^{n-1})$.

A.3 Proof of Proposition 2

The proof of Proposition 2 follows directly from Lemma 1, which states that, with probability one, enough of the original data can be recovered using the smoothed ranks and the orbit representative. We state Lemma 1, prove Proposition 2 and then prove Lemma 1.

Lemma 1

Suppose that, for each $n\in\mathbb{N}$, $\alpha_{n}(\ \cdot\ ,\gamma_{n}(X^{n}))$ is a one-to-one function of $X_{n}$. Then there exists a map $D_{n}:[0,1]^{n}\times\mathcal{X}^{n}\to[0,1]^{n}\times\mathcal{X}^{n}$ such that, for any $Q\in\mathcal{H}_{0}$, $\tilde{Q}(D_{n}(R^{n},\gamma_{n}(X^{n}))=(\tilde{\theta}^{n},X^{n}))=1$. Here, $\tilde{\theta}^{n}=(\tilde{\theta}_{1},\dots,\tilde{\theta}_{n})$ is given by $\tilde{\theta}_{i}=\theta_{i}\mathbf{1}\left\{\mu_{i}(\{g\in G_{i}:(g\alpha^{i})_{i}=\alpha_{i}\})\neq 0\right\}$.

Proof 3 (Proposition 2)

Consider, without loss of generality, the case that $\alpha_{n}(X^{n})=X_{n}$. Because of the independence of $R_{n}$ and $\gamma_{n}$ under $P$ and the assumption that the marginal distributions of $\gamma_{n}$ under $Q^{*}$ and under $P$ are equal, $M_{n}=\frac{\mathrm{d}\tilde{P}(R^{n},\gamma_{n}(X^{n}))}{\mathrm{d}\tilde{Q}^{*}(R^{n},\gamma_{n}(X^{n}))}$. Using the sequence of functions $(D_{n})_{n\in\mathbb{N}}$ from Lemma 1 and the fact that the external randomization is independent of $X^{n}$, the claim follows.

Proof 4 (Lemma 1)

As in the proof of Theorem 1, we will denote $F(b)=\mu_{n}(\{g\in G_{n}:(g\alpha^{n})_{n}<b\})$ and define $G(\delta)=\sup\{b\in\mathbb{R}:F(b)\leq\delta\}$ . Furthermore, we will write $\mathbb{P}_{\alpha_{n}\mid\gamma_{n}(X^{n})}$ for the distribution of $\alpha_{n}$ given $\gamma_{n}(X^{n})$ and denote its support by

\mathrm{supp}(\mathbb{P}_{\alpha_{n}\mid\gamma_{n}(X^{n})}):=\{x\in\mathbb{R}\mid\text{for all }I\text{ open, if }x\in I\text{ then }\mathbb{P}_{\alpha_{n}\mid\gamma_{n}(X^{n})}(I)>0\}.

If $b\in\mathrm{int}(\mathrm{supp}(\mathbb{P}_{\alpha_{n}\mid\gamma_{n}(X^{n})}))$, then there exists an open interval $B$ with $b\in B$ and $B\subseteq\mathrm{supp}(\mathbb{P}_{\alpha_{n}\mid\gamma_{n}(X^{n})})$. For all $c\in B$ with $c>b$, we have that $F(c)-F(b)=\mathbb{P}_{\alpha_{n}\mid\gamma_{n}(X^{n})}([b,c))>0$, since $[b,c)$ contains an open neighborhood of an interior point of the support. It follows that $F(c)>F(b)$. In words, there are no points $c$ to the right of $b$ such that $F(c)\leq F(b)$. Consequently, we have

G(F(b))=\sup\{a\in\mathbb{R}:F(a)\leq F(b)\}=b.

In a similar fashion, we can conclude that the same identity holds if $b\in\mathrm{supp}(\mathbb{P}_{\alpha_{n}\mid\gamma_{n}(X^{n})})\setminus\mathrm{int}(\mathrm{supp}(\mathbb{P}_{\alpha_{n}\mid\gamma_{n}(X^{n})}))$. Notice furthermore that $G(R_{n})=G(F(\alpha_{n})+\theta_{n}({F}(\alpha_{n}^{+})-F(\alpha_{n})))=G(F(\alpha_{n}))$ whenever $\theta_{n}<1$, which happens with probability one. Together with the fact that $\mathbb{P}_{\alpha_{n}\mid\gamma_{n}(X^{n})}(\mathrm{supp}(\mathbb{P}_{\alpha_{n}\mid\gamma_{n}(X^{n})}))=1$, this gives $\mathbb{P}_{\alpha_{n}\mid\gamma_{n}(X^{n})}(G(R_{n})=\alpha_{n})=1$, so also $\mathbb{P}(G(R_{n})=\alpha_{n})=1$. If $({F}(G(R_{n})^{+})-{F}(G(R_{n})))=\mu_{n}(\{g\in G_{n}:(g\alpha^{n})_{n}=\alpha_{n}\})=0$, set $\tilde{\theta}_{n}=0$. If $\mu_{n}(\{g\in G_{n}:(g\alpha^{n})_{n}=\alpha_{n}\})>0$, then it follows that $\mathbb{P}(\theta_{n}=(R_{n}-{F}(G(R_{n})))/({F}(G(R_{n})^{+})-{F}(G(R_{n}))))=1$, so set $\tilde{\theta}_{n}=(R_{n}-{F}(G(R_{n})))/({F}(G(R_{n})^{+})-{F}(G(R_{n})))$. Since $\alpha_{n}(\cdot,\gamma_{n}(X^{n}))$ is one-to-one by assumption, its inverse maps $\alpha_{n}$ to $X_{n}$. By Proposition 1, there also exists a map from $X_{n}$ and $\gamma_{n}(X^{n})$ to $\gamma_{n-1}(X^{n-1})$. At this point, we can repeat the procedure above to recover $X_{n-1}$ from $(R_{n-1},\gamma_{n-1}(X^{n-1}))$, from which we can then recover $\gamma_{n-2}(X^{n-2})$, etc. Together, all of the maps involved give the function $D_{n}$ as in the statement of the lemma.

A.4 Proof of Theorem 2

Proof 5 (Theorem 2)

We first show (6). Assume that $\tilde{P}$ is such that $R^{n}\perp\gamma_{n}(X^{n})$ for all $n$. Let $Q^{*}$ denote the distribution under which the marginal of $\gamma_{n}(X^{n})$ coincides with that under $P$, and such that $X^{n}\mid\gamma_{n}(X^{n})\stackrel{{\scriptstyle\mathcal{D}}}{{=}}U\gamma_{n}(X^{n})\mid\gamma_{n}(X^{n})$, where $U\sim\mu_{n}$ is uniform on $G_{n}$ and independent of $\gamma_{n}(X^{n})$. First note that

	$\displaystyle\tilde{Q}^{*}\left(\prod_{i=1}^{\tau}f_{i}(R_{i})=\frac{\mathrm{d}{P}}{\mathrm{d}{Q}^{*}}(X^{\tau})\right)$	$\displaystyle\geq\tilde{Q}^{*}\left(\forall t:\prod_{i=1}^{t}f_{i}(R_{i})=\frac{\mathrm{d}P}{\mathrm{d}Q^{*}}(X^{t})\right)$
		$\displaystyle=1-\tilde{Q}^{*}\left(\exists t:\prod_{i=1}^{t}f_{i}(R_{i})\neq\frac{\mathrm{d}{P}}{\mathrm{d}Q^{*}}(X^{t})\right)$
		$\displaystyle=1-\tilde{Q}^{*}\left(\bigcup_{t=1}^{\infty}\left\{\prod_{i=1}^{t}f_{i}(R_{i})\neq\frac{\mathrm{d}P}{\mathrm{d}{Q}^{*}}(X^{t})\right\}\right)$
		$\displaystyle\geq 1-\sum_{t=1}^{\infty}\tilde{Q}^{*}\left(\left\{\prod_{i=1}^{t}f_{i}(R_{i})\neq\frac{\mathrm{d}P}{\mathrm{d}Q^{*}}(X^{t})\right\}\right)=1.$

In the last line, the inequality is the union bound and the final equality follows from Lemma 1. Since the distribution of $X^{n}\mid\gamma_{n}(X^{n})$ has full support under $P$, we have that $\tilde{P}\ll\tilde{Q}^{*}$, so it also holds that $\tilde{P}\left(\prod_{i=1}^{\tau}f_{i}(R_{i})=\frac{\mathrm{d}{P}}{\mathrm{d}{Q}^{*}}(X^{\tau})\right)=1$. We have shown that $M_{\tau}$ is a modification of the likelihood ratio evaluated at $X^{\tau}$. We now show that the latter is optimal.

Denote $\ell_{n}=\frac{\mathrm{d}P}{\mathrm{d}Q^{*}}(X^{n})$ and let $f(\alpha)=\mathbf{E}_{\tilde{P}}\left[\ln((1-\alpha)\ell_{\tau}+\alpha E^{\prime}_{\tau})\right]$, a concave function of $\alpha\in[0,1]$. We will show that the derivative of $f$ at $0$ is nonpositive, which by concavity implies that $f$ attains its maximum at $\alpha=0$. This in turn implies our claim. Indeed,

	$\displaystyle f^{\prime}(0)$	$\displaystyle=\mathbf{E}_{\tilde{P}}\left[\frac{E^{\prime}_{\tau}-\ell_{\tau}}{\ell_{\tau}}\right]$
		$\displaystyle=\sum_{i=1}^{\infty}\mathbf{E}_{\tilde{P}}\left[\frac{E^{\prime}_{i}}{\ell_{i}}\mathbf{1}\left\{\tau=i\right\}\right]-1$
		$\displaystyle=\sum_{i=1}^{\infty}\mathbf{E}_{\tilde{Q}^{*}}\left[E^{\prime}_{i}\mathbf{1}\left\{\tau=i\right\}\right]-1$
		$\displaystyle=\mathbf{E}_{\tilde{Q}^{*}}\left[E^{\prime}_{\tau}\right]-1\leq 0,$

where we use that differentiation and integration can be interchanged, because

\left|\frac{\partial}{\partial\alpha}\ln((1-\alpha)\ell_{\tau}+\alpha E^{\prime}_{\tau})\right|=\left|\frac{E^{\prime}_{\tau}-\ell_{\tau}}{(1-\alpha)\ell_{\tau}+\alpha E^{\prime}_{\tau}}\right|\leq\max\left\{\frac{1}{1-\alpha},\frac{1}{\alpha}\right\},

so that the dominated convergence theorem is applicable. Finally, this gives that $\mathbf{E}_{\tilde{P}}\left[\ln\prod_{i=1}^{\tau}f_{i}(R_{i})\right]=\mathbf{E}_{\tilde{P}}\left[\ln\ell_{\tau}\right]\geq\mathbf{E}_{\tilde{P}}\left[\ln E^{\prime}_{\tau}\right]$. The proof of (5) follows from the same argument, but using $\ell^{\prime}_{n}=\frac{\mathrm{d}P}{\mathrm{d}Q^{*}}(R^{n})$.
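The bound displayed above, which is used to justify interchanging differentiation and integration, follows from a two-case argument. A short derivation is sketched here under the assumptions that $\ell_{\tau}\geq 0$ and $E^{\prime}_{\tau}\geq 0$ are not both zero and that $\alpha\in(0,1)$:

```latex
% Assumes \ell_\tau \ge 0, E'_\tau \ge 0 (not both zero) and \alpha \in (0,1).
\begin{align*}
\left|\frac{E'_\tau-\ell_\tau}{(1-\alpha)\ell_\tau+\alpha E'_\tau}\right|
  &\le \frac{\max\{E'_\tau,\ell_\tau\}}{(1-\alpha)\ell_\tau+\alpha E'_\tau},\\
\text{if } E'_\tau\ge\ell_\tau:\quad
  \frac{E'_\tau}{(1-\alpha)\ell_\tau+\alpha E'_\tau}
  &\le \frac{E'_\tau}{\alpha E'_\tau}=\frac{1}{\alpha},\\
\text{if } E'_\tau<\ell_\tau:\quad
  \frac{\ell_\tau}{(1-\alpha)\ell_\tau+\alpha E'_\tau}
  &\le \frac{\ell_\tau}{(1-\alpha)\ell_\tau}=\frac{1}{1-\alpha}.
\end{align*}
```

In each case the denominator is bounded below by dropping the nonnegative term that does not involve the larger of the two quantities.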

Appendix B Linear models and isotropy groups

The rotational symmetry described in Section 5.2 is symmetry around the origin, which we argued is equivalent to testing whether $X_{i}\sim\mathcal{N}(0,\sigma^{2})$ for some $\sigma\in\mathbb{R}^{+}$. Of course, there are many applications where it is not reasonable to assume that the data are zero-mean, and it is more interesting to test whether the data are spherically symmetric around some point other than the origin. One particular instance of such noncentered sphericity is to test whether, for each $n$, the data can be written as $X^{n}=\mu\mathbf{1}_{n}+\epsilon^{n}$, where $\mu\in\mathbb{R}$, the error $\epsilon^{n}$ is spherically symmetric and $\mathbf{1}_{n}$ is the $n$-vector of all ones. If $\mu$ is known, we can test for spherical symmetry of $X^{n}-\mu\mathbf{1}_{n}$ under $\mathrm{O}(n)$ and the problem reduces to that of the previous section. It is still possible to treat the more realistic case where $\mu$ is unknown, because the null model is still symmetric under a family of rotations. Notice the following: for any $O_{n}\in\mathrm{O}(n)$ it holds that $O_{n}X^{n}=\mu O_{n}\mathbf{1}_{n}+O_{n}\epsilon^{n}$. It follows that $X^{n}\stackrel{{\scriptstyle\mathcal{D}}}{{=}}O_{n}X^{n}$ whenever $O_{n}\mathbf{1}_{n}=\mathbf{1}_{n}$, whatever the value of $\mu$. That is, the null distribution of $X^{n}$ is invariant under the isotropy group of $\mathbf{1}_{n}$, i.e., $G_{n}=\{O_{n}\in\mathrm{O}(n):O_{n}\mathbf{1}_{n}=\mathbf{1}_{n}\}$. Invariance under the action of $G_{n}$ has previously appeared in the literature as centered spherical symmetry (Smith, 1981). Through the lens of test martingales, testing sequentially for centered spherical symmetry is equivalent to testing whether the data were generated by any Gaussian. This holds because any probability distribution on $\mathbb{R}^{\infty}$ for which the marginal of the first $n$ coordinates is centered spherically symmetric for every $n$ can be written as a mixture of Gaussians (Smith, 1981; Eaton, 1989, Theorem 8.13).

Using some geometry, a test is readily obtained. Note that we can write $X^{n}=X^{n}_{\mathbf{1}_{n}}+X^{n}_{\perp\mathbf{1}_{n}}$, where $X^{n}_{\mathbf{1}_{n}}=\frac{\langle X^{n},\mathbf{1}_{n}\rangle}{n}\mathbf{1}_{n}$ is the projection of $X^{n}$ onto the span of $\mathbf{1}_{n}$, and $X^{n}_{\perp\mathbf{1}_{n}}$ the projection onto its orthogonal complement. We have that $gX^{n}=X^{n}_{\mathbf{1}_{n}}+gX^{n}_{\perp\mathbf{1}_{n}}$ for any $g\in G_{n}$. Consequently, the orbit of $X^{n}$ under $G_{n}$ is given by the intersection of $S^{n-1}(\|X^{n}\|)$ and the hyperplane $H_{n}(X^{n})$ defined by $H_{n}(X^{n})=\{x^{\prime n}\in\mathbb{R}^{n}:\langle x^{\prime n},\mathbf{1}_{n}\rangle=\langle X^{n},\mathbf{1}_{n}\rangle\}$. There is a unique line that is perpendicular to $H_{n}(X^{n})$ and passes through the origin $0_{n}=(0,\dots,0)$; it intersects $H_{n}(X^{n})$ in the point $0_{H_{n}}:=\frac{\langle X^{n},\mathbf{1}_{n}\rangle}{n}\mathbf{1}_{n}$. For any $x^{\prime n}\in S^{n-1}(\|X^{n}\|)\cap H_{n}(X^{n})$, Pythagoras’ theorem gives that $\|x^{\prime n}-0_{H_{n}}\|^{2}=\|X^{n}\|^{2}-\|0_{H_{n}}-0_{n}\|^{2}$. In other words, $S^{n-1}(\|X^{n}\|)\cap H_{n}(X^{n})$ forms an $(n-2)$-dimensional sphere of radius $r_{n}:=(\|X^{n}\|^{2}-\|0_{H_{n}}-0_{n}\|^{2})^{1/2}$ around $0_{H_{n}}$. Consider the projection of this sphere on the $n$-th coordinate. Among the directions parallel to $H_{n}(X^{n})$, the $n$-th coordinate increases fastest along $e_{n}-\frac{1}{n}\mathbf{1}_{n}$, where $e_{n}=(0,\dots,0,1)$; this vector has norm $\sqrt{1-1/n}$. The highest possible value of the $n$-th coordinate is therefore $\frac{\langle X^{n},\mathbf{1}_{n}\rangle}{n}+r_{n}\sqrt{1-1/n}$, and the lowest value is $\frac{\langle X^{n},\mathbf{1}_{n}\rangle}{n}-r_{n}\sqrt{1-1/n}$. The relative value of $X_{n}$ is therefore given by $\tilde{X}_{n}:=(X_{n}-\frac{\langle X^{n},\mathbf{1}_{n}\rangle}{n})/\sqrt{1-1/n}$, which ranges over $[-r_{n},r_{n}]$. As a result, $R_{n}$ is the relative surface area of the $(n-2)$-dimensional hyper-spherical cap with co-latitude angle $\varphi=\pi-\cos^{-1}(\tilde{X}_{n}/(\|X^{n}\|^{2}-\|0_{H_{n}}-0_{n}\|^{2})^{1/2})$, so that equation (8) can again be used to determine $R_{n}$, with $n$ replaced by $n-1$. With this construction, we recover what Vovk (2023) refers to as the “full Gaussian model”, which is an online compression model that is defined in terms of the summary statistic $\sigma_{n}=(\langle X^{n},\mathbf{1}_{n}\rangle,\|X^{n}\|)$.
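A numerical sketch of the orbit rank for centered spherical symmetry is given below. It assumes that the cosine of the polar angle, measured from the direction of the $n$-th coordinate, is the centered last coordinate $X_{n}-\bar{X}$ divided by $r\sqrt{1-1/n}$, where $\bar{X}$ is the sample mean and $r$ is the radius of the orbit sphere. The cap area (computed by quadrature of the $\sin^{n-3}$ polar-angle density, for $n\geq 3$) is compared against a Monte Carlo draw from the uniform distribution on the orbit, so the assumed normalization is checked rather than taken on trust. All function names are hypothetical.

```python
import math
import random

def simpson(f, a, b, m=2000):
    """Composite Simpson rule with an even number m of subintervals."""
    h = (b - a) / m
    total = f(a) + f(b)
    for k in range(1, m):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

def centered_rank(x):
    """Cap-area rank of x[-1] on the orbit sphere; needs n >= 3 and non-constant x."""
    n = len(x)
    mean = sum(x) / n
    r = math.sqrt(sum((v - mean) ** 2 for v in x))        # radius of the orbit sphere
    c = (x[-1] - mean) / (r * math.sqrt(1.0 - 1.0 / n))    # assumed cosine of the polar angle
    theta0 = math.acos(max(-1.0, min(1.0, c)))             # clamp against rounding
    weight = lambda th: math.sin(th) ** (n - 3)            # polar-angle density on S^(n-2)
    return simpson(weight, theta0, math.pi) / simpson(weight, 0.0, math.pi)

def centered_rank_mc(x, rng, draws=20000):
    """Fraction of uniform orbit points whose last coordinate lies below x[-1]."""
    n = len(x)
    mean = sum(x) / n
    r = math.sqrt(sum((v - mean) ** 2 for v in x))
    hits = 0
    for _ in range(draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        zbar = sum(z) / n
        z = [v - zbar for v in z]                           # project onto the complement of 1_n
        scale = r / math.sqrt(sum(v * v for v in z))        # rescale to the orbit radius
        hits += int(mean + scale * z[-1] < x[-1])
    return hits / draws

x = [0.4, -1.0, 2.1, 0.3, 1.2]
print(centered_rank(x), centered_rank_mc(x, random.Random(0)))  # should agree up to Monte Carlo error
```

Centering and rescaling an isotropic Gaussian vector is one standard way to sample uniformly from a sphere inside the hyperplane orthogonal to $\mathbf{1}_{n}$.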

This model can be extended to the case in which there are covariates, i.e., $X_{n}=(Y_{n},Z^{d}_{n})$ for some $Y_{n}\in\mathbb{R}$ and $Z^{d}_{n}\in\mathbb{R}^{d}$. Denote by $Z_{n}$ the $n\times d$ matrix with row vectors $Z^{d}_{1},\dots,Z^{d}_{n}$ and, as is a standard assumption in regression, assume that $Z_{n}$ has full rank for every $n$. The model of interest is $Y^{n}=Z_{n}\beta+\epsilon^{n}$, where $\beta\in\mathbb{R}^{d}$ and $\epsilon^{n}$ is spherically symmetric for each $n$. Similar to the reasoning above, this model is invariant under the intersection of the isotropy groups of the column vectors of $Z_{n}$, i.e., $G_{n}=\{O_{n}\in\mathrm{O}(n):O_{n}Z_{n}=Z_{n}\}$. The orbit of $Y^{n}$ under $G_{n}$ is given by the intersection of $S^{n-1}(\|Y^{n}\|)$ with the intersection of the $d$ hyperplanes defined by the columns of $Z_{n}$, so that for $\alpha^{n}(Y^{n},Z_{n})=Y^{n}$, computing $R_{n}$ is analogous. Interestingly, however, it does not always hold that testing for invariance under $G_{n}$ is equivalent to testing for normality with mean $Z_{n}\beta$. A sufficient condition for the equivalence to hold is that $\lim_{n\to\infty}(Z_{n}^{\prime}Z_{n})^{-1}=0$, which is essentially the condition that the parameter vector $\beta$ can be consistently estimated by means of least squares (Eaton, 1989, Section 9.3).
