Machine Learning Predictors
for Min-Entropy Estimation

Javier Blanco-Romero (ORCID: https://orcid.org/0009-0004-0635-953X), Vicente Lorenzo (ORCID: https://orcid.org/0000-0003-2077-6095), Florina Almenares Mendoza (ORCID: https://orcid.org/0000-0002-5232-2031), Daniel Díaz-Sánchez (ORCID: https://orcid.org/0000-0002-3323-6453)

Javier Blanco-Romero is with the Department of Telematic Engineering, Universidad Carlos III de Madrid, Leganés, Madrid, 28911, Spain (e-mail: [email protected]). Vicente Lorenzo is with the Department of Telematic Engineering, Universidad Carlos III de Madrid, Leganés, Madrid, 28911, Spain (e-mail: [email protected]). Florina Almenares Mendoza is with the Department of Telematic Engineering, Universidad Carlos III de Madrid, Leganés, Madrid, 28911, Spain (e-mail: [email protected]). Daniel Díaz-Sánchez is with the Department of Telematic Engineering, Universidad Carlos III de Madrid, Leganés, Madrid, 28911, Spain (e-mail: [email protected]).

Abstract

This study investigates the application of machine learning predictors for min-entropy estimation in Random Number Generators (RNGs), a key component in cryptographic applications where accurate entropy assessment is essential for cybersecurity. Our research indicates that these predictors, and indeed any predictor that leverages sequence correlations, primarily estimate average min-entropy, a metric not extensively studied in this context. We explore the relationship between average min-entropy and the traditional min-entropy, focusing on their dependence on the number of target bits being predicted. Utilizing data from Generalized Binary Autoregressive Models, a subset of Markov processes, we demonstrate that machine learning models (including a hybrid of convolutional and recurrent Long Short-Term Memory layers and the transformer-based GPT-2 model) outperform traditional NIST SP 800-90B predictors in certain scenarios. Our findings underscore the importance of considering the number of target bits in min-entropy assessment for RNGs and highlight the potential of machine learning approaches in enhancing entropy estimation techniques for improved cryptographic security.

Index Terms:
Min-entropy Estimation, Machine Learning Predictors, Random Number Generators, Autoregressive Processes, Generalized Binary Autoregressive Models

I Introduction

The security of cryptographic systems often hinges on the generation of random values. Although there is a broad spectrum of algorithms and devices used to generate these random values, they are all generically denoted by Random Number Generators (RNGs). Given the important role that RNGs play in the context of cybersecurity, it becomes evident that rigorous criteria are necessary for evaluating the reliability and performance of an RNG.

Multiple approaches are commonly employed to assess the quality of the output of an RNG (cf. [bassham2010sp], [marsaglia2008marsaglia], [abbott2019experimentally], [calude2010experimental], [kavulich2021searching], [bird2020effects], etc.). In this paper the emphasis will be put on:

  • Entropy tests, such as those found in NIST Special Publication 800-90B [turan2018recommendation], which estimate the entropy of a noise source based on appropriate samples (cf. [abraham2022high], [islam2022using], [li2020jitter], etc.).

  • Machine Learning models trained with the output of an RNG aiming to guess the bit or set of bits that follow a given sequence, which can give an insight into how predictable the output of the RNG is (cf. [truong2018machine], [yang2018neural], [lv2020high], [li2020deep], [feng2020testing], [li2023improvement], etc.).

The fact that the entropy of a given source and the predictability of its output are correlated was already noticed by Shannon [shannon1951prediction]. Nevertheless, the link between these two concepts is far from being completely understood, especially given the heterogeneity of entropy definitions found in the literature and how strongly the predictability of the output of an entropy source depends on the predictor being considered. Building on the evidence provided by [kelsey2015predictive] that the entropy estimators considered by NIST Special Publication 800-90B [turan2018recommendation] tend to underestimate min-entropy, this paper seeks to reinforce the argument of [zhu2017analysis] that predictors are better suited to estimating average min-entropy [dodis2004fuzzy] than min-entropy. In the first stage, the theoretical framework required to support our thesis is developed. In the second stage, experimental validation of the theoretical analysis is conducted.

Whereas [kelsey2015predictive] concentrates on Ensemble, Categorical Data, and Numerical predictors, the focus of this paper will be on machine learning predictors. In particular, a hybrid model that integrates convolutional and recurrent Long Short-Term Memory (LSTM) layers and the transformer-based GPT-2 model will be considered. As in [zhu2017analysis], we generate sets of data for which a theoretical entropy can be calculated so that the machine learning entropy estimation can be compared to the theoretical value. Nevertheless, while the data generated in [zhu2017analysis] comes from an oscillator-based model and Markov processes of order at most 2, our data comes from Generalized Binary Autoregressive Models [jentsch2019generalized], a subclass of Markov chains that allows us to easily parameterize correlations and anticorrelations at the bit level and compute min-entropies.

Our research also investigates the influence of the number of target bits on the estimation of min-entropy. We demonstrate that the relationship between average min-entropy and min-entropy is significantly affected by the number of target bits being predicted. This finding highlights the importance of considering the target bit count when assessing the min-entropy of RNGs using machine learning predictors.

The remainder of this paper is structured as follows: Section II presents a literature review, discussing the current state of the art in the application of predictors for min-entropy estimation. In Section III, we establish the theoretical framework, where we study the concept of average min-entropy and its relationship with min-entropy, deriving a series of results for order-p Markov chains and gbAR(p). Section IV outlines our experimental methodology, aimed at validating the theoretical findings. Section V presents the results of our experiments, followed by Section VI, which offers a discussion of these results and their implications. Finally, Section VII concludes the paper with a summary of our findings and suggestions for future research.

I-A Notation and Conventions

The following notation and conventions will be considered throughout this paper:

  • Random Variables: Uppercase letters $X_1, X_2, A, \ldots$ represent random variables, while their corresponding realizations are represented by lowercase letters $x_1, x_2, a, \ldots$; by abuse of notation, $P(A=a)$ will be denoted by $P(a)$. Furthermore, by $X \sim B(n,p)$ we mean that $X$ is a random variable that follows a binomial distribution with number of trials $n$ and success probability $p$, and by $(X_1,\ldots,X_k) \sim \text{Mult}(n; p_1,\ldots,p_k)$ we mean that $(X_1,\ldots,X_k)$ is a multivariate random variable that follows a multinomial distribution with number of trials $n$ and probability vector $(p_1,\ldots,p_k)$.

  • Expected Values: The notation $\langle \cdot \rangle$ is used to indicate expected values. Given discrete random variables $X_{t-p}, \ldots, X_{t+n}$ where $t, p, n \in \mathbb{Z}$ and $p, n$ are non-negative, we will be particularly interested in the following type of expression:

    \[
    \begin{aligned}
    &\left\langle \max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n} \mid x_{t-1},\ldots,x_{t-p}) \right\rangle_{x_{t-1},\ldots,x_{t-p}} = \\
    &\sum_{x_{t-1},\ldots,x_{t-p}} P(x_{t-1},\ldots,x_{t-p}) \max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n} \mid x_{t-1},\ldots,x_{t-p}).
    \end{aligned}
    \]

  • Logarithms: All logarithmic functions are considered to be base 2 and are denoted by $\log$.
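As a worked instance of the expected-value notation above, consider the simplest case $p = 1$, $n = 0$ (one context bit, one target bit) for a binary process. The numerical values below are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```latex
\[
\left\langle \max_{x_t} P(x_t \mid x_{t-1}) \right\rangle_{x_{t-1}}
= P(x_{t-1}=0)\,\max\{P(0 \mid 0),\, P(1 \mid 0)\}
+ P(x_{t-1}=1)\,\max\{P(0 \mid 1),\, P(1 \mid 1)\}.
\]
% Hypothetical symmetric chain: P(0|0) = P(1|1) = 0.8 and P(x_{t-1}=0) = P(x_{t-1}=1) = 0.5
\[
\left\langle \max_{x_t} P(x_t \mid x_{t-1}) \right\rangle_{x_{t-1}}
= 0.5 \cdot 0.8 + 0.5 \cdot 0.8 = 0.8 .
\]
```

In this illustrative case an optimal predictor that sees the previous bit guesses correctly with probability 0.8, even though each bit taken on its own is unbiased.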

II State of the Art (Literature Review)

II-A Entropies

The relationship between entropy and the predictability of a sequence was first investigated by Shannon [shannon1951prediction], who noticed that the problem of prediction is fundamentally connected to the concept of entropy. Min-entropy, denoted as $H_\infty(X)$, represents the negative logarithm of the probability of a correct guess on the random variable $X$ under an optimal strategy [cachin1997entropy]. Mathematically, the probability of guessing the most likely output of an entropy source can be expressed as:

\[
2^{-H_\infty(X)} = \max_{x \in X} P_X(x).
\]

In cryptography, min-entropy is an important measure, as it provides a conservative estimate of the difficulty of guessing or predicting the most likely output of the entropy source, as emphasized in the NIST Recommendation [turan2018recommendation].
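The definition above can be illustrated with a minimal Python sketch. The function name `min_entropy` and the two example distributions are assumptions made here for illustration; they do not come from the paper.

```python
import math


def min_entropy(probabilities):
    """Return H_inf = -log2(max_x P(x)) for a discrete distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # The optimal single guess is the most likely outcome.
    return -math.log2(max(probabilities))


# Hypothetical sources: an unbiased bit and a bit biased towards one value.
uniform_bit = min_entropy([0.5, 0.5])
biased_bit = min_entropy([0.75, 0.25])
print(uniform_bit, biased_bit)
```

The unbiased bit yields one bit of min-entropy, while the biased bit yields less, since its most likely outcome can be guessed more often.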

Entropy estimation is complex when the output distribution is unknown and typical assumptions, such as outputs being independent and identically distributed (i.i.d.), do not apply. Reliable entropy estimation requires an understanding of the underlying nondeterministic process of the entropy source; statistical tests can only act as a sanity check on such an estimate [kelsey2015predictive].

In this context, the concept of predictors has been introduced. As described by Kelsey et al., a predictor contains a dynamic model that operates through a four-step process: 1) Assume a probability model for the source, 2) Estimate the model’s parameters from the input sequence on-the-fly, 3) Use these parameters to attempt to predict the still-unseen values in the input sequence, and 4) Estimate the min-entropy of the source from the performance of these predictions [kelsey2015predictive]. Unlike traditional machine learning methods, this approach is parametric and relies on a model of the underlying probability distribution. Another difference from traditional supervised learning methods, which separate training and testing sets, is that predictors remain in the training phase indefinitely, allowing for continuous adaptation and improvement in prediction accuracy. Predictors are characterized by two primary performance metrics. The first, global predictability, gauges the long-term accuracy of predictions. Specifically, a predictor’s global accuracy $p_{\text{acc}}$ represents the probability that it will correctly predict a given sample from a noise source over an extended sequence, effectively measuring the percentage of correct predictions. The second, local predictability, emphasizes the length of the longest streak of correct predictions, becoming important when the source produces highly predictable outputs in short spurts. The final entropy estimate for a predictor is determined by the lesser value between the global and local entropy estimates, represented by $\hat{H} = \min(\hat{H}_{\text{global}}, \hat{H}_{\text{local}})$.
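The global part of this procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the NIST reference implementation: the constant 2.576 (a 99% upper confidence bound on the accuracy) and the zero-accuracy fallback are assumed to mirror SP 800-90B, the function names are hypothetical, and the conversion of the longest run into a local probability (which requires a numerical procedure) is deliberately left out, so the local entropy estimate is taken as an input.

```python
import math


def global_entropy_estimate(predictions, observed, z=2.576):
    """Return (p_acc, H_global) from the fraction of correct predictions."""
    n = len(observed)
    if n < 2 or len(predictions) != n:
        raise ValueError("need two equally long sequences with n >= 2")
    correct = sum(p == o for p, o in zip(predictions, observed))
    p_acc = correct / n
    if p_acc == 0.0:
        # Assumed fallback so that the bound stays strictly positive.
        upper = 1.0 - 0.01 ** (1.0 / n)
    else:
        # Upper confidence bound on the accuracy, capped at 1.
        upper = min(1.0, p_acc + z * math.sqrt(p_acc * (1.0 - p_acc) / (n - 1)))
    return p_acc, -math.log2(upper)


def longest_correct_run(predictions, observed):
    """Length of the longest streak of consecutive correct predictions."""
    best = current = 0
    for p, o in zip(predictions, observed):
        current = current + 1 if p == o else 0
        best = max(best, current)
    return best


def combined_estimate(h_global, h_local):
    """Final estimate: the lesser of the global and local entropy estimates."""
    return min(h_global, h_local)


# Hypothetical run: 1000 samples, the first 750 predicted correctly.
observed = [0] * 1000
predictions = [0] * 750 + [1] * 250
p_acc, h_global = global_entropy_estimate(predictions, observed)
run = longest_correct_run(predictions, observed)
print(p_acc, h_global, run)
```

Because the confidence bound inflates the observed accuracy, the resulting global estimate is slightly lower than $-\log(p_{\text{acc}})$, which keeps the estimate conservative.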

Hence, predictors play a significant role in setting bounds on an attacker’s performance, linking predictability to min-entropy. For an account of how predictors were introduced into NIST SP 800-90B, see [lv2020high].

Zhu et al. examined the issue of underestimation in non-IID data pertaining to the NIST collision and compression test, proposing an enhanced method to address the underestimation of min-entropy [zhu2017analysis]. They introduced a novel formula specifically aimed at the high-order Markov process, founded on the principles of conditional probability. Furthermore, they highlighted that the correct prediction probability within a predictor can also be understood as a form of conditional probability.

Zhu’s min-entropy formula for the Markov process can be related to the concept of average min-entropy as defined by Dodis [dodis2004fuzzy]. Average min-entropy considers the predictability of a random variable given another possibly correlated random variable and can be expressed as:

\[
\tilde{H}_\infty(A \mid B) = -\log\left(\left\langle \max_a P(a \mid b) \right\rangle_b\right) = -\log\left(\left\langle 2^{-H_\infty(A \mid B=b)} \right\rangle_b\right),
\]

where

\[
H_\infty(A \mid B=b) = -\log\left(\max_a P(a \mid b)\right).
\]

Dodis’ definition of average min-entropy offers valuable insights into the logarithm of predictability, presenting it as a “worst-case” entropy measure [dodis2004fuzzy].
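The average min-entropy can be computed directly from a joint distribution, since $P(b)\max_a P(a \mid b) = \max_a P(a, b)$. The following Python sketch does so for a hypothetical pair of correlated bits; the joint table and the function names are illustrative assumptions, not data from the paper.

```python
import math


def average_min_entropy(joint):
    """joint[b][a] = P(A=a, B=b). Returns -log2(sum_b max_a P(a, b))."""
    # sum_b P(b) * max_a P(a | b) equals sum_b max_a P(a, b).
    guess_probability = sum(max(row.values()) for row in joint.values())
    return -math.log2(guess_probability)


def marginal_min_entropy(joint):
    """Min-entropy of A alone, ignoring the side information B."""
    outcomes = {a for row in joint.values() for a in row}
    marginal = {a: sum(row.get(a, 0.0) for row in joint.values()) for a in outcomes}
    return -math.log2(max(marginal.values()))


# Hypothetical joint distribution of two correlated bits (A tends to equal B).
joint = {
    0: {0: 0.4, 1: 0.1},
    1: {0: 0.1, 1: 0.4},
}
h_avg = average_min_entropy(joint)
h_marginal = marginal_min_entropy(joint)
print(h_avg, h_marginal)
```

In this example A is unbiased on its own (one bit of min-entropy), yet knowing B raises the guessing probability to 0.8, so the average min-entropy drops to about 0.32 bits.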

Several works have considered the problem of designing machine learning predictors for evaluating RNGs. Here, we examine some of the most relevant contributions in relation to min-entropy estimation.

Truong et al. [truong2018machine] introduced the use of a recurrent convolutional neural network (RCNN) to analyze quantum random number generators (QRNGs). This RCNN model was employed to evaluate different stages of an optical continuous variable QRNG. The study focused on detecting inherent correlations, particularly under the influence of deterministic classical noise. Their methodology included a comprehensive analysis, from examining the robustness of QRNGs against machine learning attacks to benchmarking with a congruential pseudo-random number generator (CRNG). Their model’s prediction accuracy was compared with the guessing probability of the data distribution, effectively entailing a comparison with the min-entropy.

Yang et al. and Lv et al. explored neural network-based min-entropy estimation for random number generators [yang2018neural], [lv2020high]. Their approach involved training predictive models on simulated data, where the correct entropy was ascertainable due to the known output distributions. Additionally, their study included a performance analysis and comparison of their results with the NIST SP 800-90B’s predictors, providing a detailed examination of the efficacy and accuracy of their neural network-based approach in entropy estimation.

Li et al. [li2020deep] proposed a deep learning-based predictive analysis to assess the security of a non-deterministic random number generator (NRNG) using white chaos. They employed a temporal pattern attention (TPA)-based deep learning model to analyze data from both the chaotic external-cavity semiconductor laser (ECL) stage and the final output of the NRNG. The model effectively detected correlations in the ECL stage, but not in the post-processed output, suggesting the NRNG’s resistance to predictive modeling. Prior to this, the model’s predictive power was validated on a linear congruential algorithm-based RNG. The study also compared the model’s prediction accuracy with the baseline probability, aligning with Truong et al.’s approach of using the guessing probability as a comparative metric for min-entropy estimation.

Finally, Haohao Li et al. [li2023improvement] proposed a method for min-entropy evaluation using a pruned and quantized deep neural network. They developed a temporal pattern attention-based long short-term memory (TPA-LSTM) model, which they then optimized through pruning and quantization. This optimized model was retrained and tested on various simulated datasets with known min-entropy values. Their results demonstrated greater accuracy in min-entropy estimation compared to NIST SP 800-90B predictors. This study also investigated why NIST predictors often underestimate min-entropy, attributing it to the sensitivity of local predictability probability to parameter variations. This work parallels Yang et al. and Lv et al.’s in comparing neural network-based min-entropy estimations with NIST SP 800-90B’s predictors.

II-B Autoregressive Inference and Multi-Token Prediction Strategies

In autoregressive inference, various sampling strategies can be employed to generate sequences, such as greedy decoding [radford2019language], beam search [vijayakumar2016diverse, shao2017generating], top-k sampling [fan2018hierarchical, radford2019language] or top-p/nucleus sampling [holtzman2019curious], with or without temperature-based sampling techniques [ackley1985learning]. However, these techniques may not always yield the globally optimal sequence, as they rely on local decisions at each step.

Our goal is to approximate the global maximum probability for the complete sequence of $n$ bits, given the previous $p$ bits:

\[
\max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n} \mid x_{t-1},\ldots,x_{t-p}). \tag{1}
\]

To illustrate the potential limitations of autoregressive inference strategies, let us consider greedy decoding as an example. Greedy decoding selects the most probable bit at each step, conditioned on the previously generated bits. This can be expressed as:

\[
\prod_{k=t}^{t+n} \max_{x_k} P(x_k \mid x_{k-1},\ldots,x_{k-p}) \Bigg\rvert_{\begin{array}{c} x_{k-i} = \arg\max_{x_{k-i}} P(x_{k-i} \mid x_{k-i-1},\ldots,x_{k-i-p}), \\ \forall i \in [1, k \leq p] \end{array}} .
\]

However, the product of the maximum conditional probabilities at each step does not necessarily equal the global maximum probability over the entire sequence. In other words, the greedy decoding approach may lead to suboptimal sequences, as it does not consider the joint probability of the complete sequence.

While other search methods, such as beam search, top-k sampling, or top-p/nucleus sampling, can perform better than greedy decoding, they still face the same fundamental challenge. Ultimately, the effectiveness of these methods in approximating the global maximum depends on the data and the search space. As the sequence length $n$ increases, the search space grows exponentially, making it increasingly difficult to find the globally optimal sequence efficiently.
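The gap between greedy decoding and the global maximum can be seen on a toy example. The Python sketch below assumes a hypothetical order-1 binary Markov chain; the transition table `TRANSITIONS` and the function names are illustrative and not taken from the paper. It compares the greedy choice with an exhaustive search over all candidate sequences.

```python
from itertools import product

# Hypothetical transition probabilities: TRANSITIONS[prev][nxt] = P(nxt | prev).
TRANSITIONS = {
    0: {0: 0.6, 1: 0.4},
    1: {0: 0.05, 1: 0.95},
}


def sequence_probability(seq, prev):
    """Probability of the bit sequence `seq` given the preceding bit `prev`."""
    prob = 1.0
    for bit in seq:
        prob *= TRANSITIONS[prev][bit]
        prev = bit
    return prob


def greedy_decode(prev, length):
    """Pick the most probable bit at each step, conditioned on earlier picks."""
    seq = []
    for _ in range(length):
        bit = max((0, 1), key=lambda b: TRANSITIONS[prev][b])
        seq.append(bit)
        prev = bit
    return tuple(seq)


def exhaustive_decode(prev, length):
    """Search all 2**length sequences for the global maximum probability."""
    return max(product((0, 1), repeat=length),
               key=lambda s: sequence_probability(s, prev))


greedy_seq = greedy_decode(0, 2)
best_seq = exhaustive_decode(0, 2)
greedy_prob = sequence_probability(greedy_seq, 0)
best_prob = sequence_probability(best_seq, 0)
print(greedy_seq, greedy_prob, best_seq, best_prob)
```

Starting from a 0, greedy decoding commits to another 0 and reaches a probability of about 0.36, whereas the sequence (1, 1) attains about 0.38. The exhaustive search is only feasible here because the number of candidates, `2**length`, is tiny; it grows exponentially with the number of target bits.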

Recently, incorporating future information into language generation tasks has gained attention. Li et al. (2017) [li2017learning] proposed an actor-critic model that integrates a value function to estimate future success, combining MLE-based learning with an RL-based value function during decoding. Oord et al. (2018) [oord2018representation] aimed to preserve mutual information between context and future tokens by modeling a density ratio, rather than directly predicting future tokens. Serdyuk et al. (2018) [serdyuk2017twin] addressed the challenge of long-term dependency learning in RNNs by running forward and backward RNNs in parallel to better capture future information. Lawrence et al. (2019) [lawrence2019attending] trained an encoder by concatenating source and target sequences and using placeholder tokens in the target sequence, which are replaced during inference to generate the final output. These advancements illustrate the growing interest in and potential for optimizing future token predictions in natural language processing tasks.

Qi et al. (2020) [qi2020prophetnet] introduce ProphetNet, a sequence-to-sequence pre-training model that employs a novel self-supervised objective called future n-gram prediction and an n-stream self-attention mechanism. Unlike traditional sequence-to-sequence models that optimize one-step-ahead prediction, ProphetNet predicts the next $n$ tokens simultaneously based on previous context tokens at each time step. This approach explicitly encourages the model to plan for future tokens and prevents overfitting on strong local correlations. The authors pre-train ProphetNet using base and large-scale datasets and demonstrate state-of-the-art results on abstractive summarization and question generation tasks.

Recent works have explored the use of multi-token prediction to improve the efficiency and performance of large language models. Gloeckle et al. propose a memory-efficient implementation and demonstrate the effectiveness of this approach on various tasks, showcasing strong performance on summarization, speeding up inference by a factor of 3×, and promoting the learning of longer-term patterns [gloeckle2024better]. This method has been shown to improve sample efficiency, downstream capabilities, and inference speed, especially for larger model sizes and generative benchmarks like coding. Similarly, Stern et al. [stern2018blockwise] and Cai et al. [cai2024medusa] introduce methods that augment LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Cai et al. refine the concept introduced by Stern et al. and propose MEDUSA, which uses a tree-based attention mechanism to construct and verify multiple candidate continuations simultaneously. While all three approaches leverage multi-token prediction, Gloeckle et al. focus on the effects of such a loss during pretraining, whereas Stern et al. and Cai et al. propose model finetunings for faster inference without studying the pretraining effects [gloeckle2024better].

III Theoretical Framework

In this section, we establish the theoretical framework for our study. We focus on investigating the concept of average min-entropy and its relationship with min-entropy, particularly within the context of gbAR(p) models. The proofs and auxiliary results supporting our findings can be found in the Appendix.

III-A Entropies

The different entropies that will be considered throughout this paper are gathered in the following:

Definition 1.

Let $\{X_t\}_{t \in \mathbb{Z}}$ be a stochastic process with discrete state-space and let $(X_{t-p}, \ldots, X_{t-1}, X_t, \ldots, X_{t+n})$ be a subset of $\{X_t\}_{t \in \mathbb{Z}}$ where $t, p, n \in \mathbb{Z}$ and $p, n$ are non-negative. Then:

  1. a)

    The min-entropy of (Xt,,Xt+n)subscript𝑋𝑡subscript𝑋𝑡𝑛(X_{t},\ldots,X_{t+n})( italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_t + italic_n end_POSTSUBSCRIPT ) is:

    H(Xt,,Xt+n)=log[maxxt,,xt+nP(xt,,xt+n)].subscript𝐻subscript𝑋𝑡subscript𝑋𝑡𝑛subscriptsubscript𝑥𝑡subscript𝑥𝑡𝑛𝑃subscript𝑥𝑡subscript𝑥𝑡𝑛H_{\infty}(X_{t},\ldots,X_{t+n})=-\log\left[\max_{x_{t},\ldots,x_{t+n}}P(x_{t}% ,\ldots,x_{t+n})\right].italic_H start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_t + italic_n end_POSTSUBSCRIPT ) = - roman_log [ roman_max start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_t + italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_P ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_t + italic_n end_POSTSUBSCRIPT ) ] .

  2. b)

    The min-entropy per bit of (Xt,,Xt+n)subscript𝑋𝑡subscript𝑋𝑡𝑛(X_{t},\ldots,X_{t+n})( italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_t + italic_n end_POSTSUBSCRIPT ) is:

    h(Xt,,Xt+n)=1n+1H(Xt,,Xt+n).subscriptsubscript𝑋𝑡subscript𝑋𝑡𝑛1𝑛1subscript𝐻subscript𝑋𝑡subscript𝑋𝑡𝑛h_{\infty}(X_{t},\ldots,X_{t+n})=\frac{1}{n+1}H_{\infty}(X_{t},\ldots,X_{t+n}).italic_h start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_t + italic_n end_POSTSUBSCRIPT ) = divide start_ARG 1 end_ARG start_ARG italic_n + 1 end_ARG italic_H start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_t + italic_n end_POSTSUBSCRIPT ) .
  3. c)

    The min-entropy per bit of {Xt}tsubscriptsubscript𝑋𝑡𝑡\{X_{t}\}_{t\in\mathbb{Z}}{ italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t ∈ blackboard_Z end_POSTSUBSCRIPT is:

    h({Xt}t)=limk1k+1log[maxxt,tk/2P({xt}tk/2)],k.subscriptsubscriptsubscript𝑋𝑡𝑡subscript𝑘1𝑘1subscriptsubscript𝑥𝑡delimited-∣∣𝑡delimited-∣∣𝑘2𝑃subscriptsubscript𝑥𝑡delimited-∣∣𝑡delimited-∣∣𝑘2𝑘\begin{aligned} h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=-\lim_{k\to\infty}\frac% {1}{k+1}\log\left[\max_{x_{t},\mid t\mid\leq\mid k/2\mid}P(\{x_{t}\}_{\mid t% \mid\leq\mid k/2\mid})\right]\ ,\\ k\in\mathbb{Z}.\end{aligned}start_ROW start_CELL italic_h start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ( { italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t ∈ blackboard_Z end_POSTSUBSCRIPT ) = - roman_lim start_POSTSUBSCRIPT italic_k → ∞ end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG italic_k + 1 end_ARG roman_log [ roman_max start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , ∣ italic_t ∣ ≤ ∣ italic_k / 2 ∣ end_POSTSUBSCRIPT italic_P ( { italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT ∣ italic_t ∣ ≤ ∣ italic_k / 2 ∣ end_POSTSUBSCRIPT ) ] , end_CELL end_ROW start_ROW start_CELL italic_k ∈ blackboard_Z . end_CELL end_ROW

  4. d)

    The worst-case min-entropy of $(X_t,\ldots,X_{t+n})$ is:

    \[ H_\infty(X_t,\ldots,X_{t+n} \mid X_{t-1},\ldots,X_{t-p}) = -\log\left[\max_{x_{t-p},\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n} \mid x_{t-1},\ldots,x_{t-p})\right]. \]
  5. e)

    The worst-case min-entropy per bit of $(X_t,\ldots,X_{t+n})$ is:

    \[ h_\infty(X_t,\ldots,X_{t+n} \mid X_{t-1},\ldots,X_{t-p}) = \frac{1}{n+1} H_\infty(X_t,\ldots,X_{t+n} \mid X_{t-1},\ldots,X_{t-p}). \]
  6. f)

    The average min-entropy of $(X_t,\ldots,X_{t+n})$ is:

    \[
    \begin{aligned}
    \tilde{H}_\infty(X_t,\ldots,X_{t+n} \mid X_{t-1},\ldots,X_{t-p})
    &= -\log\left( \left\langle \max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n} \mid x_{t-1},\ldots,x_{t-p}) \right\rangle_{x_{t-1},\ldots,x_{t-p}} \right) \\
    &= -\log\left[ \sum_{x_{t-1},\ldots,x_{t-p}} P(x_{t-1},\ldots,x_{t-p}) \max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n} \mid x_{t-1},\ldots,x_{t-p}) \right].
    \end{aligned}
    \]

  7. g)

    The average min-entropy per bit of $(X_t,\ldots,X_{t+n})$ is:

    \[ \tilde{h}_\infty(X_t,\ldots,X_{t+n} \mid X_{t-1},\ldots,X_{t-p}) = \frac{1}{n+1} \tilde{H}_\infty(X_t,\ldots,X_{t+n} \mid X_{t-1},\ldots,X_{t-p}). \]
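The quantities in items a) to g) can be made concrete on a toy example. The following is a minimal sketch, assuming a hypothetical first-order ($p=1$) binary chain whose transition table `T` and stationary distribution `PI` are chosen purely for illustration; it evaluates the min-entropy, the average min-entropy and the worst-case min-entropy of $n+1$ target bits by brute-force enumeration, with logarithms taken in base 2.

```python
from itertools import product
from math import log2

# Hypothetical first-order (p = 1) binary chain, chosen only for illustration:
# T[prev][nxt] = P(X_t = nxt | X_{t-1} = prev).
T = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
# Stationary distribution of T (it satisfies pi = pi T).
PI = {0: 0.8, 1: 0.2}


def cond_prob(block, prev):
    """P(x_t, ..., x_{t+n} | X_{t-1} = prev), computed with the chain rule."""
    prob = 1.0
    for x in block:
        prob *= T[prev][x]
        prev = x
    return prob


def block_entropies(n):
    """Return (H_inf, H_avg, H_worst) in bits for the n + 1 target bits."""
    blocks = list(product((0, 1), repeat=n + 1))
    # Largest conditional block probability for each possible context.
    best_given = {s: max(cond_prob(b, s) for b in blocks) for s in (0, 1)}
    # Largest unconditional block probability (context marginalised out).
    best_joint = max(sum(PI[s] * cond_prob(b, s) for s in (0, 1)) for b in blocks)
    h_min = -log2(best_joint)
    h_avg = -log2(sum(PI[s] * best_given[s] for s in (0, 1)))
    h_worst = -log2(max(best_given.values()))
    return h_min, h_avg, h_worst


if __name__ == "__main__":
    for n in range(4):
        per_bit = [h / (n + 1) for h in block_entropies(n)]
        print(n, [round(h, 4) for h in per_bit])
```

Dividing each returned value by $n+1$ gives the corresponding per-bit quantity. Enumeration costs $2^{n+1}$ evaluations, so this sketch is only practical for small $n$.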
Remark 2.

When determining the min-entropy of an entire binary stochastic process $\{X_t\}_{t\in\mathbb{Z}}$, the direct evaluation

\[ H_\infty(\{X_t\}_{t\in\mathbb{Z}}) = -\log\left[\max_{x_t,\ t\in\mathbb{Z}} P(\{x_t\}_{t\in\mathbb{Z}})\right] \]

can lead to undefined behaviour. Indeed, if we write this as the limit

\[ H_\infty(\{X_t\}_{t\in\mathbb{Z}}) = \lim_{k\to\infty} H_\infty(\{X_t\}_{|t|\le|k/2|}), \quad k \text{ even} \]

the maximum probability, as it decays with $k$, is bounded below by $\frac{1}{2^{k+1}}$, the value attained by uniform noise, so in that case the limit

\[ H_\infty(\{X_t\}_{t\in\mathbb{Z}}) = -\lim_{k\to\infty} \log\left[\frac{1}{2^{k+1}}\right] = 1 + \lim_{k\to\infty} k \]

diverges, growing as $\sim k$ with the number of elements $k$. By contrast, the normalized limit

\[ h_\infty(\{X_t\}_{t\in\mathbb{Z}}) = -\lim_{k\to\infty} \frac{1}{k+1} \log\left[\max_{x_t,\ |t|\le|k/2|} P(\{x_t\}_{|t|\le|k/2|})\right], \quad k\in\mathbb{Z} \]

is bounded by 1 for all $\{X_t\}_{t\in\mathbb{Z}}$. For this reason we refer to it as the min-entropy per bit of the stochastic process $\{X_t\}_{t\in\mathbb{Z}}$.
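The contrast between the diverging total and the bounded per-bit value can be checked numerically. The sketch below assumes the simplest possible source, independent and identically distributed bits with $P(X=1)=q$, which is an illustrative assumption rather than a model used in the paper; the function names are likewise hypothetical.

```python
from math import log2


def block_min_entropy(q, length):
    """H_inf, in bits, of `length` i.i.d. bits with P(X = 1) = q.

    The likeliest block repeats the likeliest bit, so its probability
    is max(q, 1 - q) ** length.
    """
    return -length * log2(max(q, 1.0 - q))


def min_entropy_per_bit(q, length):
    """Normalised version: stays bounded by 1 whatever the length."""
    return block_min_entropy(q, length) / length


if __name__ == "__main__":
    for length in (1, 10, 100, 1000):
        print(length, block_min_entropy(0.5, length), min_entropy_per_bit(0.5, length))
```

For uniform noise ($q=0.5$) the total equals the block length, while the per-bit value stays at 1.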

III-B Order-p Markov Chains

Let us begin by defining order-$p$ Markov chains:

Definition 3 (cf. [ching2006markov], [raftery1985model]).

An order-$p$ Markov chain is a stochastic process $\{X_t\}_{t\in\mathbb{Z}}$ with discrete state-space $S$ such that:

\[ P(x_{t_0} \mid \{x_t\}_{t<t_0}) = P(x_{t_0} \mid x_{t_0-1},\ldots,x_{t_0-p}) \]

for every $t_0\in\mathbb{Z}$ and every $x_t\in S$ such that $t\le t_0$.

The aim of this subsection is to define special types of Markov chains and to prove formulas for the entropies of Definition 1 that apply to them. Although the experiments performed in this paper mostly involve Generalized Binary Autoregressive Models (see Definition 14 below), we also explore other types of Markov chains that are connected to them (see Figure 3). These chains share certain properties with Generalized Binary Autoregressive Models, and entropy formulas that are interesting in their own right can be derived for them with relatively little additional effort.

Definition 4.

Let $\{X_t\}_{t\in\mathbb{Z}}$ be an order-$p$ Markov chain with state-space $S$. Then:

  1. i)

    $\{X_t\}_{t\in\mathbb{Z}}$ is said to be binary if $S=\{0,1\}$.

  2. ii)

    $\{X_t\}_{t\in\mathbb{Z}}$ is said to be stationary if

    \[ P(X_{t_1}=x_1,\ldots,X_{t_n}=x_n) = P(X_{t_1+\tau}=x_1,\ldots,X_{t_n+\tau}=x_n) \]

    for every $\tau,t_1,\ldots,t_n\in\mathbb{Z}$, every $x_1,\ldots,x_n\in S$ and every positive integer $n$.

  3. iii)

    $\{X_t\}_{t\in\mathbb{Z}}$ is said to have lag-$p$ point-to-point correlations if

    \[ P(x_t \mid x_{t-1},\ldots,x_{t-p}) = P(x_t \mid x_{t-p}) \text{ for every } t\in\mathbb{Z}. \]
  4. iv)

    $\{X_t\}_{t\in\mathbb{Z}}$ is said to be irreducible if it is stationary, $S$ is finite and for every $x,x_1,\ldots,x_p\in S$ there exists a non-negative integer $k$ such that

    \[ P(X_{t+k}=x \mid X_{t-1}=x_1,\ldots,X_{t-p}=x_p) > 0. \]
  5. v)

    $\{X_t\}_{t\in\mathbb{Z}}$ is said to be aperiodic if it is stationary, $S$ is finite and for every $x\in S$,

    \[ \gcd\{n\ge 1 : P(X_{t+n}=x \mid X_t=x) > 0\} = 1. \]
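Stationarity (item ii) is what makes the context probability $P(x_{t-1},\ldots,x_{t-p})$ in the average min-entropy independent of $t$. A minimal sketch of how that distribution might be obtained numerically for a binary order-$p$ chain is given below. It assumes the chain is irreducible and aperiodic so that power iteration converges; the function name, the callable `p1` and the iteration count are illustrative choices, not part of the paper.

```python
from itertools import product


def stationary_context_distribution(p1, p, iters=2000):
    """Approximate P(x_{t-1}, ..., x_{t-p}) for a binary order-p chain.

    p1(ctx) must return P(X_t = 1 | ctx), where ctx is the tuple
    (x_{t-1}, ..., x_{t-p}). Power iteration is run on the 2**p contexts.
    """
    contexts = list(product((0, 1), repeat=p))
    dist = {c: 1.0 / len(contexts) for c in contexts}
    for _ in range(iters):
        new = dict.fromkeys(contexts, 0.0)
        for ctx, mass in dist.items():
            q = p1(ctx)
            for x, prob in ((1, q), (0, 1.0 - q)):
                # Emitting x shifts the window: the oldest bit drops out.
                new[(x,) + ctx[:-1]] += mass * prob
        dist = new
    return dist


if __name__ == "__main__":
    # Hypothetical order-2 chain: P(X_t = 1 | x_{t-1}, x_{t-2}).
    table = {(0, 0): 0.1, (1, 0): 0.6, (0, 1): 0.4, (1, 1): 0.9}
    print(stationary_context_distribution(table.get, p=2))
```

For $p=1$ the result can be compared with the closed form $P(X=1)=a/(a+b)$, where $a=P(1\mid 0)$ and $b=P(0\mid 1)$.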
Remark 5.

Note that if $\{X_t\}_{t\in\mathbb{Z}}$ is a stationary order-$p$ Markov chain then:

\[ h_\infty(\{X_t\}_{t\in\mathbb{Z}}) = \lim_{n\to\infty} \frac{1}{n+1} H_\infty(X_t,\ldots,X_{t+n}). \]

III-B1 Some Min-Entropy Inequalities for Order-p Markov Chains

Here we establish several inequalities involving min-entropy, average min-entropy, and worst-case min-entropy.

We start by noting that for a fixed n𝑛nitalic_n, the following inequality between min-entropy, average min-entropy, and worst-case min-entropy holds.

Lemma 6.

Let $\{X_t\}_{t\in\mathbb{Z}}$ be an order-$p$ Markov chain. Then:

\[
\begin{aligned}
h_\infty(X_t,\ldots,X_{t+n}) &\ge \tilde{h}_\infty(X_t,\ldots,X_{t+n} \mid X_{t-1},\ldots,X_{t-p}) \\
&\ge h_\infty(X_t,\ldots,X_{t+n} \mid X_{t-1},\ldots,X_{t-p}).
\end{aligned}
\]
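One way to see why this chain of inequalities holds, offered as a sketch that need not match the authors' own proof, is to abbreviate the target block as $x=(x_t,\ldots,x_{t+n})$ and the context as $c=(x_{t-1},\ldots,x_{t-p})$; both are shorthand introduced only here. The law of total probability, followed by two bounds on the maximum, gives

\[
\max_x P(x) = \max_x \sum_c P(c)\,P(x \mid c) \le \sum_c P(c) \max_x P(x \mid c) \le \max_{c,\,x} P(x \mid c).
\]

Since $-\frac{1}{n+1}\log(\cdot)$ is decreasing, applying it to the three quantities reverses the order and yields, from left to right, the min-entropy per bit, the average min-entropy per bit and the worst-case min-entropy per bit.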

We conclude this part of Section III-B1 with a result regarding order-$p$ Markov chains that establishes a form of monotonicity for their average min-entropy.

Lemma 7.

Let $\{X_t\}_{t\in\mathbb{Z}}$ be an order-$p$ Markov chain. Then

\[ \tilde{H}_\infty(X_t,\ldots,X_{t+n} \mid X_{t-1},\ldots,X_{t-p}) \le \tilde{H}_\infty(X_t,\ldots,X_{t+n+m} \mid X_{t-1},\ldots,X_{t-p}). \]

This lemma establishes that the average min-entropy of an order-$p$ Markov chain is non-decreasing as the length of the future sequence, $n$, increases. This property reflects the intuitive notion that the uncertainty about future states cannot decrease when considering longer future sequences. This result will be particularly useful later when we discuss an interesting property of Generalized Binary Autoregressive Models (see Remark 19).
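The monotonicity can be sketched in one step, using the shorthand $c=(x_{t-1},\ldots,x_{t-p})$ for the context, which is introduced here for illustration and is not notation from the paper. For any block and any context, the chain rule gives

\[
P(x_t,\ldots,x_{t+n+m} \mid c) = P(x_t,\ldots,x_{t+n} \mid c)\,P(x_{t+n+1},\ldots,x_{t+n+m} \mid x_t,\ldots,x_{t+n},c) \le P(x_t,\ldots,x_{t+n} \mid c),
\]

so for every $c$ the maximum over the longer block cannot exceed the maximum over the shorter one. Averaging over $P(c)$ preserves the inequality, and taking $-\log$ reverses it. Note that this argument concerns the total $\tilde{H}_\infty$ and says nothing about the per-bit quantity $\tilde{h}_\infty$.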

III-B2 Convergence Theorem for the Min-Entropy and Average Min-Entropy of Order-$p$ Markov Chains

The purpose of the following results, which culminate in Theorem 9 below, is to establish conditions under which an asymptotic equivalence between the average min-entropy and the min-entropy of an order-$p$ Markov chain can be guaranteed.

Theorem 8 (Convergence Theorem [bozorgmanesh2016convergence]).

Let $\{X_t\}_{t\in\mathbb{Z}}$ be an irreducible and aperiodic stationary order-$p$ Markov chain with finite state-space $S$. Then for every $x,x_{t-1},\ldots,x_{t-p}\in S$:

\[ \lim_{n\to\infty} P(X_{t+n}=x \mid X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p}) = P(X_t=x). \]

Building upon this theorem, we can now establish the asymptotic equivalence between the min-entropy and the average min-entropy for order-p Markov chains satisfying certain conditions.

Theorem 9.

If $\{X_t\}_{t\in\mathbb{Z}}$ satisfies the hypothesis of the Convergence Theorem, i.e. $\{X_t\}_{t\in\mathbb{Z}}$ is an irreducible and aperiodic stationary order-$p$ Markov chain with finite state-space, then

\[
\begin{aligned}
h_\infty(\{X_t\}_{t\in\mathbb{Z}}) &= \lim_{n\to\infty} \tilde{h}_\infty(X_t,\ldots,X_{t+n} \mid X_{t-1},\ldots,X_{t-p}) \\
&= \lim_{n\to\infty} h_\infty(X_t,\ldots,X_{t+n} \mid X_{t-1},\ldots,X_{t-p}).
\end{aligned}
\]

This theorem shows that having conditional information about the process provides no advantage asymptotically under the stated conditions, as the min-entropy and the average min-entropy converge to the same value.
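The convergence can be observed numerically. The sketch below assumes a hypothetical first-order binary chain (transition table `T` with stationary distribution `PI`, both chosen only for illustration) that is irreducible and aperiodic. Instead of enumerating all blocks, it uses a max-product recursion, in the spirit of the Viterbi algorithm, to obtain the largest block probabilities, so large $n$ is cheap. Logarithms are in base 2.

```python
from math import log2

# Hypothetical first-order binary chain: T[prev][nxt] = P(nxt | prev).
T = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
# Stationary distribution of T (it satisfies pi = pi T).
PI = {0: 0.8, 1: 0.2}


def max_block_prob(steps):
    """m[s] = largest probability of any `steps`-bit continuation after bit s."""
    m = {0: 1.0, 1: 1.0}
    for _ in range(steps):
        m = {s: max(T[s][x] * m[x] for x in (0, 1)) for s in (0, 1)}
    return m


def per_bit_entropies(n):
    """Return per-bit (min-entropy, average, worst-case) for n + 1 target bits."""
    length = n + 1
    cond = max_block_prob(length)  # best (x_t..x_{t+n}) given x_{t-1}
    tail = max_block_prob(n)       # best (x_{t+1}..x_{t+n}) given x_t
    h_min = -log2(max(PI[x] * tail[x] for x in (0, 1))) / length
    h_avg = -log2(sum(PI[s] * cond[s] for s in (0, 1))) / length
    h_worst = -log2(max(cond.values())) / length
    return h_min, h_avg, h_worst


if __name__ == "__main__":
    for n in (0, 1, 10, 100, 400):
        print(n, [round(h, 4) for h in per_bit_entropies(n)])
```

For this particular table the likeliest long block is the all-zeros sequence, so the three per-bit values approach $-\log_2 0.9 \approx 0.152$ as $n$ grows. Probabilities are multiplied directly, so very large $n$ would eventually underflow; a log-domain recursion would avoid that.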

III-B3 State-Independent Maximum Transition Probability and Bitflip Symmetric Order-$p$ Markov Chains

This section introduces two related classes of Markov chains: State-Independent Maximum Transition Probability (SIMTP) and Bitflip Symmetric Order-$p$ Markov Chains. We investigate the properties of these chains, with a particular focus on their average min-entropy behavior. Bitflip symmetric chains are of interest as they could represent a physical symmetry of the random number generator, such as the symmetry between the two polarization states of a quantum random number generator (QRNG). Additionally, the SIMTP property enables us to perform exact min-entropy calculations for the process.

Definition 10 (State-Independent Maximum Transition Probability Order-$p$ Markov Chain).

A stationary order-$p$ Markov chain with state-space $S$ is said to be a State-Independent Maximum Transition Probability (SIMTP) Markov Chain if it satisfies the following property:

\[ \max_{x_t\in S} P(x_t \mid y_{t-1},\ldots,y_{t-p}) = \max_{x_t\in S} P(x_t \mid z_{t-1},\ldots,z_{t-p}) \quad \text{for every } y_{t-1},\ldots,y_{t-p},z_{t-1},\ldots,z_{t-p}\in S. \]

SIMTP models are those stationary Markov chains for which the maximum transition probability is independent of the initial state sequence of length $p$ in the chain.

Proposition 11.

Let $\{X_t\}_{t\in\mathbb{Z}}$ be an order-$p$ SIMTP model with state-space $S$. Then, for every non-negative integer $n$ and every $x_{t-1},\ldots,x_{t-p}\in S$:

\[ h_\infty(\{X_t\}_{t\in\mathbb{Z}}) = \tilde{h}_\infty(X_t,\ldots,X_{t+n} \mid X_{t-1},\ldots,X_{t-p}) = -\log\left[\max_{x_t} P(x_t \mid x_{t-1},\ldots,x_{t-p})\right]. \]

Hence, the min-entropy of the SIMTP process can be computed straightforwardly from its transition probability.
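As an illustration of how direct this computation is, the following sketch takes a table of one-step transition probabilities for a binary chain, checks the SIMTP property of Definition 10 and, if it holds, returns the min-entropy per bit in bits. The function name and the table layout (context tuple mapped to $P(X_t=1\mid\text{context})$) are assumptions made for the example, and stationarity of the chain is taken for granted rather than verified.

```python
from math import isclose, log2


def simtp_min_entropy_per_bit(p1_table):
    """Min-entropy per bit of a binary SIMTP chain, in bits.

    p1_table maps each context (x_{t-1}, ..., x_{t-p}) to P(X_t = 1 | context).
    Raises ValueError if the maximum transition probability depends on the
    context, i.e. if the table is not SIMTP.
    """
    maxima = [max(q, 1.0 - q) for q in p1_table.values()]
    if not all(isclose(m, maxima[0]) for m in maxima):
        raise ValueError("maximum transition probability depends on the context")
    return -log2(maxima[0])


if __name__ == "__main__":
    # Hypothetical order-2 table whose maximum is 0.7 in every context.
    table = {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.7}
    print(simtp_min_entropy_per_bit(table))
```

By Proposition 11 the value does not depend on the number of target bits, so no enumeration over blocks is required.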

Definition 12 (Bitflip Symmetry in Binary Order-$p$ Markov Chains).

A binary order-$p$ Markov chain exhibits Bitflip Symmetry if for all states $x_{t-p},\ldots,x_{t-1},x_t,\ldots,x_{t+n}\in\{0,1\}$ and for all non-negative integers $n$ the following property holds:

\[ P(x_t,\ldots,x_{t+n} \mid x_{t-1},\ldots,x_{t-p}) = P(1\oplus x_t,\ldots,1\oplus x_{t+n} \mid 1\oplus x_{t-1},\ldots,1\oplus x_{t-p}) \]

where $\oplus$ represents the XOR operation.

Bitflip symmetric order-$p$ Markov chains are those binary order-$p$ Markov chains for which flipping the bits of all the variables in the conditional probability statement does not change the transition probability. These chains do not distinguish between 0 and 1 but still exhibit some correlation. Our interest in Bitflip symmetric order-$p$ Markov chains is due to the following:

Lemma 13.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be an order-$p$ Bitflip-Symmetric Markov Chain with lag-$p$ point-to-point correlations. Then $\{X_{t}\}_{t\in\mathbb{Z}}$ is a SIMTP order-$p$ Markov chain.

III-B4 Generalized Binary Autoregressive Models

The gbAR(p) model [jentsch2019generalized] is an autoregressive (AR) model for binary time series data. It allows the autoregressive parameters to take values in the range (-1, 1), enabling the model to capture negative autocorrelations and alternating patterns. The gbAR(p) model is a parsimonious subclass of p-th order Markov chains for binary data: it sacrifices some flexibility compared to a full p-th order Markov chain in exchange for a much more compact parameterization.

Definition 14 (Generalized Binary Autoregressive Models [jentsch2019generalized]).

Given $t\in\mathbb{Z}$, let $\left(a_{t}^{(1)},\ldots,a_{t}^{(p)},b_{t}\right)\sim M(1;|\alpha_{1}|,\ldots,|\alpha_{p}|,\beta)$ for some $\alpha_{1},\ldots,\alpha_{p}\in(-1,1)$, $\beta\in(0,1]$ such that

$$\sum_{i=1}^{p}|\alpha_{i}|+\beta=1,$$

and let $e_{t}\sim B(1,\epsilon_{t})$ for some $\epsilon_{t}\in(0,1)$. A stationary binary order-$p$ Markov chain $\{X_{t}\}_{t\in\mathbb{Z}}$ that can be written in operator form as

$$X_{t}=\sum_{i=1}^{p}\left[a_{t}^{(i)}\mathds{1}_{\{\alpha_{i}\geq 0\}}(0\oplus\cdot)+a_{t}^{(i)}\mathds{1}_{\{\alpha_{i}<0\}}(1\oplus\cdot)\right]X_{t-i}+b_{t}e_{t},$$

where $\mathds{1}$ is the indicator function and $\oplus$ is the XOR operation, is said to be a Generalized Binary Autoregressive or gbAR(p) model.

We will denote the array of coefficients $\alpha_{1},\ldots,\alpha_{p}$ as $\boldsymbol{\alpha}$, and its L1 norm (i.e. the sum of the absolute values of its components) as $|\boldsymbol{\alpha}|$.
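As an illustration of Definition 14, the one-step transition probability can be read off the multinomial selection: with probability $|\alpha_{i}|$ the new bit copies $X_{t-i}$ (flipped when $\alpha_{i}<0$), and with probability $\beta$ it equals the noise bit $e_{t}$. The Python sketch below is a minimal rendering of that reading. The function name `gbar_p_one`, its argument layout and the coefficient values are hypothetical and are not taken from the paper's code repository. The loop at the end checks one-step bitflip symmetry for a single uniform-noise example.

```python
from itertools import product


def gbar_p_one(past, alphas, eps=0.5):
    """P(X_t = 1 | X_{t-1}, ..., X_{t-p}) for a gbAR(p) model.

    past[i] holds x_{t-(i+1)} and alphas[i] holds alpha_{i+1}.
    beta is fixed by the constraint sum(|alpha_i|) + beta = 1.
    """
    beta = 1.0 - sum(abs(a) for a in alphas)
    prob = beta * eps  # noise branch: b_t = 1 and e_t = 1
    for x, a in zip(past, alphas):
        lagged = x if a >= 0 else 1 - x  # (0 xor x) or (1 xor x)
        prob += abs(a) * lagged
    return prob


# One-step bitflip symmetry with uniform noise (eps = 1/2):
# P(1 | past) must equal P(0 | flipped past) for every context.
alphas = [0.3, -0.2, 0.1]  # illustrative values, beta = 0.4
for past in product([0, 1], repeat=len(alphas)):
    flipped = [1 - x for x in past]
    p_one = gbar_p_one(past, alphas)
    p_zero_flipped = 1.0 - gbar_p_one(flipped, alphas)
    assert abs(p_one - p_zero_flipped) < 1e-12
```

The symmetry holds here because, with $\epsilon_{t}=1/2$, flipping every past bit turns the lag contribution $s$ into $(1-\beta)-s$, so the two probabilities sum to one.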

Our experiments are performed on data generated from gbAR(p) models. The rest of this section is devoted to defining the type of gbAR(p) models we will be most interested in, to proving that they satisfy the hypothesis of the Convergence Theorem 8, and to obtaining a formula for their min-entropy that allows us to evaluate the entropy estimates of machine learning predictors (see Proposition 26 and Proposition 16 below).

Definition 15.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be a gbAR(p) model. Then:

  i) $\{X_{t}\}_{t\in\mathbb{Z}}$ is said to be positive if $\alpha_{i}\geq 0$ for every $i\in\{1,\ldots,p\}$.

  ii) $\{X_{t}\}_{t\in\mathbb{Z}}$ is said to be a uniform noise gbAR(p) model if $e_{t}\sim B\left(1,\frac{1}{2}\right)$ for every $t\in\mathbb{Z}$.

Special attention will be paid to uniform noise and positive gbAR(p) models.

Proposition 16.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be a uniform noise and positive gbAR(p) model. Then

$$\begin{aligned} h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})&=\lim_{n\to\infty}\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})\\ &=\lim_{n\to\infty}h_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=-\log\left(1-\frac{\beta}{2}\right).\end{aligned}$$
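To make the closed form of Proposition 16 concrete, the following Python sketch evaluates $-\log(1-\beta/2)$ and compares it with a brute-force search for the largest one-step transition probability of a positive, uniform noise gbAR(p) model. It assumes a base-2 logarithm (so that $\beta=1$ gives one bit per bit); the function names and the coefficient values are hypothetical.

```python
import math
from itertools import product


def min_entropy_closed_form(beta):
    """Proposition 16: -log2(1 - beta/2), assuming a base-2 logarithm."""
    return -math.log2(1.0 - beta / 2.0)


def min_entropy_brute_force(alphas):
    """-log2 of the largest one-step transition probability.

    Assumes a positive, uniform-noise gbAR(p): all alphas >= 0, eps = 1/2.
    """
    beta = 1.0 - sum(alphas)
    best = 0.0
    for past in product([0, 1], repeat=len(alphas)):
        p_one = beta / 2.0 + sum(a * x for a, x in zip(alphas, past))
        best = max(best, p_one, 1.0 - p_one)
    return -math.log2(best)


alphas = [0.2, 0.1, 0.1]  # illustrative values, beta = 0.6
beta = 1.0 - sum(alphas)
print(min_entropy_closed_form(beta))
print(min_entropy_brute_force(alphas))
```

The two values agree because, for non-negative coefficients, the most predictable context is the one in which all $p$ past bits are equal, where the next bit repeats them with probability $(1-\beta)+\beta/2=1-\beta/2$.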

Remark 17.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be a uniform noise gbAR(p) model with point-to-point lag-$p$ correlations. Apart from having point-to-point lag-$p$ correlations, $\{X_{t}\}_{t\in\mathbb{Z}}$ is bitflip-symmetric by Lemma 25. Hence, $\{X_{t}\}_{t\in\mathbb{Z}}$ is SIMTP by Lemma 13 and therefore, for every non-negative integer $n$, Proposition 11 yields

$$h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p}).$$

The argument above is illustrated in the first two plots of Figure 1, where the average min-entropy per bit and the min-entropy of uniform noise gbAR(p) models with point-to-point lag-$p$ correlations are observed to coincide regardless of the values of $n$, $p$ and $\alpha_{p}$.

Figure 1: Dependence on $|\boldsymbol{\alpha}|$ of the average min-entropy, compared with the min-entropy and the min-entropy limit, for several sequence lengths $n$, correlation scales $p$ and autocorrelation functions (uniform and point-to-point). The data points have been evaluated numerically (see Subsection IV-B).
Remark 18.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be a uniform noise and positive gbAR(p) model. Since $\{X_{t}\}_{t\in\mathbb{Z}}$ is stationary by Definition 14 and satisfies the hypothesis of the Convergence Theorem 8 by Proposition 26, it follows that

$$\lim_{n\to\infty}h(X_{t},\ldots,X_{t+n})=h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})$$

by Remark 5 and

$$\lim_{n\to\infty}\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})$$

by Theorem 9. Moreover,

$$h_{\infty}(X_{t},\ldots,X_{t+n})\geq\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})$$

by Lemma 6. The three (in)equalities above are illustrated in Figure 2, where we can observe that both the average min-entropy per bit of $(X_{t},\ldots,X_{t+n})$ and the min-entropy per bit of $(X_{t},\ldots,X_{t+n})$ tend to the min-entropy of $\{X_{t}\}_{t\in\mathbb{Z}}$ as $n$ goes to infinity, the former being lower than the latter.

Figure 2: Asymptotic behaviour of average min-entropy and min-entropy per bit in terms of the target space size $n$ for gbAR(p) models with several correlation lengths $p$ and fixed $\beta$. The $\boldsymbol{\alpha}$ arrays are uniform (i.e. all their components are equal). The data points have been evaluated numerically (see Subsection IV-B).
Remark 19.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be an order-$p$ Markov chain. It is worth noting that although the average min-entropy of $(X_{t},\ldots,X_{t+n})$ cannot decrease with $n$ by Lemma 7, the average min-entropy per bit of $(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})$ can actually decrease (see Figure 2).

[Diagram nodes: (Binary) Order-$p$ Markov Chain; gbAR(p); (Binary) SIMTP; Bitflip symmetric Order-$p$ Markov Chain; Uniform noise and positive gbAR(p); Uniform noise gbAR(p) with lag-$p$ point-to-point correlations; Bitflip symmetric Order-$p$ Markov Chain with lag-$p$ point-to-point correlations.]
Figure 3: Hierarchy of the main models considered in this paper. An arrow from model $M_{1}$ to model $M_{2}$ means that every model of type $M_{1}$ is of type $M_{2}$.

IV Experimental Methodology

In this section, we outline our experimental methodology, which is primarily implemented in code. Our main goal is to validate our theoretical findings, for which we generate correlated data using gbAR(p) models (see Definition 15).

Building upon Kelsey’s predictor concept [kelsey2015predictive], we use machine learning as a tool for min-entropy estimation. Our methodology adopts the traditional machine learning approach, marked by separate training and evaluation phases. This strategy deviates from Kelsey’s model of continuous updates, which we consider a non-essential aspect of predictor concepts for min-entropy evaluation. Thus, our methodology, termed as machine learning predictors, streamlines the process by clearly separating these stages, focusing on essential predictive capabilities without the need for constant updates.

In contrast to the approach in [truong2018machine], which examines processes that fail randomness because of large periods, we focus on processes with shorter-range, bit-level correlations. Such correlations could be more similar to the realistic failure modes of physical and hardware-based RNGs, in line with the use of order-$k$ and order-2 Markov chains in [kelsey2015predictive] and [zhu2017analysis], respectively. Given this requirement for modeling realistic RNG failures with shorter-range dependencies, gbAR(p) models provide a parsimonious parameterization that allows us to control correlations and anticorrelations, making them a suitable choice for our analysis.

The generated gbAR(p) data serves as the training set for two distinct types of neural networks, which are tasked with predicting the next target_bits bits. As highlighted in the theory section (Remark 17 and Figure 1), order-1 Markov Chains may present trivial cases where min-entropy and average min-entropy match. Therefore, it is important to analyze the behavior of our predictors in scenarios where this equivalence does not hold, which forms the basis of our experimental approach.

All the experiments are conducted on an NVIDIA GeForce RTX 3090 with 24 GiB (24576 MiB) of memory and CUDA Version 12.3.

The experimental framework is structured around four primary components: the data generation process using the gbAR(p) model; the Monte Carlo simulation for the evaluation of min-entropies; the training and evaluation of the machine learning models, specifically GPT-2 and a variation of RCNN (a model taken from [truong2018machine]); and the integration of all data processing steps into a single pipeline. This pipeline encompasses the generation of gbAR(p) data, its evaluation using the NIST SP 800-90B test suite, the execution of machine learning predictions, and the compilation of relevant results.

For detailed documentation on code usage, parameter explanations, and further technical details, please refer to the README.md file in the associated code repository [github_code_repo].

IV-A Data Generation

The data generation is an implementation of the gbAR(p) model (Definition 14) in the gbAR() function. The call to this function is wrapped in the function generate_gbAR_random_bytes(), which supports different autocorrelation functions, namely point-to-point, uniform (all the components being equal), exponential, and Gaussian, to define the autocorrelation pattern through the $\boldsymbol{\alpha}$ parameter (as defined in Definition 14).

It computes binary sequences by combining the autocorrelation defined by $\boldsymbol{\alpha}$ with the (here always uniform) random noise term (weighted by $\beta$), which is sourced from high-entropy random numbers generated by OpenSSL’s rand command. Random numbers from OpenSSL are also employed in the ossl_rand_mn_rvs() function, which generates samples from a multinomial distribution as required by the gbAR(p) model. These samples are then used to construct the final binary sequence according to the autocorrelation characteristics defined by the model parameters.

The gbAR() function includes a mechanism that discards an initial segment (here with size $10^{4}$ bytes) of the generated binary sequence. The rationale behind this is to allow the sequence to reach a state of statistical stationarity, thereby minimizing initial transient effects introduced in the generation.
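The generation procedure described in this subsection can be sketched in a few lines of Python. This is not the repository's gbAR() function: the name `simulate_gbar`, its parameters and the use of Python's `random` module (instead of OpenSSL) are assumptions made only to keep the illustration self-contained. The burn-in is expressed in bits, so the default of 80,000 bits corresponds to $10^{4}$ bytes.

```python
import random


def simulate_gbar(alphas, n_bits, burn_in_bits=80_000, eps=0.5, seed=None):
    """Draw n_bits from a gbAR(p) model after discarding a burn-in segment.

    NOTE: Python's `random` is used here for brevity; the paper sources its
    randomness from OpenSSL. All names in this sketch are hypothetical.
    """
    rng = random.Random(seed)
    p = len(alphas)
    weights = [abs(a) for a in alphas]
    beta = 1.0 - sum(weights)
    if p < 1 or beta <= 0.0:
        raise ValueError("need p >= 1 and sum(|alpha_i|) < 1 so that beta > 0")
    branches = list(range(p + 1))  # index p is the noise branch
    history = [rng.getrandbits(1) for _ in range(p)]  # arbitrary start state
    out = []
    for step in range(burn_in_bits + n_bits):
        # Multinomial M(1; |alpha_1|, ..., |alpha_p|, beta): pick one branch.
        k = rng.choices(branches, weights=weights + [beta])[0]
        if k == p:
            bit = 1 if rng.random() < eps else 0  # e_t ~ B(1, eps)
        else:
            lagged = history[-(k + 1)]  # x_{t-(k+1)}
            bit = lagged if alphas[k] >= 0 else 1 - lagged
        history.append(bit)
        history = history[-p:]  # keep only the last p bits
        if step >= burn_in_bits:
            out.append(bit)  # bits produced during burn-in are discarded
    return out
```

For example, `simulate_gbar([0.3, 0.2], n_bits=1000, seed=0)` would return 1000 bits from a positive, uniform noise gbAR(2) model with $\beta=0.5$.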

IV-B Min-Entropies Calculation

To evaluate the average min-entropy and min-entropy of the gbAR(p) processes numerically, we employ a Monte Carlo simulation. A program generates 100 samples, each comprising $10^{5}$ bytes. These samples are used to empirically estimate the joint frequencies. The computed frequencies form the basis for calculating both the min-entropy and the average min-entropy (using the known transition probabilities specific to the gbAR(p) processes).
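A simplified, purely empirical version of this calculation is sketched below in Python. It assumes the standard definitions, namely a worst-case conditional min-entropy $-\log\max_{a,b}P(a\mid b)$ and an average min-entropy $-\log\sum_{b}\max_{a}P(a,b)$, both with base-2 logarithms and normalised per target bit. Unlike the procedure described above, it does not use the known transition probabilities. The function name and its arguments are hypothetical.

```python
import math
from collections import Counter, defaultdict


def empirical_min_entropies(bits, p, n_target):
    """Estimate (min-entropy, average min-entropy) per target bit.

    Both quantities refer to blocks of n_target bits conditioned on the
    previous p bits and are computed from empirical window frequencies.
    """
    width = p + n_target
    joint = Counter(
        tuple(bits[i:i + width]) for i in range(len(bits) - width + 1)
    )
    total = sum(joint.values())
    by_context = defaultdict(dict)
    for window, count in joint.items():
        by_context[window[:p]][window[p:]] = count

    worst_case = 0.0  # max over contexts and targets of P(target | context)
    guess_prob = 0.0  # sum over contexts of max over targets of P(context, target)
    for targets in by_context.values():
        context_count = sum(targets.values())
        top = max(targets.values())
        worst_case = max(worst_case, top / context_count)
        guess_prob += top / total
    return (-math.log2(worst_case) / n_target,
            -math.log2(guess_prob) / n_target)
```

On the periodic sequence 001001..., for instance, the context 1 is always followed by 0, so the worst-case value is zero, while the average over contexts remains strictly positive.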

IV-C Machine Learning Predictors

In this work, we use two distinct machine learning models to tackle the task of predicting binary sequences. The first model is an adaptation of the RCNN, while the second model is based on the GPT-2 architecture.

The selection of the RCNN and GPT-2 architectures is driven by our goal to explore prediction capabilities on binary sequences generated from autoregressive models with short-range correlations. The RCNN, as used by Truong et al., has proven effective in detecting correlations in quantum random number generators under deterministic classical noise influence. Its convolutional and recurrent layers are well suited to capture local patterns and short-term dependencies. In contrast, the transformer-based GPT-2 model, with its self-attention mechanism, offers a different approach. Although originally designed for natural language processing, we adapt it to our binary sequence prediction task to examine how it captures order-$p$ Markov chain characteristics. Using these two models enables us to validate the theoretical finding that machine learning predictors tend to estimate average min-entropy independently of architecture, provided they can learn from the data’s correlations.

IV-C1 Target Space Representation and Inference Strategies

As discussed in Section II-B, various approaches exist for multi-token prediction in language models. While recent works like Gloeckle et al. [gloeckle2024better], Stern et al. [stern2018blockwise], and Cai et al. [cai2024medusa] have shown promising results in natural language processing tasks, our research focuses on a different domain. We aim to explore the relationship between model predictions and the min-entropy of the data, specifically for data with different correlations, rather than natural language.

To address the limitations of autoregressive inference strategies that rely on local decisions at each step, we propose directly predicting the entire sequence of n bits simultaneously. This approach allows us to obtain the global maximum probability for the complete sequence from the model, rather than relying on step-by-step decisions. By doing so, we aim to capture long-range dependencies and avoid the potential pitfalls of greedy, beam search or other methods in finding globally optimal sequences.

Our method involves using different tokenization strategies for input and target spaces:

  • Input space: We use binary tokenization where each token represents a single bit (0 or 1).

  • Output space: We employ a tokenization where each token represents $n$ bits, resulting in $2^{n}$ unique classes.

This tokenization approach allows the model to predict groups of bits as single tokens, considering the joint probability of the entire sequence. We believe this method will lead to more accurate and globally optimal predictions, as it forces the model to consider the interdependencies between bits in the sequence.
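The two tokenizations can be illustrated with a small Python sketch that splits one window of bits into single-bit input tokens and a single target class. The function name `make_example`, the window layout (inputs followed immediately by targets) and the most-significant-bit-first packing are assumptions of this sketch, not details confirmed by the paper's code.

```python
def make_example(bits, start, seqlen, target_bits):
    """Split one window into single-bit input tokens and one target class.

    The window covers bits[start : start + seqlen]; its last target_bits
    bits are packed (most significant bit first) into a class index in
    range(2 ** target_bits). Names and layout are hypothetical.
    """
    window = bits[start:start + seqlen]
    if len(window) < seqlen:
        raise ValueError("window runs past the end of the sequence")
    inputs = window[:seqlen - target_bits]
    target = 0
    for bit in window[seqlen - target_bits:]:
        target = (target << 1) | bit  # append the next bit to the class index
    return inputs, target


bits = [1, 0, 1, 1, 0, 0, 1, 0]
inputs, target = make_example(bits, start=0, seqlen=8, target_bits=3)
print(inputs, target)  # [1, 0, 1, 1, 0] 2
```

Treating the $n$ target bits as one class lets a single softmax output assign a probability to every $n$-bit continuation at once, at the cost of an output layer that grows as $2^{n}$.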

IV-C2 Model Training and Evaluation Methodology

Our training dataset consists of sequences generated from the gbAR(p) model. From these sequences, we extract subsequences of length $\texttt{seqlen}-\texttt{target\_bits}$ bits. Both models are adapted to classify over $2^{n}$ classes, corresponding to all possible sequences of $n$ bits. In this context, each possible combination of target_bits is treated as a distinct class in the classification task.

We evaluate the model prediction accuracy $P_{ML}$ as

$$P_{ML}=\frac{n_{correct}}{n_{total}}. \tag{2}$$

Here, $n_{correct}$ represents the number of correct predictions made by the model, and $n_{total}$ denotes the total number of evaluations conducted. This measure of accuracy serves as a key indicator of the model’s performance and its ability to accurately predict future bits based on the training received. The estimated min-entropy is

$$h_{ML}=-\log(P_{ML}). \tag{3}$$

We get a basic approximation of the error using the Wald approximation for the binomial proportion confidence interval, as outlined in [meeker2017statistical]. The propagation of this error yields

$$\Delta h_{ML}=\frac{z}{\texttt{target\_bits}}\ln(2)\sqrt{\frac{\frac{1}{P_{ML}}-1}{n_{\text{evals}}}}, \tag{4}$$

where $n_{\text{evals}}$ is the number of evaluation sequences.
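The estimate and its uncertainty can be sketched in Python as follows. Rather than the linearised propagation of Eq. (4), this sketch maps the two endpoints of the Wald interval for $P_{ML}$ directly through the logarithm, which is a swapped-in technique chosen for simplicity. It also assumes a base-2 logarithm and a normalisation by target_bits, and the function name and the default $z=1.96$ are hypothetical.

```python
import math


def min_entropy_estimate(n_correct, n_total, target_bits, z=1.96):
    """Return (h, h_low, h_high) in bits per target bit.

    Assumptions of this sketch: base-2 logarithm, normalisation by
    target_bits, and an interval obtained by mapping the Wald endpoints
    of P_ML through the logarithm instead of linearising as in Eq. (4).
    """
    if not 0 < n_correct <= n_total:
        raise ValueError("need 0 < n_correct <= n_total")
    p_ml = n_correct / n_total
    half_width = z * math.sqrt(p_ml * (1.0 - p_ml) / n_total)
    p_low = max(p_ml - half_width, 1e-12)  # keep the logarithm finite
    p_high = min(p_ml + half_width, 1.0)

    def per_bit(prob):
        return -math.log2(prob) / target_bits

    # Higher accuracy means lower entropy, so the endpoints swap roles.
    return per_bit(p_ml), per_bit(p_high), per_bit(p_low)
```

For example, `min_entropy_estimate(5000, 10000, target_bits=1)` gives a central value of one bit per bit, with the interval narrowing as the number of evaluations grows.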

In addition to accuracy, to assess the performance of the training procedure during the development phase, we have included the evaluation of the binary entropy of the predictions, $P_{e}$, the proportion of zeros in the prediction, $P_{c}$, and the loss.

IV-C3 RCNN Model

We use a model based on the RCNN model from the framework presented in [truong2018machine]. The original implementation combines convolutional and LSTM layers followed by fully-connected layers. Initially, input integers are converted into one-hot vectors. These vectors are then processed through convolutional layers with max-pooling to extract features, which are subsequently handled by the LSTM layer to capture temporal dependencies.

Our adaptation transitions from byte-based input processing to bit sequence handling, accommodating classification over a fixed number of $2^{n}$ classes, where $n$ is the number of target_bits. This aligns with the original design intended for classification over fixed $2^{n}$ classes. The architecture employs an output layer with $2^{n}$ neurons and softmax activation, allowing for multi-class classification. The categorical cross-entropy loss function is used for training. We use the RMSprop optimizer with a learning rate of 0.005.

Regarding the model architecture, we have slightly modified Truong’s model to increase its size, including:

  • Convolution1D layers with 32, 64, and 128 filters, kernel sizes of 12, 6, and 3 respectively, all using ’relu’ activation and ’same’ padding.

  • LSTM layers with 256 and 128 units, featuring return sequences and dropout layers with a rate of 0.2 for regularization.

  • A final Dense layer with an output size equal to target_bits, using sigmoid activation.

This model architecture results in approximately $7.6\cdot 10^{5}$ trainable parameters.

IV-C4 GPT-2 Model

The GPT-2 model, referenced in [radford2019language], is adapted from its typical use in natural language processing as provided by the Hugging Face Transformers library [huggingface_transformers]. This adaptation restructures the model for traditional classification over the possible $2^n$ classes for the next $n$ target_bits.

For processing the binary sequences, we implement a custom BinaryDataset class. Each data entry in this dataset consists of a binary bit sequence with a length defined by the seqlen parameter. This setup facilitates the mapping of each bit in the input sequence to the next target_bits, aligning with the classification framework.
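The data layout described above can be sketched without any deep-learning dependency. The class below is a hypothetical, framework-free stand-in for BinaryDataset: it cuts a bit stream into windows of seqlen input bits followed by target_bits label bits. The non-overlapping stride, the single label per window, the big-endian label, and every name other than seqlen and target_bits are assumptions rather than details confirmed by the text.

```python
class BinaryWindows:
    """Hypothetical stand-in for a dataset of (input bits, class label) pairs."""

    def __init__(self, data, seqlen, target_bits):
        # Unpack bytes into bits, most significant bit first.
        self.bits = [(byte >> shift) & 1
                     for byte in data for shift in range(7, -1, -1)]
        self.seqlen = seqlen
        self.target_bits = target_bits
        self.stride = seqlen + target_bits  # assumed: windows do not overlap

    def __len__(self):
        return len(self.bits) // self.stride

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start = i * self.stride
        inputs = self.bits[start:start + self.seqlen]
        target = self.bits[start + self.seqlen:start + self.stride]
        # Class index in [0, 2**target_bits), first target bit most significant.
        label = int("".join(map(str, target)), 2)
        return inputs, label


dataset = BinaryWindows(bytes([0b10110010, 0b01011100]), seqlen=5, target_bits=3)
for inputs, label in dataset:
    print(inputs, label)
```

In a real training setup the same indexing logic would typically sit behind the dataset interface of the chosen framework, returning tensors instead of Python lists.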

In terms of model architecture for the adapted GPT-2, the vocabulary size is set to $2^{\texttt{target\_bits}}$, aligning with the number of classes in our classification framework. The specific configuration of the model includes parameters such as n_positions=512, n_ctx=512, n_embd=768, n_layer=3, and n_head=3. This configuration leads to the GPT-2 model having $\sim 21\cdot 10^{6}$ trainable parameters.
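As a rough cross-check of the stated model size, the parameter count of a GPT-2-style decoder can be tallied by hand. The sketch below assumes the standard GPT-2 block layout (a fused query/key/value projection, an MLP with four times the embedding width, two layer norms per block, learned position embeddings, and an output head tied to the token embedding). Whether the adapted model ties its output head is an assumption, and the function name is illustrative.

```python
def gpt2_param_count(n_embd, n_layer, n_positions, vocab_size):
    """Approximate trainable parameters of a GPT-2-style decoder."""
    d = n_embd
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # two dense layers, 4x hidden width
    layer_norms = 2 * (2 * d)                       # scale and bias, twice per block
    per_block = attention + mlp + layer_norms
    embeddings = vocab_size * d + n_positions * d   # token (tied head) + position
    final_norm = 2 * d
    return n_layer * per_block + embeddings + final_norm


for target_bits in (1, 8):
    total = gpt2_param_count(n_embd=768, n_layer=3, n_positions=512,
                             vocab_size=2 ** target_bits)
    print(target_bits, f"{total:.3e}")
```

Under these assumptions the three transformer blocks account for most of the parameters. The number of attention heads does not change the count, while the token-embedding term scales with $2^{\texttt{target\_bits}}$.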

For the training phase, we use the RMSprop optimizer with a learning rate of 0.0005 and CrossEntropyLoss as the loss function.

We incorporate gradient scaling and accumulation in our training approach to reduce memory usage and improve computational efficiency, which is especially important under constrained GPU availability.
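Gradient accumulation can be illustrated independently of any framework. The toy example below uses a one-parameter least-squares model with made-up data. It shows that summing micro-batch gradients, each weighted by its share of the samples, reproduces the full-batch gradient while only one micro-batch needs to be processed at a time. It is only a sketch of the idea and does not cover gradient scaling or the actual training loop.

```python
def mse_gradient(weight, xs, ys):
    """Derivative with respect to w of mean((w * x - y) ** 2) over one batch."""
    return sum(2.0 * (weight * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_gradient(weight, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients into one full-batch gradient."""
    total = 0.0
    n = len(xs)
    for start in range(0, n, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the samples, so that the
        # sum matches the mean over the full batch.
        total += mse_gradient(weight, mx, my) * len(mx) / n
    return total


xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]  # made-up inputs
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]  # made-up targets
full = mse_gradient(0.0, xs, ys)
accumulated = accumulated_gradient(0.0, xs, ys, micro_batch_size=2)
print(full, accumulated)
```

The optimizer step is then applied once per accumulated gradient, so the effective batch size is decoupled from what fits in GPU memory.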

IV-D Pipeline

We encapsulate the entire data processing workflow for this work in a pipeline, from generating random numbers to saving results.

For each selection of input parameters, the method generates random bytes using the previously described gbAR(p) model. These bytes are saved to a file, which is later used in the data generators within the models to train and evaluate the models. We generate new gbAR(p) sequences for each run to ensure data variability and robustness.

The pipeline runs the NIST SP 800-90B entropy assessment for non-IID data in parallel with the model execution over a sample of the generated data (here $10^{7}$ bytes).

After the analysis, we compile the results, including entropy assessments, model parameters, $P_{ML}$ values, and execution time, among others, into a CSV file.

V Results

Our primary objective is to investigate the relationship between the estimated min-entropy and the number of target_bits. To facilitate this analysis, we focus on low-entropy data for several reasons. Firstly, it ensures that models can effectively learn and capture underlying patterns. Secondly, it enhances the distinction between model predictions and inherent noise, allowing for more robust statistical analysis. In high-entropy scenarios (characterized by small $\alpha$ values), the entropy per bit approaches 1, resembling uniform noise (see Figure 1). This proximity to maximum entropy poses challenges in assessing model performance due to overlapping confidence intervals. These intervals often encompass both the maximum entropy value of 1 and the expected theoretical value, which is also close to 1. Consequently, a large number of evaluations would be required to reduce measurement uncertainty and achieve statistical distinguishability. To address these challenges, we employ a generalized binary autoregressive model of order $p=10$ with a uniform $\alpha$ vector and a uniform noise term $\beta=0.5$. This configuration provides a balance between learnable patterns and stochastic noise, enabling effective extraction of the target_bits dependence while maintaining statistical significance in our results.

The amount of training data for both the GPT-2 and RCNN models varied with the number of target bits to be predicted. For tasks involving 1 to 12 target bits, 20 million bytes of raw data were used, equivalent to 125,000 training sequences of 128 bits each. This increased to 30 million bytes (187,499 sequences) for 13 to 15 target bits, and further to 42 million bytes (262,499 sequences) for 16 target bits. In each case, 80% of the available data was allocated for training, with the remaining 20% reserved for evaluation.

We compare the min-entropy estimated by these models against the theoretical calculations, the estimates provided by the individual NIST SP 800-90B predictors, and the overall NIST SP 800-90B entropy assessment. The results of these experiments are presented in Figure 4.

Figure 4: Theoretical minimum entropies versus machine learning-based estimations of minimum entropy for gbAR(10) with a uniform $\alpha$ vector and a uniform noise term $\beta=0.5$ (representing a low entropy scenario). For clarity in visualization, the markers representing different models are slightly offset along the x-axis. Error bars are derived using a binomial proportion confidence interval with a 95% confidence level. The results from the NIST SP 800-90B global predictor tests are emphasized. The highlighted predictor entropies are the minimum of local and global estimates, in this case predominantly influenced by the local estimate. Bitstring predictors try to predict the next bit, while Literal predictors try to predict the next byte. The entropy_non_iid_nist denotes the final outcome of the NIST SP 800-90B analysis, which is the minimum of all conducted tests in the suite. The h_min_limit is the theoretical limit of the min-entropy, interpreted as the min-entropy per bit of the entire process. For this specific gbAR(10) configuration with positive $\alpha$, both min-entropies converge to this limit.

Finally, for illustration purposes, we demonstrate how greedy decoding fails to accurately estimate the min-entropy in data with certain types of correlations. This experiment used 20 million bytes of data, equivalent to 125,000 sequences of 128 bits each. Following our standard protocol, 80% (100,000 sequences) were used for training and 20% (25,000 sequences) for evaluation. In this case, we focused on predicting 1 to 8 target bits. For comparison, we used two gbAR(2) models with $\alpha$ vectors $\frac{1}{4}[+1,+1]$ and $\frac{1}{4}[+1,-1]$. In the first case ($\frac{1}{4}[+1,+1]$), the global maximum can be reached as the product of the local maxima at each bit (see Remark 23), so both approaches match. In the case with alternating correlation signs, the global maximum probability cannot be factored into the product of the local maxima, so greedy decoding, which does not consider the joint probability of the complete sequence, yields suboptimal predictions compared to inference over the $2^n$ classes. The results are shown in Figure 5: greedy decoding falls short in estimating the min-entropy under certain correlation conditions, while performing adequately in others.

Figure 5: Comparison of min-entropy estimates: Greedy decoding vs. direct prediction over $n$ target_bits for gbAR(2) models. The discrepancy between the experimental estimate of min-entropy and the theoretical value is evident for $|\sqrt{\alpha}|[+1,-1]$, while results align for $|\sqrt{\alpha}|[+1,+1]$.
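The mechanism behind this failure can be reproduced on a toy example. The sketch below uses a hypothetical order-1 binary chain whose transition probabilities are chosen by hand (it is not one of the gbAR(2) models above) and compares the most likely 2-bit continuation found by exhaustive search over all $2^n$ candidates with the one found by greedy, bit-by-bit decoding.

```python
import math
from itertools import product

# Hypothetical order-1 binary chain: P_NEXT[state][bit] = P(next bit | previous bit).
P_NEXT = {0: {0: 0.6, 1: 0.4},
          1: {0: 0.05, 1: 0.95}}


def sequence_probability(state, bits):
    """P(bits | starting state) under the chain above."""
    prob = 1.0
    for bit in bits:
        prob *= P_NEXT[state][bit]
        state = bit
    return prob


def joint_best(state, n):
    """Most likely n-bit continuation, searching all 2**n candidates."""
    return max(product((0, 1), repeat=n),
               key=lambda bits: sequence_probability(state, bits))


def greedy_best(state, n):
    """Pick the locally most likely bit at each step."""
    bits = []
    for _ in range(n):
        bit = max((0, 1), key=lambda b: P_NEXT[state][b])
        bits.append(bit)
        state = bit
    return tuple(bits)


n = 2
for name, bits in (("joint", joint_best(0, n)), ("greedy", greedy_best(0, n))):
    p = sequence_probability(0, bits)
    print(name, bits, round(p, 4), round(-math.log2(p) / n, 4))
```

Starting from state 0, greedy decoding commits to bit 0 (probability 0.6) and ends with a sequence probability of about 0.36, whereas the joint search finds (1, 1) with probability of about 0.38. The lower success probability of the greedy predictor translates into a higher, and therefore overestimated, min-entropy.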

VI Discussion

Our work builds upon Kelsey’s definition of predictors [kelsey2015predictive], showing that these predictors effectively estimate the average min-entropy as long as they can harness correlations and effectively model conditional, rather than joint, probabilities. This distinction becomes significant when dealing with stochastic processes with complex correlation structures.

We show that while min-entropy varies with the number of bits considered, defining min-entropy per bit for the entire process is still possible. Lemma 6 establishes that the average min-entropy per bit is always lower than or equal to the min-entropy for order-p Markov chains. Although generally distinct, in specific cases (Theorem 9), both joint min-entropy and average min-entropy converge towards a process-wide min-entropy per bit. Interestingly, different states may exhibit varied decay laws despite this common limit (Figure 2), which is operationally significant when attackers have access to correlated data.

Figure 1 illustrates the interplay between correlation ’width’ and ’length’ in how average min-entropy approaches the min-entropy limit. This aligns with Remark 17, where average min-entropy per bit equals the min-entropy of uniform noise gbAR(p) models with point-to-point lag-$p$ correlations, regardless of target_bits, $p$, and $|\boldsymbol{\alpha}|$ values.

As we approach high entropy limits, all entropy forms converge to the process limit, consistent across various alpha levels for the considered gbAR(p) models. This reaffirms that min-entropy is never lower than average min-entropy, as per Lemma 6.

Figure 4 shows that our min-entropy estimations from both models align with the average min-entropy and stay within the error interval. Interestingly, the NIST Bitstring global predictors, designed to estimate the entropy of 1 target bit, generally overestimate the average min-entropy for 1 target bit, with the notable exception of the MultiMMC predictor. For 8 target_bits, the NIST global predictors, specifically MultiMCW and Lag, tend to overestimate the average min-entropy. However, LZ78Y and MultiMMC show results that are close to the theoretical calculation. It is important to note that NIST predictors are not designed to estimate entropy in the large $n$ limit. In the particular case studied, where the min-entropy continues to decay beyond $n=8$, it is not surprising that the GPT-2 predictor provides a lower estimate in the $n=16$ run. This estimate is compatible with the theoretically expected value and, moreover, does not overlap with the gray area representing the minimum between the local and global estimates. Consequently, the GPT-2 predictor’s estimate is closer to the min-entropy of the stochastic process, making it a better and more conservative estimation.

Local predictions consistently dominate the min-entropy estimate across all cases. As a result, the overall outcome of the predictor is determined by the local estimate, since the final entropy estimate is the lesser of local and global predictions. The local predictions fall within the theoretical min-entropy for the 9-14 target_bits range and are significantly higher than the min-entropy limit of the process. The overall result of the NIST’s entropy non-IID test, which is the minimum of all tests in the suite, including both predictors and non-predictors, is closer to the min-entropy limit. In this specific scenario, predictors do not significantly contribute to the overall test.

Our analysis of greedy decoding versus direct prediction (Figure 5) reveals limitations in autoregressive inference approaches. For certain correlation structures (e.g., gbAR(2) with $|\sqrt{\alpha}|[+1,-1]$), greedy decoding fails to capture the global maximum probability. This underscores the importance of multi-token prediction for accurately estimating min-entropy in complex correlation structures. Single-token or greedy approaches may lead to suboptimal predictions and min-entropy overestimation, emphasizing the need for methods capturing joint probabilities over multiple tokens.

In conclusion, our machine learning predictors demonstrate more consistent performance than the NIST SP 800-90B predictors in estimating average min-entropy for both 1 and 8 target_bits in low entropy scenarios, while also providing robust estimates for larger values of $n$. This superior performance can be attributed, in part, to the non-parametric nature of ML min-entropy estimation. Unlike traditional methods that often assume specific underlying distributions, ML approaches allow for flexible modeling, making them particularly powerful in capturing complex, non-linear dependencies in the data. This flexibility is especially valuable when dealing with stochastic processes exhibiting intricate correlation structures, as it enables the model to adapt to the data’s inherent patterns without being constrained by predetermined statistical assumptions.

High entropy scenarios present additional challenges, potentially requiring larger training runs and models. Augmenting target bits offers improved capture of long-range correlations but at a significant computational cost. As Equation 4 indicates, maintaining constant error rates with increasing target bits requires exponential growth in evaluations, as $P_{ML}\sim 2^{-\texttt{target\_bits}}$ in high entropy limits. This accuracy-computation trade-off necessitates careful balancing of long-range correlation capture and practical computational requirements. Future research should focus on developing efficient algorithms or approximation methods to handle larger target bits without prohibitive computational costs, addressing the challenges of multi-token prediction, autoregressive inference limitations, and target bit scaling in entropy estimation.
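To make the exponential growth explicit, Equation 4 can be solved for the number of evaluations. In the sketch below, $C$ is shorthand (introduced here, not in the original derivation) for the constant prefactor of Equation 4, that is, the factors that do not depend on $P_{ML}$, target_bits, or $n_{\text{evals}}$. The last step uses the high-entropy approximation $P_{ML}\sim 2^{-\texttt{target\_bits}}$ quoted above.

```latex
\Delta h_{ML} = \frac{C}{\texttt{target\_bits}}
  \sqrt{\frac{\frac{1}{P_{ML}}-1}{n_{\text{evals}}}}
\;\Longrightarrow\;
n_{\text{evals}}
  = \left(\frac{C}{\texttt{target\_bits}\,\Delta h_{ML}}\right)^{2}
    \left(\frac{1}{P_{ML}}-1\right)
  \approx \left(\frac{C}{\texttt{target\_bits}\,\Delta h_{ML}}\right)^{2}
    \left(2^{\texttt{target\_bits}}-1\right).
```

For a fixed target error $\Delta h_{ML}$, the polynomial factor $1/\texttt{target\_bits}^{2}$ only partially offsets the $2^{\texttt{target\_bits}}$ term, so the required number of evaluations still grows exponentially.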

VII Conclusions

Our research has revealed several key insights into the estimation of min-entropy. We have shown that machine learning predictors are good at estimating average min-entropy, as long as they effectively harness correlations by estimating conditional probabilities. This becomes particularly significant in stochastic processes with complex correlation structures, where the difference between average min-entropy and min-entropy is relevant.

Our results highlight that both these entropies depend on the number of target_bits considered. Given this important role of target_bits, especially in scenarios with complex correlation structures, it may be operationally significant to include assessments of both average min-entropy and min-entropy for specific target_bits values relevant to cryptographic (or other) scenarios. Importantly, in the examples studied, we observed that as average min-entropy decays with increasing target_bits, targeting only a few bits could lead to an overestimate of the min-entropy. This finding underscores the potential risks of relying on limited-scope entropy estimates in cryptographic applications. While defining min-entropy per bit for the entire process is feasible, and the entropies studied here converge towards this limit (suggesting a lower bound or worst-case scenario), this bound has not been explicitly demonstrated in this work. Therefore, developing effective methods to estimate this limit remains essential.

We have also found that machine learning predictors can outperform the estimates of the NIST SP 800-90B predictors in some cases, making them suitable tools to be included in entropy assessment suites.

Our research leaves several avenues open for exploration that may be of particular interest in further studies:

  • Development of methods for estimating min-entropy in large target_bits scenarios.

  • Exploration of the relationship between training data min-entropy, model size, and necessary training data size for accurate min-entropy estimation. This investigation goes beyond aligning model prediction accuracy with theoretical curves; it aims to provide a deeper understanding of model learning capacity at various entropy levels. Such knowledge could inform the appropriate scaling of computational resources and potentially offer improved estimates by considering theoretical bounds on min-entropy estimation.

In conclusion, while our work has advanced the understanding of min-entropy estimation through machine learning, it also highlights the practical complexity of this method and the need for more research to address its challenges.

VIII Acknowledgments

This work is part of the R&D project TED2021-130369B-C32, funded by MCIN/AEI/10.13039/501100011033 and by the “European Union NextGenerationEU/PRTR”, and is part of the project COMPROMISE PID2020-113795RB-C32/AEI/10.13039/501100011033. In addition, it was partially supported by project i-SHAPER PRTR-INCIBE - 2023/00623/001, which is being carried out within the framework of the Recovery, Transformation, and Resilience Plan funds, funded by the European Union (Next Generation). The authors want to thank Miguel Angel Hombrados Herrera and Gonzalo Martínez Ruiz de Arcaute for their help and fruitful comments.

The purpose of this Appendix is to provide the proofs for the results stated in Section III and to include some additional auxiliary results needed to prove them.

.1 Order-p Markov Chains

We start by revisiting the results presented in Section III-B.

Proof of Lemma 6.

On the one hand, note that:

$$
\begin{aligned}
&\left\langle\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right\rangle_{x_{t-1},\ldots,x_{t-p}}\\
&\quad=\sum_{x_{t-p},\ldots,x_{t-1}}P(x_{t-p},\ldots,x_{t-1})\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\\
&\quad\leq\sum_{x_{t-p},\ldots,x_{t-1}}P(x_{t-p},\ldots,x_{t-1})\max_{x_{t-p},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\\
&\quad=\max_{x_{t-p},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).
\end{aligned}
\tag{5}
$$

The inequality

$$
\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})\geq h_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})
$$

follows from (5) by taking logarithms, dividing by $n+1$, and changing the sign.

On the other hand, the inequality

$$
h_{\infty}(X_{t},\ldots,X_{t+n})\geq\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})
$$

easily follows writing:

$$
\begin{aligned}
&H_{\infty}(X_{t},\ldots,X_{t+n})=\\
&\quad-\log\left[\max_{x_{t},\ldots,x_{t+n}}\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\,P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right]
\end{aligned}
$$

and taking into account that:

$$
\begin{aligned}
&\max_{x_{t},\ldots,x_{t+n}}\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\,P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\\
&\quad\leq\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).
\end{aligned}
$$
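The two inequalities of Lemma 6 can be checked numerically on a small example. The sketch below uses a hypothetical stationary order-1 binary chain with hand-picked transition probabilities (not a model from the paper) and computes, for blocks of `length` $= n+1$ bits, the per-bit joint min-entropy, the per-bit average min-entropy, and the per-bit conditional (worst-case) min-entropy by exhaustive enumeration.

```python
import math
from itertools import product

A, B = 0.4, 0.05  # hypothetical P(1 | 0) and P(0 | 1)
P_NEXT = {0: {0: 1 - A, 1: A}, 1: {0: B, 1: 1 - B}}
STATIONARY = {0: B / (A + B), 1: A / (A + B)}  # stationary law of the 2-state chain


def conditional(state, bits):
    """P(bits | previous bit = state)."""
    prob = 1.0
    for bit in bits:
        prob *= P_NEXT[state][bit]
        state = bit
    return prob


def per_bit_entropies(length):
    """Return (joint, average, conditional) min-entropies per bit."""
    blocks = list(product((0, 1), repeat=length))
    # Joint: maximise the state-averaged probability of one fixed block.
    joint = max(sum(STATIONARY[s] * conditional(s, b) for s in (0, 1))
                for b in blocks)
    # Average: maximise per state first, then average over states.
    average = sum(STATIONARY[s] * max(conditional(s, b) for b in blocks)
                  for s in (0, 1))
    # Conditional (worst case): maximise over both the state and the block.
    worst = max(conditional(s, b) for s in (0, 1) for b in blocks)
    return tuple(-math.log2(p) / length for p in (joint, average, worst))


for length in (1, 2, 4, 8):
    h_joint, h_avg, h_cond = per_bit_entropies(length)
    print(length, round(h_joint, 4), round(h_avg, 4), round(h_cond, 4))
```

For every block length, the printed values satisfy joint $\geq$ average $\geq$ conditional, as the lemma requires. The ordering comes only from exchanging the maximum with the sum over conditioning states, so it does not depend on the particular transition probabilities chosen here.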

Proof of Lemma 7.

We have that:

$$
\begin{aligned}
&P(x_{t},\ldots,x_{t+n+m}\mid x_{t-1},\ldots,x_{t-p})\\
&\quad=P(x_{t+n+1},\ldots,x_{t+n+m}\mid x_{t+n},\ldots,x_{t-p})\,P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\\
&\quad\leq P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).
\end{aligned}
$$

Since this inequality is independent of the realizations $x_{t+n+1},\ldots,x_{t+n+m}$, it follows that:

$$
\max_{x_{t},\ldots,x_{t+n+m}}P(x_{t},\ldots,x_{t+n+m}\mid x_{t-1},\ldots,x_{t-p})\leq\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).
$$

Averaging over the past states preserves this inequality and, since the logarithm is monotonically increasing, $-\log$ reverses it. We conclude that:

\[
\begin{aligned}
\tilde{H}_{\infty}(X_t,\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})
&= -\log\left[\sum_{x_{t-1},\ldots,x_{t-p}} P(x_{t-1},\ldots,x_{t-p}) \max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right] \\
&\leq -\log\left[\sum_{x_{t-1},\ldots,x_{t-p}} P(x_{t-1},\ldots,x_{t-p}) \max_{x_t,\ldots,x_{t+n+m}} P(x_t,\ldots,x_{t+n+m}\mid x_{t-1},\ldots,x_{t-p})\right] \\
&= \tilde{H}_{\infty}(X_t,\ldots,X_{t+n+m}\mid X_{t-1},\ldots,X_{t-p}).
\end{aligned}
\]
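As a numerical sanity check of this monotonicity, the following Python sketch computes the average min-entropy of a block of target bits given the past $p$ bits and prints it for increasing block lengths. The order-2 binary chain and its transition probabilities `P_ONE` are hypothetical values chosen only for illustration, base-2 logarithms are assumed, and the stationary law is approximated by power iteration.

```python
import itertools
import math

# Hypothetical order-2 binary chain: P(x_t = 1 | x_{t-1}, x_{t-2}).
P_ONE = {(0, 0): 0.7, (0, 1): 0.4, (1, 0): 0.55, (1, 1): 0.2}
P_ORDER = 2


def step_prob(x, past):
    """P(x_t = x | past), with past = (x_{t-1}, ..., x_{t-p})."""
    p1 = P_ONE[past]
    return p1 if x == 1 else 1.0 - p1


def stationary(iters=2000):
    """Approximate the stationary law of the p-tuple state by power iteration."""
    states = list(itertools.product((0, 1), repeat=P_ORDER))
    dist = {s: 1.0 / len(states) for s in states}
    for _ in range(iters):
        new = {s: 0.0 for s in states}
        for past, weight in dist.items():
            for x in (0, 1):
                new[(x,) + past[:-1]] += weight * step_prob(x, past)
        dist = new
    return dist


def max_block_prob(past, length):
    """Brute-force max over blocks (x_t, ..., x_{t+length-1}) of P(block | past)."""
    best = 0.0
    for block in itertools.product((0, 1), repeat=length):
        prob, state = 1.0, past
        for x in block:
            prob *= step_prob(x, state)
            state = (x,) + state[:-1]
        best = max(best, prob)
    return best


def avg_min_entropy(length, dist):
    """Average min-entropy (bits) of `length` target bits given the past p bits."""
    guess = sum(w * max_block_prob(past, length) for past, w in dist.items())
    return -math.log2(guess)


dist = stationary()
values = [avg_min_entropy(k, dist) for k in range(1, 9)]
for k, v in enumerate(values, start=1):
    print(f"targets={k}  average min-entropy={v:.4f} bits")
```

The printed sequence should be non-decreasing in the number of target bits, in line with the inequality above. The brute-force maximisation is exponential in the block length, so this sketch is only meant for short blocks.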

Proof of Theorem 8.

Since this result is a restatement of [bozorgmanesh2016convergence, Theorem 8], we believe some explanation is necessary. First of all, what [bozorgmanesh2016convergence, Theorem 8] states is that for every $x, x_{t-1},\ldots,x_{t-p}\in S$:

\[
\lim_{n\to\infty} P(X_{t+n}=x\mid X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p})=\pi(x)
\]

where $\pi$ is a stationary distribution whose existence is required as a hypothesis. In our case, the existence of a stationary distribution $\pi$ follows from [bozorgmanesh2016convergence, Theorem 7] because we are assuming that $\{X_t\}_{t\in\mathbb{Z}}$ is irreducible. Having said that, taking the stationarity into account:

\[
\begin{aligned}
P(X_t=x) &= \lim_{n\to\infty} P(X_t=x) = \lim_{n\to\infty} P(X_{t+n}=x) \\
&= \lim_{n\to\infty} \sum_{x_{t-1},\ldots,x_{t-p}} P(X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p})\,P(X_{t+n}=x\mid X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p}) \\
&= \sum_{x_{t-1},\ldots,x_{t-p}} P(X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p}) \lim_{n\to\infty} P(X_{t+n}=x\mid X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p}) \\
&= \sum_{x_{t-1},\ldots,x_{t-p}} P(X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p})\,\pi(x) = \pi(x)
\end{aligned}
\]

and our restatement of [bozorgmanesh2016convergence, Theorem 8] follows. ∎
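The convergence statement can be illustrated numerically. The sketch below, under the assumption of a hypothetical order-2 binary chain with made-up transition probabilities `P_ONE`, starts from each possible past, propagates the state law forward, and reports the largest deviation of $P(X_{t+n}=1\mid\text{past})$ from the stationary value $\pi(1)$.

```python
import itertools

# Hypothetical order-2 binary chain: P(x_t = 1 | x_{t-1}, x_{t-2}).
P_ONE = {(0, 0): 0.7, (0, 1): 0.4, (1, 0): 0.55, (1, 1): 0.2}
STATES = list(itertools.product((0, 1), repeat=2))


def evolve(dist):
    """Push a law over states (x_{s-1}, x_{s-2}) one time step forward."""
    new = {s: 0.0 for s in STATES}
    for past, weight in dist.items():
        p1 = P_ONE[past]
        new[(1, past[0])] += weight * p1
        new[(0, past[0])] += weight * (1.0 - p1)
    return new


def prob_one(dist):
    """Probability that the most recent symbol of the state equals 1."""
    return sum(w for s, w in dist.items() if s[0] == 1)


# Stationary law of the state, approximated by power iteration.
pi_state = {s: 0.25 for s in STATES}
for _ in range(2000):
    pi_state = evolve(pi_state)
pi_one = prob_one(pi_state)


def worst_gap(n):
    """Max over pasts of |P(X_{t+n} = 1 | past) - pi(1)|."""
    gap = 0.0
    for past in STATES:
        dist = {s: (1.0 if s == past else 0.0) for s in STATES}
        for _ in range(n + 1):  # after n + 1 steps the newest symbol is X_{t+n}
            dist = evolve(dist)
        gap = max(gap, abs(prob_one(dist) - pi_one))
    return gap


horizons = [0, 5, 20, 200]
gaps = [worst_gap(n) for n in horizons]
for n, g in zip(horizons, gaps):
    print(f"n={n:3d}  worst gap={g:.2e}")
```

Because the alphabet is binary, checking the symbol 1 is enough: the deviation for the symbol 0 is identical. For this particular chain the gap shrinks quickly with $n$; other chains may converge more slowly.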

Proof of Theorem 9.

On the one hand, denoting $\tau=\left\lfloor\sqrt{n}\right\rfloor$, we have that:

\[
\begin{aligned}
P(x_t,\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})
&= \frac{P(x_{t-p},\ldots,x_{t+n})}{P(x_{t-1},\ldots,x_{t-p})} \\
&\leq \frac{P(x_{t-p},\ldots,x_{t-1},x_{t+\tau},\ldots,x_{t+n})}{P(x_{t-1},\ldots,x_{t-p})}
= P(x_{t+\tau},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).
\end{aligned}
\tag{6}
\]

Therefore:

\[
\begin{aligned}
&\sum_{x_{t-1},\ldots,x_{t-p}} P(x_{t-1},\ldots,x_{t-p}) \max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}) \\
&\qquad\leq \sum_{x_{t-1},\ldots,x_{t-p}} P(x_{t-1},\ldots,x_{t-p}) \max_{x_{t+\tau},\ldots,x_{t+n}} P(x_{t+\tau},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).
\end{aligned}
\]

In particular,

\[
\begin{aligned}
&-\frac{1}{n+1}\log\left(\sum_{x_{t-1},\ldots,x_{t-p}} P(x_{t-1},\ldots,x_{t-p}) \max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right) \\
&\qquad\geq -\frac{1}{n+1}\log\left(\sum_{x_{t-1},\ldots,x_{t-p}} P(x_{t-1},\ldots,x_{t-p}) \max_{x_{t+\tau},\ldots,x_{t+n}} P(x_{t+\tau},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right).
\end{aligned}
\tag{7}
\]

Now, the left-hand side of inequality (7) is

\[
\tilde{h}_{\infty}(X_t,\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})
\]

and the limit of the right-hand side of inequality (7) as $n$ goes to infinity is $h_{\infty}(\{X_t\}_{t\in\mathbb{Z}})$ because of the Convergence Theorem and the stationarity. Using Lemma 6 we conclude that:

\[
h_{\infty}(\{X_t\}_{t\in\mathbb{Z}}) \geq \lim_{n\to\infty} \tilde{h}_{\infty}(X_t,\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p}) \geq h_{\infty}(\{X_t\}_{t\in\mathbb{Z}})
\]

and the equality

\[
h_{\infty}(\{X_t\}_{t\in\mathbb{Z}}) = \lim_{n\to\infty} \tilde{h}_{\infty}(X_t,\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})
\]

follows. On the other hand:

\[
\begin{aligned}
h_{\infty}(X_t,\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})
&= -\frac{1}{n+1}\log\left[\max_{x_{t-p},\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right] \\
&\geq -\frac{1}{n+1}\log\left[\max_{x_{t-p},\ldots,x_{t+n}} P(x_{t+\tau},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right]
\underset{n\to\infty}{\longrightarrow} h_{\infty}(\{X_t\}_{t\in\mathbb{Z}}).
\end{aligned}
\tag{8}
\]

Note that the inequality of (8) holds because of (6) and the limit of (8) holds because of the Convergence Theorem and the stationarity. Using Lemma 6 we conclude that:

\[
h_{\infty}(\{X_t\}_{t\in\mathbb{Z}}) \geq \lim_{n\to\infty} h_{\infty}(X_t,\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p}) \geq h_{\infty}(\{X_t\}_{t\in\mathbb{Z}})
\]

and the result follows. ∎
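To make the two limits in this proof concrete, the following sketch compares the per-symbol average min-entropy $\tilde{h}_{\infty}$ with the per-symbol conditional min-entropy $h_{\infty}$ for growing block lengths. It assumes a hypothetical order-2 binary chain (`P_ONE` holds made-up values), base-2 logarithms, and a max-product dynamic programme in place of brute-force maximisation. None of these choices come from the paper.

```python
import itertools
import math

# Hypothetical order-2 binary chain: P(x_t = 1 | x_{t-1}, x_{t-2}).
P_ONE = {(0, 0): 0.7, (0, 1): 0.4, (1, 0): 0.55, (1, 1): 0.2}
STATES = list(itertools.product((0, 1), repeat=2))


def step_prob(x, past):
    """P(x_t = x | past), with past = (x_{t-1}, x_{t-2})."""
    p1 = P_ONE[past]
    return p1 if x == 1 else 1.0 - p1


def stationary(iters=2000):
    """Approximate the stationary law of the state by power iteration."""
    dist = {s: 1.0 / len(STATES) for s in STATES}
    for _ in range(iters):
        new = {s: 0.0 for s in STATES}
        for past, weight in dist.items():
            for x in (0, 1):
                new[(x, past[0])] += weight * step_prob(x, past)
        dist = new
    return dist


def max_block_prob(past, length):
    """Max-product dynamic programme over blocks of `length` symbols after `past`."""
    best = {past: 1.0}  # best path probability ending in each reachable state
    for _ in range(length):
        nxt = {}
        for state, prob in best.items():
            for x in (0, 1):
                new_state = (x, state[0])
                cand = prob * step_prob(x, state)
                if cand > nxt.get(new_state, 0.0):
                    nxt[new_state] = cand
        best = nxt
    return max(best.values())


pi_state = stationary()
rows = []
for length in [1, 4, 16, 64, 400]:  # length = n + 1 target symbols
    maxima = {s: max_block_prob(s, length) for s in STATES}
    h_avg = -math.log2(sum(pi_state[s] * maxima[s] for s in STATES)) / length
    h_min = -math.log2(max(maxima.values())) / length
    rows.append((length, h_avg, h_min))
    print(f"n+1={length:3d}  h_avg={h_avg:.4f}  h_min={h_min:.4f}")
```

The dynamic programme is valid because, for an order-$p$ chain, the probability of the remaining symbols depends on the path only through the last $p$ symbols. In this example the per-symbol average min-entropy stays above the per-symbol min-entropy and the gap between them narrows as the block length grows, which is the behaviour the theorem describes.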

.2 State-Independent Maximum Transition Probability and Bitflip Symmetric Order-$p$ Markov Chains

The following results are targeted at proving that the average min-entropy of SIMTP models coincides with their min-entropy, as claimed in Proposition 11.

Lemma 20.

Let $\{X_t\}_{t\in\mathbb{Z}}$ be a SIMTP order-$p$ Markov chain. Then:

\[
\left\langle \max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}) \right\rangle_{x_{t-1},\ldots,x_{t-p}}
= \max_{x_{t-p},\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).
\]
Proof.

Let us write:

\[
P(x_t,\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}) = \prod_{i=0}^{n} P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p}).
\tag{9}
\]

Since $P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p})$ is independent of $x_{t+i-1},\ldots,x_{t+i-p}$ for every $i\in\{0,\ldots,n\}$, it follows from (9) that:

\[
\max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}) = \prod_{i=0}^{n} \max_{x_{t+i}} P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p}).
\tag{10}
\]

Using (10), the independence of $P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p})$ with respect to $x_{t+i-1},\ldots,x_{t+i-p}$ for every $i\in\{0,\ldots,n\}$ and the fact that the sum of probabilities of all the outcomes within a sample space is $1$, we get:

\[
\begin{aligned}
\left\langle \max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}) \right\rangle_{x_{t-1},\ldots,x_{t-p}}
&= \sum_{x_{t-1},\ldots,x_{t-p}} P(x_{t-1},\ldots,x_{t-p}) \max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}) \\
&= \sum_{x_{t-1},\ldots,x_{t-p}} P(x_{t-1},\ldots,x_{t-p}) \prod_{i=0}^{n} \max_{x_{t+i}} P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p}) \\
&= \prod_{i=0}^{n} \max_{x_{t+i}} P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p}) \\
&= \max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).
\end{aligned}
\]

Proposition 21.

Any order-$p$ SIMTP satisfies the following decomposition:

\[
H_\infty(X_t,\ldots,X_{t+n}) = H_\infty(X_t,\ldots,X_{t+p-1}) + H_\infty(X_{t+p},\ldots,X_{t+n} \mid X_{t+p-1},\ldots,X_t).
\]

Proof.

Let us write:

\[
\begin{aligned}
H_\infty(X_t,\ldots,X_{t+n}) &= -\log\left[\max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n})\right] \\
&= -\log\left[\max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+p-1})\, P(x_{t+p},\ldots,x_{t+n} \mid x_{t+p-1},\ldots,x_t)\right].
\end{aligned}
\]

If the process is SIMTP, then the maximum of $P(x_{t+p},\ldots,x_{t+n} \mid x_{t+p-1},\ldots,x_t)$ can be reached independently of the values required to maximize $P(x_t,\ldots,x_{t+p-1})$. In other words, we can maximize both probabilities independently:

\[
\begin{aligned}
H_\infty(X_t,\ldots,X_{t+n}) &= -\log\left[\max_{x_t,\ldots,x_{t+p-1}} P(x_t,\ldots,x_{t+p-1})\right] - \log\left[\max_{x_t,\ldots,x_{t+n}} P(x_{t+p},\ldots,x_{t+n} \mid x_{t+p-1},\ldots,x_t)\right] \\
&= H_\infty(X_t,\ldots,X_{t+p-1}) + H_\infty(X_{t+p},\ldots,X_{t+n} \mid X_{t+p-1},\ldots,X_t).
\end{aligned}
\]
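
As a numerical sanity check of this decomposition, the following sketch brute-forces both sides for a small uniform noise gbAR(2) model with lag-2 point-to-point correlations. The parameter values, the helper names and the base-2 logarithm are assumptions made for illustration only, and the conditional min-entropy is taken in the worst-case form used in the proof above (maximum over both the target and the conditioning values).

```python
import itertools
import math

ALPHAS = (0.0, 0.6)  # hypothetical: only the lag-2 coefficient is non-zero
BETA = 0.4           # hypothetical noise weight, sum(|alpha_i|) + beta = 1
N = 6                # the block is X_t, ..., X_{t+N}
P = len(ALPHAS)      # order of the chain


def transition_prob(x_t, past):
    """P(x_t | x_{t-1}, ..., x_{t-p}); past[i-1] holds x_{t-i}."""
    prob = BETA / 2.0  # uniform noise: P(e_t = x_t) = 1/2
    for alpha, x_lag in zip(ALPHAS, past):
        agree = 1 ^ x_t ^ x_lag  # 1 XOR x_t XOR x_{t-i}
        prob += abs(alpha) * (agree if alpha >= 0 else 1 - agree)
    return prob


def stationary_blocks(iters=500):
    """Stationary law of (x_{t-1}, ..., x_{t-p}) by power iteration."""
    states = list(itertools.product((0, 1), repeat=P))
    pi = {s: 1.0 / len(states) for s in states}
    for _ in range(iters):
        nxt = dict.fromkeys(states, 0.0)
        for s, mass in pi.items():
            for x in (0, 1):
                nxt[(x,) + s[:-1]] += mass * transition_prob(x, s)
        pi = nxt
    return pi


PI = stationary_blocks()


def head_prob(seq):
    """P(x_t, ..., x_{t+p-1}); PI is keyed most-recent-first."""
    return PI[tuple(reversed(seq[:P]))]


def tail_prob(seq):
    """P(x_{t+p}, ..., x_{t+n} | x_{t+p-1}, ..., x_t)."""
    prob = 1.0
    for k in range(P, len(seq)):
        past = tuple(seq[k - i] for i in range(1, P + 1))
        prob *= transition_prob(seq[k], past)
    return prob


seqs = list(itertools.product((0, 1), repeat=N + 1))
h_joint = -math.log2(max(head_prob(s) * tail_prob(s) for s in seqs))
h_head = -math.log2(max(PI.values()))
h_tail = -math.log2(max(tail_prob(s) for s in seqs))

print(h_joint, h_head + h_tail)
```

Under these assumptions the two printed values should agree up to floating-point error, because the all-equal sequence maximizes the head and the tail factors simultaneously.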

Proof of Proposition 11.

On the one hand, by Proposition 21 we have:

\[
\begin{aligned}
H_\infty(X_t,\ldots,X_{t+n}) &= H_\infty(X_t,\ldots,X_{t+p-1}) + H_\infty(X_{t+p},\ldots,X_{t+n} \mid X_{t+p-1},\ldots,X_t) \\
&= H_\infty(X_t,\ldots,X_{t+p-1}) - \log\left[\max_{x_t} P(x_t \mid x_{t-1},\ldots,x_{t-p})^{n-p+1}\right].
\end{aligned}
\]

Then, for finite $p$, the first term is bounded, so it vanishes after dividing by $n+1$ and taking the limit:

\[
h_\infty(\{X_t\}_{t\in\mathbb{Z}}) = \lim_{n\to\infty} \frac{1}{n+1} H_\infty(X_t,\ldots,X_{t+n}) = -\log\left[\max_{x_t} P(x_t \mid x_{t-1},\ldots,x_{t-p})\right]. \tag{11}
\]

On the other hand:

\[
\begin{aligned}
&\tilde{h}_\infty(X_t,\ldots,X_{t+n} \mid X_{t-1},\ldots,X_{t-p}) \\
&\quad= -\log\left[\left(\sum_{x_{t-1},\ldots,x_{t-p}} P(x_{t-1},\ldots,x_{t-p}) \max_{x_t,\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n} \mid x_{t-1},\ldots,x_{t-p})\right)^{\frac{1}{n+1}}\right] \\
&\quad= -\log\left[\left(\sum_{x_{t-1},\ldots,x_{t-p}} P(x_{t-1},\ldots,x_{t-p}) \max_{x_t} P(x_t \mid x_{t-1},\ldots,x_{t-p})^{n+1}\right)^{\frac{1}{n+1}}\right] \\
&\quad= -\log\left[\left(\sum_{x_{t-1},\ldots,x_{t-p}} P(x_{t-1},\ldots,x_{t-p})\right)^{\frac{1}{n+1}} \max_{x_t} P(x_t \mid x_{t-1},\ldots,x_{t-p})\right] \\
&\quad= -\log\left[\max_{x_t} P(x_t \mid x_{t-1},\ldots,x_{t-p})\right].
\end{aligned}
\tag{12}
\]

The result follows by combining (11) and (12).
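
To make the comparison between (11) and (12) concrete, the sketch below evaluates both quantities by brute force for a positive, uniform noise gbAR(1) model, whose maximum transition probability is the same for both values of $x_{t-1}$. The parameter values, the function names and the base-2 logarithm are illustrative assumptions, not part of the original derivation.

```python
import itertools
import math

ALPHA, BETA = 0.6, 0.4     # hypothetical parameters with alpha + beta = 1
P_STAY = ALPHA + BETA / 2  # P(x_t = x_{t-1}) = 1 - beta/2


def step(x_t, x_prev):
    """One-step transition probability P(x_t | x_{t-1})."""
    return P_STAY if x_t == x_prev else 1.0 - P_STAY


def cond_prob(seq, x_prev):
    """P(seq | x_prev) as a product of one-step transitions."""
    prob = 1.0
    for x in seq:
        prob *= step(x, x_prev)
        x_prev = x
    return prob


def block_min_entropy_rate(n):
    """H_inf(X_t, ..., X_{t+n}) / (n + 1); the stationary marginal is uniform."""
    best = max(0.5 * cond_prob(seq[1:], seq[0])
               for seq in itertools.product((0, 1), repeat=n + 1))
    return -math.log2(best) / (n + 1)


def average_min_entropy_per_bit(n):
    """Per-bit average min-entropy of n + 1 target bits given x_{t-1}."""
    guess = sum(0.5 * max(cond_prob(seq, x_prev)
                          for seq in itertools.product((0, 1), repeat=n + 1))
                for x_prev in (0, 1))
    return -math.log2(guess) / (n + 1)


limit = -math.log2(P_STAY)
for n in (0, 2, 4, 8):
    print(n, block_min_entropy_rate(n), average_min_entropy_per_bit(n), limit)
```

In this toy setting the per-bit average min-entropy matches the limit for every $n$, whereas the normalized block min-entropy only approaches it as $n$ grows, since the bounded first term decays like $1/(n+1)$.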

We end this subsection of the Appendix by proving that Bitflip-Symmetric Markov Chains with lag-p point-to-point correlations are SIMTP, as stated in Lemma 13.

Proof of Lemma 13.

Let $x_{t-p} \in \{0,1\}$. Then, by the definition of bitflip symmetry:

\[
\begin{aligned}
\max_{x_t} P(x_t \mid x_{t-p}) &= \max\{P(X_t = 0 \mid x_{t-p}),\, P(X_t = 1 \mid x_{t-p})\} \\
&= \max\{P(X_t = 1 \mid 1 \oplus x_{t-p}),\, P(X_t = 0 \mid 1 \oplus x_{t-p})\} \\
&= \max_{x_t} P(x_t \mid 1 \oplus x_{t-p})
\end{aligned}
\]

and the result follows. ∎
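
The argument can be illustrated with a two-by-two transition table. In the minimal sketch below, the kernels and helper names are hypothetical: a bitflip-symmetric kernel yields the same maximum transition probability for both values of $x_{t-p}$, while a kernel without that symmetry need not.

```python
def max_transition(kernel, x_lag):
    """Max over x_t of P(x_t | x_{t-p} = x_lag); kernel[x_lag][x_t] is the probability."""
    return max(kernel[x_lag])


def is_bitflip_symmetric(kernel):
    """Check P(x_t | x_{t-p}) == P(1 xor x_t | 1 xor x_{t-p}) for all bit pairs."""
    return all(kernel[a][b] == kernel[1 - a][1 - b]
               for a in (0, 1) for b in (0, 1))


# Hypothetical kernels: rows index x_{t-p}, columns index x_t.
symmetric = ((0.3, 0.7), (0.7, 0.3))
asymmetric = ((0.9, 0.1), (0.4, 0.6))

for name, kernel in (("symmetric", symmetric), ("asymmetric", asymmetric)):
    print(name, is_bitflip_symmetric(kernel),
          max_transition(kernel, 0), max_transition(kernel, 1))
```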

.3 Generalized Binary Autoregressive Models

The last subsection of the Appendix gathers some useful properties of gbAR(p) models. In particular, these properties allow us to prove the formula for the min-entropy of uniform noise and positive gbAR(p) models stated in Proposition 16.

Remark 22.

Let $\{X_t\}_{t\in\mathbb{Z}}$ be a uniform noise gbAR(p) model. Then $P(X_t = 0) = P(X_t = 1) = \frac{1}{2}$ for every $t \in \mathbb{Z}$ by [jentsch2019generalized, Lemma 1].

Remark 23.

By [jentsch2019generalized, Lemma 1] the transition probabilities of a gbAR(p) model $\{X_t\}_{t\in\mathbb{Z}}$ can be written as:

\[
\begin{aligned}
&P(x_t \mid x_{t-1},\ldots,x_{t-p}) \\
&\quad= \sum_{i=1}^{p} \left[\mathds{1}_{\{\alpha_i \geq 0\}} |\alpha_i| \cdot (1 \oplus x_t \oplus x_{t-i}) + \mathds{1}_{\{\alpha_i < 0\}} |\alpha_i| \cdot (x_t \oplus x_{t-i})\right] + \beta \cdot P(e_t = x_t).
\end{aligned}
\]

If $\{X_t\}_{t\in\mathbb{Z}}$ is a uniform noise and positive gbAR(p) model, this reduces to

\[
P(x_t \mid x_{t-1},\ldots,x_{t-p}) = \sum_{i=1}^{p} \alpha_i \cdot (1 \oplus x_t \oplus x_{t-i}) + \frac{\beta}{2}
\]

and this conditional probability reaches the maximum value

\[
\max_{x_{t-1},\ldots,x_{t-p}} P(x_t \mid x_{t-1},\ldots,x_{t-p}) = \sum_{i=1}^{p} \alpha_i + \frac{\beta}{2} = 1 - \frac{\beta}{2}
\]

when $x_t = x_{t-1} = \cdots = x_{t-p}$. In particular, a realization $x_{t-p},\ldots,x_{t+n}$ of $X_{t-p},\ldots,X_{t+n}$ such that

\[
x_{t-p} = \cdots = x_{t+n}
\]

maximizes the conditional probability $P(x_k \mid x_{k-1},\ldots,x_{k-p})$ for every $k \in \{t,\ldots,t+n\}$, and it follows that:

\[
\begin{aligned}
\max_{x_{t-p},\ldots,x_{t+n}} P(x_t,\ldots,x_{t+n} \mid x_{t-1},\ldots,x_{t-p})
&= \max_{x_{t-p},\ldots,x_{t+n}} \prod_{k=t}^{t+n} P(x_k \mid x_{k-1},\ldots,x_{k-p}) \\
&= \prod_{k=t}^{t+n} \max_{x_k,\ldots,x_{k-p}} P(x_k \mid x_{k-1},\ldots,x_{k-p}) = \left(1 - \frac{\beta}{2}\right)^{n+1}.
\end{aligned}
\]

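
The closed form above can be checked by exhaustive search on a small instance. The following is a minimal sketch under assumed parameter values; the function names and the order-3 coefficients are hypothetical and only serve to exercise the transition formula for uniform noise and positive gbAR(p) models.

```python
import itertools


def transition_prob(x_t, past, alphas, beta):
    """Positive, uniform noise gbAR(p): sum_i alpha_i*(1 xor x_t xor x_{t-i}) + beta/2.

    past[i-1] holds x_{t-i}.
    """
    return sum(a * (1 ^ x_t ^ x_lag) for a, x_lag in zip(alphas, past)) + beta / 2


def max_block_prob(alphas, beta, n):
    """Max over x_{t-p}, ..., x_{t+n} of P(x_t, ..., x_{t+n} | x_{t-1}, ..., x_{t-p})."""
    p = len(alphas)
    best = 0.0
    for seq in itertools.product((0, 1), repeat=p + n + 1):
        # seq[0:p] plays the role of x_{t-p..t-1}, seq[p:] of x_{t..t+n}
        prob = 1.0
        for k in range(p, p + n + 1):
            past = [seq[k - i] for i in range(1, p + 1)]
            prob *= transition_prob(seq[k], past, alphas, beta)
        best = max(best, prob)
    return best


alphas, beta, n = (0.2, 0.1, 0.3), 0.4, 4  # hypothetical, sum(alphas) + beta = 1
print(max_block_prob(alphas, beta, n), (1 - beta / 2) ** (n + 1))
```

The two printed values should coincide up to rounding, since every factor is at most $1 - \frac{\beta}{2}$ and the constant sequences attain that bound at every step.
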
Remark 24.

Note that a gbAR(p) model has point-to-point lag-p correlations if $a_t^{(i)} = \delta_{i,p}$ for every $i \in \{1,\ldots,p\}$ and every $t \in \mathbb{Z}$. This is achieved when $\alpha_i = 0$ for every $i \in \{1,\ldots,p-1\}$.

Lemma 25.

A uniform noise gbAR(p) model with point-to-point lag-p correlations is bitflip-symmetric.

Proof.

By the definition of conditional probability and the definition of gbAR(p) model with point-to-point lag-p correlations:

\[
P(x_t,\ldots,x_{t+n} \mid x_{t-1},\ldots,x_{t-p}) = \prod_{i=0}^{n} P(x_{t+i} \mid x_{t+i-p}). \tag{13}
\]

Then, by Remark 22 and Remark 23:

\[
\begin{aligned}
P(x_{t+i} \mid x_{t+i-p})
&= \left[\mathds{1}_{\{\alpha_p \geq 0\}} |\alpha_p| \cdot (1 \oplus x_{t+i} \oplus x_{t+i-p}) + \mathds{1}_{\{\alpha_p < 0\}} |\alpha_p| \cdot (x_{t+i} \oplus x_{t+i-p})\right] + \frac{\beta}{2} \\
&= \left[\mathds{1}_{\{\alpha_p \geq 0\}} |\alpha_p| \cdot (1 \oplus 1 \oplus x_{t+i} \oplus 1 \oplus x_{t+i-p}) + \mathds{1}_{\{\alpha_p < 0\}} |\alpha_p| \cdot (1 \oplus x_{t+i} \oplus 1 \oplus x_{t+i-p})\right] + \frac{\beta}{2} \\
&= P(1 \oplus x_{t+i} \mid 1 \oplus x_{t+i-p}).
\end{aligned}
\tag{14}
\]

Putting (13) and (14) together:

\[
\begin{aligned}
P(x_t,\ldots,x_{t+n} \mid x_{t-1},\ldots,x_{t-p})
&= \prod_{i=0}^{n} P(x_{t+i} \mid x_{t+i-p}) = \prod_{i=0}^{n} P(1 \oplus x_{t+i} \mid 1 \oplus x_{t+i-p}) \\
&= P(1 \oplus x_t,\ldots,1 \oplus x_{t+n} \mid 1 \oplus x_{t-1},\ldots,1 \oplus x_{t-p}).
\end{aligned}
\]
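
The symmetry established in this proof can also be checked exhaustively for short blocks. The sketch below is an illustration under assumed parameters: the function names, the coefficient values and the optional `p_one` argument (the probability that the noise bit equals 1) are hypothetical. It confirms the symmetry for both signs of $\alpha_p$ when the noise is uniform, and shows that a biased noise term breaks it.

```python
import itertools
import math


def lag_p_transition(x_t, x_lag, alpha_p, beta, p_one=0.5):
    """P(x_t | x_{t-p}) for a gbAR(p) model with point-to-point lag-p correlations."""
    agree = 1 ^ x_t ^ x_lag                       # 1 xor x_t xor x_{t-p}
    weight = agree if alpha_p >= 0 else 1 - agree  # sign of alpha_p picks the term
    noise = p_one if x_t == 1 else 1.0 - p_one     # P(e_t = x_t)
    return abs(alpha_p) * weight + beta * noise


def block_cond_prob(seq, p, alpha_p, beta, p_one=0.5):
    """P(x_t, ..., x_{t+n} | x_{t-1}, ..., x_{t-p}); seq lists x_{t-p}, ..., x_{t+n}."""
    prob = 1.0
    for k in range(p, len(seq)):
        prob *= lag_p_transition(seq[k], seq[k - p], alpha_p, beta, p_one)
    return prob


def is_bitflip_symmetric(p, alpha_p, beta, n, p_one=0.5):
    """Compare every block with its bitwise complement."""
    for seq in itertools.product((0, 1), repeat=p + n + 1):
        flipped = tuple(1 ^ x for x in seq)
        if not math.isclose(block_cond_prob(seq, p, alpha_p, beta, p_one),
                            block_cond_prob(flipped, p, alpha_p, beta, p_one)):
            return False
    return True


print(is_bitflip_symmetric(2, 0.5, 0.5, 3))             # uniform noise, positive alpha_p
print(is_bitflip_symmetric(2, -0.5, 0.5, 3))            # uniform noise, negative alpha_p
print(is_bitflip_symmetric(2, 0.5, 0.5, 3, p_one=0.7))  # biased noise
```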

Proposition 26.

Let $\{X_t\}_{t\in\mathbb{Z}}$ be a uniform noise and positive gbAR(p) model. Then $\{X_t\}_{t\in\mathbb{Z}}$ satisfies the hypothesis of the Convergence Theorem, i.e. $\{X_t\}_{t\in\mathbb{Z}}$ is an irreducible and aperiodic stationary order-$p$ Markov chain with finite state-space.

Proof.

$\{X_t\}_{t\in\mathbb{Z}}$ is a stationary order-$p$ Markov chain with finite state-space $S=\{0,1\}$ by Definition 14. Now, $\{X_t\}_{t\in\mathbb{Z}}$ is irreducible because for every $x_{t-p},\ldots,x_t\in S$ we have:

$$P(x_t\mid x_{t-1},\ldots,x_{t-p})=\sum_{i=1}^{p}\alpha_i\cdot(1\oplus x_t\oplus x_{t-i})+\frac{\beta}{2}>0. \tag{15}$$

Finally, since

$$\sum_{x_{t-2},\ldots,x_{t-p}}P(x_{t-2},\ldots,x_{t-p})=1,$$

there exist $y_{t-2},\ldots,y_{t-p}\in S$ such that $P(y_{t-2},\ldots,y_{t-p})>0$. Then, by equation (15):

$$
\begin{aligned}
P(x_t\mid x_{t-1}) &= \sum_{x_{t-2},\ldots,x_{t-p}}P(x_{t-2},\ldots,x_{t-p})\,P(x_t\mid x_{t-1},x_{t-2},\ldots,x_{t-p}) \\
&\geq P(y_{t-2},\ldots,y_{t-p})\,P(x_t\mid x_{t-1},y_{t-2},\ldots,y_{t-p})>0
\end{aligned}
$$

and we conclude that $\{X_t\}_{t\in\mathbb{Z}}$ is aperiodic. ∎
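As a numerical illustration of Proposition 26, the sketch below lifts the order-$p$ chain to a first-order chain on $p$-tuples and checks that the $p$-th power of its transition matrix is strictly positive, which is a standard sufficient condition for irreducibility and aperiodicity. The lifting is a common textbook device and is not necessarily the construction used in the proof above; the parameters `ALPHAS` and `BETA` and all function names are hypothetical.

```python
from itertools import product

# Hypothetical parameters (not taken from the paper), with p = 2.
ALPHAS = [0.6, 0.3]
BETA = 0.1  # chosen so that sum(ALPHAS) + BETA == 1


def transition_prob(x_t, past, alphas=ALPHAS, beta=BETA):
    """P(x_t | x_{t-1}, ..., x_{t-p}) in the form of equation (15)."""
    return sum(a * (1 ^ x_t ^ x) for a, x in zip(alphas, past)) + beta / 2


def min_transition(alphas=ALPHAS, beta=BETA):
    """Smallest one-step probability over all pasts and all next bits."""
    return min(
        transition_prob(bit, past, alphas, beta)
        for past in product([0, 1], repeat=len(alphas))
        for bit in (0, 1)
    )


def state_matrix(alphas=ALPHAS, beta=BETA):
    """Transition matrix of the lifted chain on tuples (x_{t-1}, ..., x_{t-p})."""
    states = list(product([0, 1], repeat=len(alphas)))
    index = {s: k for k, s in enumerate(states)}
    matrix = [[0.0] * len(states) for _ in states]
    for s in states:
        for bit in (0, 1):
            nxt = (bit,) + s[:-1]  # new bit enters, oldest bit drops out
            matrix[index[s]][index[nxt]] = transition_prob(bit, s, alphas, beta)
    return matrix


def matmul(a, b):
    """Plain-Python product of two square matrices of equal size."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def power(matrix, exponent):
    """matrix ** exponent for an integer exponent >= 1."""
    result = matrix
    for _ in range(exponent - 1):
        result = matmul(result, matrix)
    return result


def all_positive(matrix):
    """True when every entry of the matrix is strictly positive."""
    return all(value > 0 for row in matrix for value in row)


if __name__ == "__main__":
    m = state_matrix()
    print("min one-step probability:", min_transition())
    print("P^p strictly positive:", all_positive(power(m, len(ALPHAS))))
```

With $\beta>0$ the smallest one-step probability is $\beta/2$, attained when the next bit disagrees with every lagged bit, so every $p$-tuple can be reached from every other in exactly $p$ steps.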

Proof of Proposition 16.

Since $\{X_t\}_{t\in\mathbb{Z}}$ satisfies the hypothesis of the Convergence Theorem 8 by Proposition 26, the equalities

$$
\begin{aligned}
h_\infty(\{X_t\}_{t\in\mathbb{Z}}) &= \lim_{n\to\infty}\tilde{h}_\infty(X_t,\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p}) \\
&= \lim_{n\to\infty}h_\infty(X_t,\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})
\end{aligned}
$$

follow from Theorem 9. Now,

$$
\begin{aligned}
h_\infty(X_t,\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})
&= -\frac{1}{n+1}\log\left[\max_{x_{t-p},\ldots,x_{t+n}}P(x_t,\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right] \\
&= -\frac{1}{n+1}\log\left[\max_{x_{t-p},\ldots,x_{t+n}}\prod_{i=0}^{n}P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p})\right] \\
&= -\frac{1}{n+1}\log\left[\prod_{i=0}^{n}\left(\sum_{i=1}^{p}\alpha_i+\frac{\beta}{2}\right)\right] \\
&= -\frac{1}{n+1}\log\left[\left(\sum_{i=1}^{p}\alpha_i+\frac{\beta}{2}\right)^{n+1}\right] \\
&= -\frac{1}{n+1}\log\left[\left(1-\frac{\beta}{2}\right)^{n+1}\right] = -\log\left(1-\frac{\beta}{2}\right).
\end{aligned}
$$
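The closed form $-\log(1-\beta/2)$ can be compared against a brute-force maximization over all pasts and all target blocks. The sketch below assumes the transition probability of equation (15), non-negative weights with $\sum_i\alpha_i+\beta=1$, and base-2 logarithms (the base is an assumption, since the derivation only writes $\log$); `ALPHAS`, `BETA` and the function names are hypothetical.

```python
import math
from itertools import product

# Hypothetical parameters (not taken from the paper), with p = 2.
ALPHAS = [0.4, 0.3]
BETA = 0.3  # chosen so that sum(ALPHAS) + BETA == 1


def transition_prob(x_t, past, alphas=ALPHAS, beta=BETA):
    """P(x_t | x_{t-1}, ..., x_{t-p}) in the form of equation (15)."""
    return sum(a * (1 ^ x_t ^ x) for a, x in zip(alphas, past)) + beta / 2


def block_prob(future, past, alphas=ALPHAS, beta=BETA):
    """P(x_t, ..., x_{t+n} | past) as a product of one-step probabilities."""
    history = list(past)  # most recent bit first
    prob = 1.0
    for bit in future:
        prob *= transition_prob(bit, history, alphas, beta)
        history = [bit] + history[:-1]
    return prob


def brute_force_rate(n, alphas=ALPHAS, beta=BETA):
    """-(1/(n+1)) * log2 of the largest conditional block probability."""
    p = len(alphas)
    best = max(
        block_prob(future, past, alphas, beta)
        for past in product([0, 1], repeat=p)
        for future in product([0, 1], repeat=n + 1)
    )
    return -math.log2(best) / (n + 1)


def closed_form_rate(beta=BETA):
    """-log2(1 - beta/2), the value obtained in the derivation above."""
    return -math.log2(1 - beta / 2)


if __name__ == "__main__":
    for n in range(5):
        print(n, brute_force_rate(n), closed_form_rate())
```

In this setting the maximum is attained by a constant sequence: every lagged bit then agrees with the next bit, so each factor equals $\sum_i\alpha_i+\beta/2=1-\beta/2$ and the per-bit rate does not depend on $n$.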


Appendix A Biography Section

Francisco Javier Blanco Romero received his B.Sc. in Physics from the Universidad Complutense de Madrid and is finalizing his M.Sc. degree in Robotics at Universidad Miguel Hernández de Elche. He has worked in the private sector as a software engineer and researcher, focusing on Internet of Things (IoT) projects in the environmental and mobility domains. Currently, he is a Research Support Technician at Universidad Carlos III de Madrid, involved in the Quantum-based Resistant Architectures and Techniques project.
Vicente Lorenzo received the B.Sc. degree in mathematics from the Universidad Complutense de Madrid, the M.Sc. degree in mathematics and applications from the Universidad Autónoma de Madrid and the Ph.D. degree in mathematics from Instituto Superior Técnico - Universidade de Lisboa. He was a Lecturer with Instituto Superior de Economia e Gestão - Universidade de Lisboa, Instituto Superior de Ciências do Trabalho e da Empresa - Instituto Universitário de Lisboa and Universidad CEU San Pablo. He is currently a Project Researcher with Universidad Carlos III de Madrid where he is involved in the Quantum-based Resistant Architectures and Techniques project.
Florina Almenares Mendoza received the M.Sc. degree in Telematic and the Ph.D. degree from the University Carlos III of Madrid, in 2003 and 2006, respectively. Since 2008, she has worked as an Associate Professor with the Department of Telematic Engineering, University Carlos III of Madrid. Her research interests include post-quantum cryptography, cybersecurity, machine learning, trust and reputation management models, identity management, secure architectures, and risk assessment. This research has been recently applied to ubiquitous computing and IoT, smart grids, and smart cities. She received the IEEE Chester Sall Award (2012).
Daniel Díaz Sánchez received the B.Eng. degree in Telecommunications in 2003, M.Sc. in Telematics in 2007, and Ph.D. in 2009 from Carlos III University of Madrid. From 2004 to 2006, he was a Researcher and became a Teaching Assistant in 2005. Since 2010, he has been an associate professor at the University Carlos III of Madrid. His research interests include distributed authentication/authorization, cybersecurity, distributed and fog computing, IoT, and related concepts. Dr. Díaz-Sánchez has received several awards and honors, including the Especial Ph.D. award from University Carlos III of Madrid (2009), the best Ph.D. thesis award on electronic commerce by the National Telecommunication Engineering Association (2009), and the IEEE Chester Sall Award (2012).