Machine Learning Predictors
for Min-Entropy Estimation

Javier Blanco-Romero (https://orcid.org/0009-0004-0635-953X), Vicente Lorenzo (https://orcid.org/0000-0003-2077-6095), Florina Almenares Mendoza (https://orcid.org/0000-0002-5232-2031), Daniel Díaz-Sánchez (https://orcid.org/0000-0002-3323-6453)

Javier Blanco-Romero is with the Department of Telematic Engineering, Universidad Carlos III de Madrid, Leganés, Madrid, 28911, Spain (e-mail: [email protected]). Vicente Lorenzo is with the Department of Telematic Engineering, Universidad Carlos III de Madrid, Leganés, Madrid, 28911, Spain (e-mail: [email protected]). Florina Almenares Mendoza is with the Department of Telematic Engineering, Universidad Carlos III de Madrid, Leganés, Madrid, 28911, Spain (e-mail: [email protected]). Daniel Díaz-Sánchez is with the Department of Telematic Engineering, Universidad Carlos III de Madrid, Leganés, Madrid, 28911, Spain (e-mail: [email protected]).

Abstract

This study investigates the application of machine learning predictors for min-entropy estimation in Random Number Generators (RNGs), a key component in cryptographic applications where accurate entropy assessment is essential for cybersecurity. Our research indicates that these predictors, and indeed any predictor that leverages sequence correlations, primarily estimate average min-entropy, a metric not extensively studied in this context. We explore the relationship between average min-entropy and the traditional min-entropy, focusing on their dependence on the number of target bits being predicted. Utilizing data from Generalized Binary Autoregressive Models, a subset of Markov processes, we demonstrate that machine learning models (including a hybrid of convolutional and recurrent Long Short-Term Memory layers and the transformer-based GPT-2 model) outperform traditional NIST SP 800-90B predictors in certain scenarios. Our findings underscore the importance of considering the number of target bits in min-entropy assessment for RNGs and highlight the potential of machine learning approaches in enhancing entropy estimation techniques for improved cryptographic security.

Index Terms:

Min-entropy Estimation, Machine Learning Predictors, Random Number Generators, Autoregressive Processes, Generalized Binary Autoregressive Models

I Introduction

The security of cryptographic systems often hinges on the generation of random values. Although there is a broad spectrum of algorithms and devices used to generate these random values, they are all generically denoted by Random Number Generators (RNGs). Given the important role that RNGs play in the context of cybersecurity, it becomes evident that rigorous criteria are necessary for evaluating the reliability and performance of an RNG.

Multiple approaches are commonly employed to assess the quality of the output of an RNG (cf. [bassham2010sp], [marsaglia2008marsaglia], [abbott2019experimentally], [calude2010experimental], [kavulich2021searching], [bird2020effects], etc.). In this paper the emphasis will be put on:

•

Entropy tests, such as those found in NIST Special Publication 800-90B [turan2018recommendation], which estimate the entropy of a noise source from appropriate samples (cf. [abraham2022high], [islam2022using], [li2020jitter], etc.).
•

Machine learning models trained on the output of an RNG to guess the bit or set of bits that follow a given sequence, which give insight into how predictable the output of the RNG is (cf. [truong2018machine], [yang2018neural], [lv2020high], [li2020deep], [feng2020testing], [li2023improvement], etc.).

The fact that the entropy of a given source and the predictability of its output are correlated was already noticed by Shannon [shannon1951prediction]. Nevertheless, the link between these two concepts is far from being completely understood, especially if one takes into account the heterogeneity of entropy definitions found in the literature and how strongly the predictability of the output of an entropy source depends on the predictor being considered. Building on the evidence provided by [kelsey2015predictive] that the entropy estimators considered by NIST Special Publication 800-90B [turan2018recommendation] tend to underestimate min-entropy, this paper attempts to reinforce the argument of [zhu2017analysis] that predictors are better suited to estimating average min-entropy [dodis2004fuzzy] than min-entropy. In the first stage, the theoretical framework required to support our thesis is developed. In the second stage, experimental validation of the theoretical analysis is conducted.

Whereas [kelsey2015predictive] concentrates on Ensemble, Categorical Data, and Numerical predictors, the focus of this paper will be on machine learning predictors. In particular, a hybrid model that integrates convolutional and recurrent Long Short-Term Memory (LSTM) layers and the transformer-based GPT-2 model will be considered. As in [zhu2017analysis], we generate sets of data for which a theoretical entropy can be calculated so that the machine learning entropy estimation can be compared to the theoretical value. Nevertheless, while the data generated in [zhu2017analysis] comes from an oscillator-based model and Markov processes of order at most $2$ , our data comes from Generalized Binary Autoregressive Models [jentsch2019generalized], a subclass of Markov chains that allows us to easily parameterize correlations and anticorrelations at the bit level and compute min-entropies.

Our research also investigates the influence of the number of target bits on the estimation of min-entropy. We demonstrate that the relationship between average min-entropy and min-entropy is significantly affected by the number of target bits being predicted. This finding highlights the importance of considering the target bit count when assessing the min-entropy of RNGs using machine learning predictors.

The remainder of this paper is structured as follows: Section II presents a literature review, discussing the current state of the art in the application of predictors for min-entropy estimation. In Section III, we establish the theoretical framework, where we study the concept of average min-entropy and its relationship with min-entropy, deriving a series of results for order-p Markov chains and gbAR(p) models. Section IV outlines our experimental methodology, aimed at validating the theoretical findings. Section V presents the results of our experiments, followed by Section VI, which discusses these results and their implications. Finally, Section VII concludes the paper with a summary of our findings and suggestions for future research.

I-A Notation and Conventions

The following notation and conventions will be considered throughout this paper:

•

Random Variables: Uppercase letters $X_{1},X_{2},A,\ldots$ represent random variables, while their corresponding realizations are represented by lowercase letters $x_{1},x_{2},a,\ldots$. By abuse of notation, $P(A=a)$ will be denoted by $P(a)$. Furthermore, by $X\sim B(n,p)$ we mean that $X$ is a random variable that follows a binomial distribution with number of trials $n$ and success probability $p$, and by $(X_{1},\ldots,X_{k})\sim\text{Mult}(n;p_{1},\ldots,p_{k})$ we mean that $(X_{1},\ldots,X_{k})$ is a multivariate random variable that follows a multinomial distribution with number of trials $n$ and probability vector $(p_{1},\ldots,p_{k})$.

•

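Expected Values: The notation $\langle\cdot\rangle$ is used to indicate expected values. Given discrete random variables $X_{t-p},\ldots,X_{t+n}$ where $t,p,n\in\mathbb{Z}$ and $p,n$ are non-negative, we will be particularly interested in expressions of the following type, in which the expectation is taken over the conditioning variables (cf. the average min-entropy in Definition 1):

$\left\langle\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right\rangle_{x_{t-1},\ldots,x_{t-p}}=\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).$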

•

Logarithms: All logarithmic functions are considered to be base 2 and are denoted by $\log$ .

II State of the Art (Literature Review)

II-A Entropies

The relationship between entropy and the predictability of a sequence was first investigated by Shannon [shannon1951prediction], who noticed that the problem of prediction is fundamentally connected to the concept of entropy. Min-entropy, denoted as $H_{\infty}(X)$ , represents the negative logarithm of the probability of a correct guess on the random variable $X$ under an optimal strategy [cachin1997entropy]. Mathematically, the probability of guessing the most likely output of an entropy source can be expressed as:

2^{-H_{\infty}(X)}=\max_{x\in X}P_{X}(x)\ .

In cryptography, min-entropy is an important measure, as it provides a conservative estimate of the difficulty of guessing or predicting the most likely output of the entropy source, as emphasized in the NIST Recommendation [turan2018recommendation].
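As a minimal illustration of the definition above, the following sketch computes the min-entropy of a fully known discrete distribution. The function name `min_entropy` and the two example distributions are hypothetical and chosen only for illustration; they do not come from any of the cited works.

```python
import math

def min_entropy(pmf):
    """Min-entropy in bits of a distribution given as {outcome: probability}."""
    total = sum(pmf.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Only the most likely outcome matters: H_inf = -log2(max_x P(x)).
    return -math.log2(max(pmf.values()))

# Hypothetical biased bit: the best single guess ("1") succeeds 75% of the time.
biased_bit = {0: 0.25, 1: 0.75}
# Uniform byte: every guess succeeds with probability 1/256.
uniform_byte = {v: 1 / 256 for v in range(256)}

h_biased = min_entropy(biased_bit)     # -log2(0.75)
h_uniform = min_entropy(uniform_byte)  # 8 bits
```

In practice the distribution is unknown, which is what motivates the estimators discussed in the rest of this section.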

Moreover, entropy estimation is complex when the output distribution is unknown and typical assumptions, such as outputs being independent and identically distributed (i.i.d.), do not apply. Good entropy estimation requires an understanding of the underlying nondeterministic process of the entropy source; statistical tests can only act as a sanity check on such an estimate [kelsey2015predictive].

In this context, the concept of predictors has been introduced. As described by Kelsey et al., a predictor contains a dynamic model that operates through a four-step process: 1) Assume a probability model for the source, 2) Estimate the model’s parameters from the input sequence on-the-fly, 3) Use these parameters to attempt to predict the still-unseen values in the input sequence, and 4) Estimate the min-entropy of the source from the performance of these predictions [kelsey2015predictive]. Unlike traditional machine learning methods, this approach is parametric and relies on a model of the underlying probability distribution. Another difference from traditional supervised learning methods, which separate training and testing sets, is that predictors remain in the training phase indefinitely, allowing for continuous adaptation and improvement in prediction accuracy. Predictors are characterized by two primary performance metrics. The first, global predictability, gauges the long-term accuracy of predictions. Specifically, a predictor’s global accuracy $p_{\text{acc}}$ represents the probability that it will correctly predict a given sample from a noise source over an extended sequence, effectively measuring the percentage of correct predictions. The second, local predictability, emphasizes the length of the longest streak of correct predictions, becoming important when the source produces highly predictable outputs in short spurts. The final entropy estimate for a predictor is determined by the lesser value between the global and local entropy estimates, represented by $\hat{H}=\min(\hat{H}_{\text{global}},\hat{H}_{\text{local}})$ .

Hence, predictors play a significant role in setting bounds on an attacker’s performance, linking predictability to min-entropy. For an account of how predictors were introduced into NIST SP 800-90B, see [lv2020high].
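The four-step predictor loop and the global accuracy $p_{\text{acc}}$ can be sketched as follows. This is a deliberately simplified toy, not the NIST SP 800-90B procedure: it uses a hypothetical "most common value so far" model, converts the raw accuracy directly into $-\log p_{\text{acc}}$, and omits both the confidence bound on $p_{\text{acc}}$ and the local (longest-run) estimate. The function name and the simulated source are assumptions made for illustration.

```python
import math
import random

def global_min_entropy_estimate(bits):
    """Run a toy online predictor; return (accuracy, -log2(accuracy))."""
    counts = [0, 0]
    correct = 0
    for b in bits:
        guess = 1 if counts[1] > counts[0] else 0  # predict before seeing b
        correct += (guess == b)
        counts[b] += 1                             # then update the model on-the-fly
    p_acc = correct / len(bits)
    return p_acc, -math.log2(p_acc)

# Hypothetical i.i.d. source emitting 1 with probability 0.7.
rng = random.Random(0)
bits = [1 if rng.random() < 0.7 else 0 for _ in range(20000)]
p_acc, h_global = global_min_entropy_estimate(bits)
```

For this i.i.d. source the accuracy approaches 0.7, so the estimate approaches the true min-entropy $-\log 0.7$; for correlated sources, a predictor that exploits the correlations would report a lower value.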

Zhu et al. examined the issue of underestimation in non-IID data pertaining to the NIST collision and compression tests, proposing an enhanced method to address the underestimation of min-entropy [zhu2017analysis]. They introduced a novel formula specifically aimed at high-order Markov processes, founded on the principles of conditional probability. Furthermore, they highlighted that the correct prediction probability within a predictor can also be understood as a form of conditional probability.

Zhu’s min-entropy formula for the Markov process can be related to the concept of average min-entropy as defined by Dodis [dodis2004fuzzy]. Average min-entropy considers the predictability of a random variable given another possibly correlated random variable and can be expressed as:

	$\displaystyle\tilde{H}_{\infty}(A\mid B)=-\log\left(\langle\max_{a}P(a\mid b)\rangle_{b}\right)=-\log\left(\langle 2^{-H_{\infty}(A\mid B=b)}\rangle_{b}\right)\ ,$

where

H_{\infty}(A\mid B=b)=-\log\left[\max_{a}P(a\mid b)\right].

Dodis’ definition of average min-entropy offers valuable insights into the logarithm of predictability, presenting it as a “worst-case” entropy measure [dodis2004fuzzy].
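To make the averaging over $b$ concrete, the sketch below evaluates the average min-entropy from a small joint probability table and compares it with the unconditional min-entropy of $A$. The helper names and the joint distribution are hypothetical; the computation uses the identity $P(b)\max_{a}P(a\mid b)=\max_{a}P(a,b)$.

```python
import math

def average_min_entropy(joint):
    """joint[(a, b)] = P(A=a, B=b). Returns -log2( sum_b P(b) * max_a P(a|b) )."""
    best = {}
    for (a, b), p in joint.items():
        # P(b) * max_a P(a|b) equals max_a P(a, b), so track that per b.
        best[b] = max(best.get(b, 0.0), p)
    return -math.log2(sum(best.values()))

def min_entropy_marginal(joint):
    """Unconditional min-entropy of A, ignoring the side information B."""
    marginal = {}
    for (a, _b), p in joint.items():
        marginal[a] = marginal.get(a, 0.0) + p
    return -math.log2(max(marginal.values()))

# Hypothetical joint: B is a fair bit and A copies B with probability 0.9.
joint = {(0, 0): 0.45, (1, 0): 0.05, (0, 1): 0.05, (1, 1): 0.45}
h_avg = average_min_entropy(joint)   # -log2(0.9): B makes A easy to guess
h_min = min_entropy_marginal(joint)  # 1 bit: A alone looks like a fair coin
```

The gap between the two values is the information a guesser gains from observing the correlated variable, which is the effect a sequence predictor exploits.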

Several works have considered the problem of designing machine learning predictors for evaluating RNGs. Here, we examine some of the most relevant contributions in relation to min-entropy estimation.

Truong et al. [truong2018machine] introduced the use of a recurrent convolutional neural network (RCNN) to analyze quantum random number generators (QRNGs). This RCNN model was employed to evaluate different stages of an optical continuous variable QRNG. The study focused on detecting inherent correlations, particularly under the influence of deterministic classical noise. Their methodology included a comprehensive analysis, from examining the robustness of QRNGs against machine learning attacks to benchmarking with a congruential pseudo-random number generator (CRNG). Their model’s prediction accuracy was compared with the guessing probability of the data distribution, effectively entailing a comparison with the min-entropy.

Yang et al. and Lv et al. explored neural network-based min-entropy estimation for random number generators [yang2018neural], [lv2020high]. Their approach involved training predictive models on simulated data, where the correct entropy was ascertainable due to the known output distributions. Additionally, their study included a performance analysis and comparison of their results with the NIST SP 800-90B’s predictors, providing a detailed examination of the efficacy and accuracy of their neural network-based approach in entropy estimation.

Li et al. [li2020deep] proposed a deep learning-based predictive analysis to assess the security of a non-deterministic random number generator (NRNG) using white chaos. They employed a temporal pattern attention (TPA)-based deep learning model to analyze data from both the chaotic external-cavity semiconductor laser (ECL) stage and the final output of the NRNG. The model effectively detected correlations in the ECL stage, but not in the post-processed output, suggesting the NRNG’s resistance to predictive modeling. Prior to this, the model’s predictive power was validated on a linear congruential algorithm-based RNG. The study also compared the model’s prediction accuracy with the baseline probability, aligning with Truong et al.’s approach of using the guessing probability as a comparative metric for min-entropy estimation.

Finally, Haohao Li et al. [li2023improvement] proposed a method for min-entropy evaluation using a pruned and quantized deep neural network. They developed a temporal pattern attention-based long short-term memory (TPA-LSTM) model, which they then optimized through pruning and quantization. This optimized model was retrained and tested on various simulated datasets with known min-entropy values. Their results demonstrated greater accuracy in min-entropy estimation compared to NIST SP 800-90B predictors. This study also investigated why NIST predictors often underestimate min-entropy, attributing it to the sensitivity of local predictability probability to parameter variations. This work parallels Yang et al. and Lv et al.’s in comparing neural network-based min-entropy estimations with NIST SP 800-90B’s predictors.

II-B Autoregressive Inference and Multi-Token Prediction Strategies

In autoregressive inference, various sampling strategies can be employed to generate sequences, such as greedy decoding [radford2019language], beam search [vijayakumar2016diverse, shao2017generating], top-k sampling [fan2018hierarchical, radford2019language] or top-p/nucleus sampling [holtzman2019curious], with or without temperature-based sampling techniques [ackley1985learning]. However, these techniques may not always yield the globally optimal sequence, as they rely on local decisions at each step.

Our goal is to approximate the global maximum probability for the complete sequence of $n+1$ target bits $x_{t},\ldots,x_{t+n}$, given the previous $p$ bits:

\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\ . \qquad (1)

To illustrate the potential limitations of autoregressive inference strategies, let us consider greedy decoding as an example. Greedy decoding selects the most probable bit at each step, conditioned on the previously generated bits. This can be expressed as:

$\prod_{k=t}^{t+n}\max_{x_{k}}P(x_{k}\mid x_{k-1},\ldots,x_{k-p})\Bigg\rvert_{\begin{array}{c}x_{k-i}=\arg\max_{x_{k-i}}P(x_{k-i}\mid x_{k-i-1},\ldots,x_{k-i-p}),\\ \forall i\in[1,k\leq p]\end{array}}\ .$

However, the product of the maximum conditional probabilities at each step does not necessarily equal the global maximum probability over the entire sequence. In other words, the greedy decoding approach may lead to suboptimal sequences, as it does not consider the joint probability of the complete sequence.
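A small numerical example makes the gap between greedy decoding and the global maximum explicit. The sketch below assumes a hypothetical order-1 binary chain (the transition table `T` is invented for illustration) and compares the greedy path with an exhaustive search over all three-bit continuations of the context bit 0.

```python
from itertools import product

# Hypothetical order-1 binary chain: T[prev][nxt] = P(x_k = nxt | x_{k-1} = prev).
T = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}

def sequence_prob(context, seq):
    """Probability of the bit string seq given the single context bit."""
    p, prev = 1.0, context
    for x in seq:
        p *= T[prev][x]
        prev = x
    return p

def greedy(context, length):
    """Pick the locally most probable bit at every step."""
    seq, prev = [], context
    for _ in range(length):
        nxt = max(T[prev], key=T[prev].get)
        seq.append(nxt)
        prev = nxt
    return tuple(seq), sequence_prob(context, seq)

def exhaustive(context, length):
    """Search all 2**length continuations for the most probable one."""
    best = max(product((0, 1), repeat=length),
               key=lambda s: sequence_prob(context, s))
    return best, sequence_prob(context, best)

greedy_seq, greedy_p = greedy(0, 3)    # (0, 0, 0) with probability 0.6**3
best_seq, best_p = exhaustive(0, 3)    # (1, 1, 1) with probability 0.4 * 0.9**2
```

Greedy decoding commits to the locally likelier bit 0 and never reaches the strongly self-reinforcing state 1, so its path probability (0.216) falls short of the global maximum (0.324).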

While other search methods, such as beam search, top-k sampling, or top-p/nucleus sampling, can perform better than greedy decoding, they still face the same fundamental challenge. Ultimately, the effectiveness of these methods in approximating the global maximum depends on the data and the search space. As the sequence length $n$ increases, the search space grows exponentially, making it increasingly difficult to find the globally optimal sequence efficiently.

Recently, incorporating future information into language generation tasks has gained attention. Li et al. (2017) [li2017learning] proposed an actor-critic model that integrates a value function to estimate future success, combining MLE-based learning with an RL-based value function during decoding. Oord et al. (2018) [oord2018representation] aimed to preserve mutual information between context and future tokens by modeling a density ratio, rather than directly predicting future tokens. Serdyuk et al. (2018) [serdyuk2017twin] addressed the challenge of long-term dependency learning in RNNs by running forward and backward RNNs in parallel to better capture future information. Lawrence et al. (2019) [lawrence2019attending] trained an encoder by concatenating source and target sequences and using placeholder tokens in the target sequence, which are replaced during inference to generate the final output. These advancements illustrate the growing interest in and potential for optimizing future token predictions in natural language processing tasks.

Qi et al. (2020) [qi2020prophetnet] introduce ProphetNet, a sequence-to-sequence pre-training model that employs a novel self-supervised objective called future n-gram prediction and an n-stream self-attention mechanism. Unlike traditional sequence-to-sequence models that optimize one-step-ahead prediction, ProphetNet predicts the next $n$ tokens simultaneously based on previous context tokens at each time step. This approach explicitly encourages the model to plan for future tokens and prevents overfitting on strong local correlations. The authors pre-train ProphetNet using base and large-scale datasets and demonstrate state-of-the-art results on abstractive summarization and question generation tasks.

Recent works have explored the use of multi-token prediction to improve the efficiency and performance of large language models. Gloeckle et al. propose a memory-efficient implementation and demonstrate the effectiveness of this approach on various tasks, showcasing strong performance on summarization, speeding up inference by a factor of 3, and promoting the learning of longer-term patterns [gloeckle2024better]. This method has been shown to improve sample efficiency, downstream capabilities, and inference speed, especially for larger model sizes and generative benchmarks like coding. Similarly, Stern et al. [stern2018blockwise] and Cai et al. [cai2024medusa] introduce methods that augment LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Cai et al. refine the concept introduced by Stern et al. and propose MEDUSA, which uses a tree-based attention mechanism to construct and verify multiple candidate continuations simultaneously. While all three approaches leverage multi-token prediction, Gloeckle et al. focus on the effects of such a loss during pretraining, whereas Stern et al. and Cai et al. propose fine-tuning approaches for faster inference without studying the pretraining effects [gloeckle2024better].

III Theoretical Framework

In this section, we establish the theoretical framework for our study. We focus on investigating the concept of average min-entropy and its relationship with min-entropy, particularly within the context of gbAR(p) models. The proofs and auxiliary results supporting our findings can be found in the Appendix.

III-A Entropies

The different entropies that will be considered throughout this paper are gathered in the following:

Definition 1.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be a stochastic process with discrete state-space and let $(X_{t-p},\ldots,X_{t-1},X_{t},\ldots,X_{t+n})$ be a subset of $\{X_{t}\}_{t\in\mathbb{Z}}$ where $t,p,n\in\mathbb{Z}$ and $p,n$ are non-negative. Then:

The min-entropy of $(X_{t},\ldots,X_{t+n})$ is:

$H_{\infty}(X_{t},\ldots,X_{t+n})=-\log\left[\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n})\right].$

The min-entropy per bit of $(X_{t},\ldots,X_{t+n})$ is:

h_{\infty}(X_{t},\ldots,X_{t+n})=\frac{1}{n+1}H_{\infty}(X_{t},\ldots,X_{t+n}).

The min-entropy per bit of $\{X_{t}\}_{t\in\mathbb{Z}}$ is:

$\begin{aligned} h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=-\lim_{k\to\infty}\frac{1}{k+1}\log\left[\max_{x_{t},\,|t|\leq|k/2|}P(\{x_{t}\}_{|t|\leq|k/2|})\right]\ ,\\ k\in\mathbb{Z}.\end{aligned}$

The worst-case min-entropy of $(X_{t},\ldots,X_{t+n})$ is:

	$\displaystyle H_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=-\log\left[\max_{x_{t-p},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right].$

The worst-case min-entropy per bit of $(X_{t},\ldots,X_{t+n})$ is:

	$\displaystyle h_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=\frac{1}{n+1}H_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p}).$

The average min-entropy of $(X_{t},\ldots,X_{t+n})$ is:

$\begin{aligned} \tilde{H}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=\\ -\log\left(\left\langle\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right\rangle_{x_{t-1},\ldots,x_{t-p}}\right)=\\ -\log\left[\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right].\end{aligned}$

The average min-entropy per bit of $(X_{t},\ldots,X_{t+n})$ is:

	$\displaystyle\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=\frac{1}{n+1}\tilde{H}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p}).$

Remark 2.

When determining the min-entropy of an entire binary stochastic process $\{X_{t}\}_{t\in\mathbb{Z}}$, the direct evaluation

H_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=-\log\left[\max_{x_{t},t\in\mathbb{Z}}P(\{x_{t}\}_{t\in\mathbb{Z}})\right]

need not yield a finite value. Indeed, if we write this as the limit

H_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=\lim_{k\to\infty}H_{\infty}(\{X_{t}\}_{|t|\leq|k/2|})\ ,\ k\ \text{even}

the maximum probability of a block of $k+1$ bits is bounded below by $\frac{1}{2^{k+1}}$, the value attained by uniform noise, so in that case the limit

H_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=-\lim_{k\to\infty}\log\left[\frac{1}{2^{k+1}}\right]=1+\lim_{k\to\infty}k

diverges, growing as $\sim k$ with the number of elements $k$. In contrast, the normalized limit

h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=\lim_{k\to\infty}\frac{1}{k+1}H_{\infty}(\{X_{t}\}_{|t|\leq|k/2|})

is bounded by 1 for all binary $\{X_{t}\}_{t\in\mathbb{Z}}$. For this reason we refer to it as the min-entropy per bit of the stochastic process $\{X_{t}\}_{t\in\mathbb{Z}}$.

III-B Order-p Markov Chains

Let us begin by defining order- $p$ Markov Chains:

Definition 3 (cf. [ching2006markov], [raftery1985model]).

An order- $p$ Markov Chain is a stochastic process $\{X_{t}\}_{t\in\mathbb{Z}}$ with discrete state-space $S$ such that:

P(x_{t_{0}}\mid\{x_{t}\}_{t<t_{0}})=P(x_{t_{0}}\mid x_{t_{0}-1},\ldots,x_{t_{0}-p})

for every $t_{0}\in\mathbb{Z}$ and every $x_{t}\in S$ such that $t\leq t_{0}$ .

The aim of this subsection is to define special types of Markov chains and to prove some formulas for the entropies of Definition 1 that apply to them. Although the experiments performed in this paper mostly involve Generalized Binary Autoregressive Models (see Definition 14 below), we also explore other types of Markov chains connected with them (see Figure 3). These chains share certain properties with Generalized Binary Autoregressive Models, and entropy formulas that are interesting in their own right can be derived for them with relatively little additional effort.

Definition 4.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be an order- $p$ Markov chain with state-space $S$ . Then:

i)

$\{X_{t}\}_{t\in\mathbb{Z}}$ is said to be binary if $S=\{0,1\}$ .

ii)

$\{X_{t}\}_{t\in\mathbb{Z}}$ is said to be stationary if

$P(X_{t_{1}}=x_{1},\ldots,X_{t_{n}}=x_{n})=P(X_{t_{1}+\tau}=x_{1},\ldots,X_{t_{n}+\tau}=x_{n})$

for every $\tau,t_{1},\ldots,t_{n}\in\mathbb{Z}$ , every $x_{1},\ldots,x_{n}\in S$ and every positive integer $n$ .

iii)

$\{X_{t}\}_{t\in\mathbb{Z}}$ is said to have lag-p point-to-point correlations if

P(x_{t}\mid x_{t-1},\ldots,x_{t-p})=P(x_{t}\mid x_{t-p})\text{ for every $t\in\mathbb{Z}$.}

iv)

$\{X_{t}\}_{t\in\mathbb{Z}}$ is said to be irreducible if it is stationary, $S$ is finite and for every $x,x_{1},\ldots,x_{p}\in S$ there exists a non-negative integer $k$ such that

P(X_{t+k}=x\mid X_{t-1}=x_{1},\ldots,X_{t-p}=x_{p})>0.

v)

$\{X_{t}\}_{t\in\mathbb{Z}}$ is said to be aperiodic if it is stationary, $S$ is finite and for every $x\in S$ ,

\gcd\{n\geq 1:P(X_{t+n}=x\mid X_{t}=x)>0\}=1.

Remark 5.

Note that if $\{X_{t}\}_{t\in\mathbb{Z}}$ is a stationary order- $p$ Markov chain then:

h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=\lim_{n\to\infty}\frac{1}{n+1}H_{\infty}(X_{t},\ldots,X_{t+n}).

III-B1 Some Min-Entropy Inequalities for Order-p Markov Chains

Here we establish several inequalities involving min-entropy, average min-entropy, and worst-case min-entropy.

We start by noting that for a fixed $n$ , the following inequality between min-entropy, average min-entropy, and worst-case min-entropy holds.

Lemma 6.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be an order- $p$ Markov chain. Then:

$\begin{aligned} h_{\infty}(X_{t},\ldots,X_{t+n})\geq\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})\geq\\ h_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})\ .\end{aligned}$
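The chain of inequalities can be checked numerically by brute force on a small example. The sketch below assumes a hypothetical stationary order-1 binary chain (the transition matrix `T` and its stationary distribution `PI` are illustrative choices) and evaluates the three per-bit quantities of Definition 1 by enumerating every target sequence; `num_bits` plays the role of $n+1$.

```python
import math
from itertools import product

T = [[0.8, 0.2], [0.4, 0.6]]   # hypothetical transition matrix T[prev][next]
PI = [2 / 3, 1 / 3]            # its stationary distribution (PI @ T == PI)

def cond_prob(ctx, seq):
    """P(seq | previous bit = ctx) for the order-1 chain."""
    p, prev = 1.0, ctx
    for x in seq:
        p *= T[prev][x]
        prev = x
    return p

def per_bit_entropies(num_bits):
    """Return (min-entropy, average min-entropy, worst-case min-entropy) per bit."""
    seqs = list(product((0, 1), repeat=num_bits))
    # Unconditional: marginalise the context before maximising over sequences.
    joint_max = max(sum(PI[c] * cond_prob(c, s) for c in (0, 1)) for s in seqs)
    # Average: maximise for each context, then average over contexts.
    avg_max = sum(PI[c] * max(cond_prob(c, s) for s in seqs) for c in (0, 1))
    # Worst case: maximise over contexts and sequences jointly.
    worst_max = max(cond_prob(c, s) for c in (0, 1) for s in seqs)
    return tuple(-math.log2(v) / num_bits for v in (joint_max, avg_max, worst_max))

results = {m: per_bit_entropies(m) for m in (1, 2, 4, 8)}
```

For a single target bit this gives roughly 0.585, 0.447 and 0.322 bits, in the order stated by the lemma; the ordering holds for every block length tried.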

We conclude this part of Section III-B1 with a result regarding order- $p$ Markov chains that establishes a form of monotonicity for their average min-entropy.

Lemma 7.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be an order- $p$ Markov chain. Then, for all non-negative integers $n$ and $m$:

	$\displaystyle\tilde{H}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})\leq\tilde{H}_{\infty}(X_{t},\ldots,X_{t+n+m}\mid X_{t-1},\ldots,X_{t-p}).$

This lemma establishes that the average min-entropy of an order- $p$ Markov chain is non-decreasing as the length of the future sequence, $n$ , increases. This property reflects the intuitive notion that the uncertainty about future states cannot decrease when considering longer future sequences. This result will be particularly useful later when we discuss an interesting property of Generalized Binary Autoregressive Models (see Remark 19).

III-B2 Convergence Theorem for the Min-Entropy and Average Min-Entropy of order- $p$ Markov chains

The purpose of the following results, which culminate in Theorem 9 below, is to establish conditions under which an asymptotic equivalence between the average min-entropy and the min-entropy of an order- $p$ Markov chain can be guaranteed.

Theorem 8 (Convergence Theorem [bozorgmanesh2016convergence]).

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be an irreducible and aperiodic stationary order- $p$ Markov chain with finite state-space $S$ . Then for every $x,x_{t-1},\ldots,x_{t-p}\in S$ :

$\lim_{n\to\infty}P(X_{t+n}=x\mid X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p})=P(X_{t}=x).$

Building upon this theorem, we can now establish the asymptotic equivalence between the min-entropy and the average min-entropy for order-p Markov chains satisfying certain conditions.

Theorem 9.

If $\{X_{t}\}_{t\in\mathbb{Z}}$ satisfies the hypotheses of the Convergence Theorem, i.e. $\{X_{t}\}_{t\in\mathbb{Z}}$ is an irreducible and aperiodic stationary order- $p$ Markov chain with finite state-space, then

$\begin{aligned} h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=\lim_{n\to\infty}\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=\\ \lim_{n\to\infty}h_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p}).\end{aligned}$

This theorem shows that having conditional information about the process provides no advantage asymptotically under the stated conditions, as the min-entropy and the average min-entropy converge to the same value.
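The convergence can be observed numerically. The sketch below assumes a hypothetical irreducible, aperiodic order-1 binary chain and replaces exhaustive enumeration with a max-product (Viterbi-style) recursion, so that long blocks remain tractable. For this particular chain the all-zeros string dominates for long blocks, which is why the comparison value is $-\log 0.8$; that closed form is specific to the example, not a general result.

```python
import math

T = [[0.8, 0.2], [0.4, 0.6]]   # hypothetical irreducible, aperiodic order-1 chain
PI = [2 / 3, 1 / 3]            # stationary distribution of T

def max_path_prob(init, num_bits):
    """Max over bit strings of length num_bits of init[x_1] * prod T[x_i][x_{i+1}]."""
    v = list(init)
    for _ in range(num_bits - 1):
        # v[b] becomes the best probability of any prefix ending in bit b.
        v = [max(v[a] * T[a][b] for a in (0, 1)) for b in (0, 1)]
    return max(v)

def per_bit(num_bits):
    """(min-entropy, average min-entropy, worst-case min-entropy) per bit."""
    h = -math.log2(max_path_prob(PI, num_bits))
    h_avg = -math.log2(sum(PI[c] * max_path_prob(T[c], num_bits) for c in (0, 1)))
    h_worst = -math.log2(max(max_path_prob(T[c], num_bits) for c in (0, 1)))
    return h / num_bits, h_avg / num_bits, h_worst / num_bits

table = {m: per_bit(m) for m in (1, 10, 100, 1000)}
limit = -math.log2(0.8)        # per-bit min-entropy of this particular chain
```

As the number of target bits grows, the three per-bit values in `table` approach `limit`, and the advantage from knowing the preceding bit shrinks roughly like $1/n$.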

III-B3 State-Independent Maximum Transition Probability and Bitflip Symmetric Order- $p$ Markov Chains

This section introduces two related classes of Markov chains: State-Independent Maximum Transition Probability (SIMTP) and Bitflip Symmetric Order- $p$ Markov Chains. We investigate the properties of these chains, with a particular focus on their average min-entropy behavior. Bitflip symmetric chains are of interest as they could represent a physical symmetry of the random number generator, such as the symmetry between the two polarization states of a quantum random number generator (QRNG). Additionally, the SIMTP property enables us to perform exact min-entropy calculations for the process.

Definition 10 (State-Independent Maximum Transition Probability Order- $p$ Markov Chain).

A stationary order- $p$ Markov chain with state-space $S$ is said to be a State-Independent Maximum Transition Probability (SIMTP) Markov Chain if it satisfies the following property:

	$\displaystyle\max_{x_{t}\in S}P(x_{t}\mid y_{t-1},\ldots,y_{t-p})=\max_{x_{t}\in S}P(x_{t}\mid z_{t-1},\ldots,z_{t-p})\quad\text{for every }y_{t-1},\ldots,y_{t-p},z_{t-1},\ldots,z_{t-p}\in S.$

SIMTP models are those stationary Markov chains for which the maximum transition probability is independent of the initial state sequence of length $p$ in the chain.

Proposition 11.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be an order- $p$ SIMTP model with state-space $S$ . Then, for every non-negative integer $n$ and every $x_{t-1},\ldots,x_{t-p}\in S$ :

	$\displaystyle h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=-\log\left[\max_{x_{t}}P(x_{t}\mid x_{t-1},\ldots,x_{t-p})\right].$

Hence, the min-entropy of the SIMTP process can be computed straightforwardly from its transition probability.
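As a quick sanity check of this closed form, the sketch below takes a hypothetical anticorrelated order-1 binary chain whose largest transition probability is 0.7 from either state, verifies the SIMTP property, and compares the brute-force average min-entropy per bit with $-\log 0.7$ for several block lengths. The matrix `T` and the helper names are assumptions made for illustration.

```python
import math
from itertools import product

T = [[0.3, 0.7], [0.7, 0.3]]   # hypothetical anticorrelated chain, T[prev][next]
PI = [0.5, 0.5]                # stationary distribution, by symmetry

def is_simtp(trans):
    """True if the largest transition probability is the same from every state."""
    maxima = [max(row) for row in trans]
    return all(math.isclose(m, maxima[0]) for m in maxima)

def avg_min_entropy_per_bit(num_bits):
    """Brute-force average min-entropy per bit for num_bits target bits."""
    total = 0.0
    for ctx in (0, 1):
        best = 0.0
        for seq in product((0, 1), repeat=num_bits):
            p, prev = 1.0, ctx
            for x in seq:
                p *= T[prev][x]
                prev = x
            best = max(best, p)
        total += PI[ctx] * best
    return -math.log2(total) / num_bits

closed_form = -math.log2(max(T[0]))
numeric = [avg_min_entropy_per_bit(m) for m in range(1, 7)]
```

Every entry of `numeric` agrees with `closed_form`, independently of the number of target bits, as the proposition states.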

Definition 12 (Bitflip Symmetry in Binary Order- $p$ Markov Chains).

A binary order- $p$ Markov chain exhibits Bitflip Symmetry if for all states $x_{t-p},\ldots,x_{t-1},x_{t},\ldots,x_{t+n}\in\{0,1\}$ and for all non-negative integers $n$ the following property holds:

	$\displaystyle P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})=P(1\oplus x_{t},\ldots,1\oplus x_{t+n}\mid 1\oplus x_{t-1},\ldots,1\oplus x_{t-p})$

where $\oplus$ represents the XOR operation.

Bitflip symmetric order- $p$ Markov chains are those binary order- $p$ Markov chains for which flipping the bits of all the variables in the conditional probability statement does not change the transition probability. These chains do not distinguish between 0 and 1 but still exhibit some correlation. Our interest in Bitflip symmetric order- $p$ Markov chains is due to the following:

Lemma 13.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be an order- $p$ Bitflip-Symmetric Markov Chain with lag- $p$ point-to-point correlations. Then $\{X_{t}\}_{t\in\mathbb{Z}}$ is a SIMTP order- $p$ Markov chain.

III-B4 Generalized Binary Autoregressive Models

The gbAR(p) model [jentsch2019generalized] is an autoregressive (AR) model for binary time series data. It allows the autoregressive parameters to take values in the range (-1, 1), enabling the model to capture negative autocorrelations and alternating patterns. It is a parsimonious subclass of $p$-th order Markov chains for binary data: while sacrificing some flexibility compared to a full $p$-th order Markov chain, it offers a much more compact representation.

Definition 14 (Generalized Binary Autoregressive Models [jentsch2019generalized]).

Given $t\in\mathbb{Z}$ , let $\left(a_{t}^{(1)},\ldots,a_{t}^{(p)},b_{t}\right)\sim M(1;|\alpha_{1}|,\ldots,|\alpha_{p}|,\beta)$ for some $\alpha_{1},\ldots,\alpha_{p}\in(-1,1),\beta\in(0,1]$ such that:

\sum_{i=1}^{p}|\alpha_{i}|+\beta=1

and let $e_{t}\sim B(1,\epsilon_{t})$ for some $\epsilon_{t}\in(0,1)$ . A stationary binary order-p Markov chain $\{X_{t}\}_{t\in\mathbb{Z}}$ that can be written in operator form as

$X_{t}=\sum_{i=1}^{p}\left[a_{t}^{(i)}\mathds{1}_{\{\alpha_{i}\geq 0\}}(0\oplus\cdot)+a_{t}^{(i)}\mathds{1}_{\{\alpha_{i}<0\}}(1\oplus\cdot)\right]X_{t-i}+b_{t}e_{t}$

where $\mathds{1}$ is the indicator function and $\oplus$ is the XOR gate is said to be a Generalized Binary Autoregressive or gbAR(p) model.

We will denote the array of coefficients $\alpha_{1},\ldots,\alpha_{p}$ as $\boldsymbol{\alpha}$ , and its L1 norm (i.e. the sum of the absolute values of its components) as $|\boldsymbol{\alpha}|$ .

Our experiments are performed on data generated from gbAR(p) models. The rest of this section is devoted to defining the type of gbAR(p) models we will be most interested in, to proving that they satisfy the hypothesis of the Convergence Theorem 8, and to obtaining a formula for their min-entropy that will allow us to evaluate the entropy estimates of machine learning predictors (see Proposition 26 and Proposition 16 below).

Definition 15.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be a gbAR(p) model. Then:

i)

$\{X_{t}\}_{t\in\mathbb{Z}}$ is said to be positive if $\alpha_{i}\geq 0$ for every $i\in\{1,\ldots,p\}$ .
ii)

$\{X_{t}\}_{t\in\mathbb{Z}}$ is said to be a uniform noise gbAR(p) model if $e_{t}\sim B\left(1,\frac{1}{2}\right)$ for every $t\in\mathbb{Z}$ .

Special attention will be paid to uniform noise and positive gbAR(p) models.

Proposition 16.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be a uniform noise and positive gbAR(p) model. Then

Remark 17.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be a uniform noise gbAR(p) model with point-to-point lag- $p$ correlations. Apart from having point-to-point lag- $p$ correlations, $\{X_{t}\}_{t\in\mathbb{Z}}$ is bitflip-symmetric by Lemma 25. Hence, $\{X_{t}\}_{t\in\mathbb{Z}}$ is SIMTP by Lemma 13 and therefore for every non-negative integer $n$ Proposition 11 yields

h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p}).

The argument above is illustrated in the first two plots of Figure 1, where the equivalence of the average min-entropy per bit and the min-entropy of uniform noise gbAR(p) models with point-to-point lag- $p$ correlations is observed regardless of the values $n,p,\alpha_{p}$ .
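As a minimal numerical sketch of Remark 17 (not taken from the associated code repository), the Python snippet below assumes the one-step transition probability implied by Definition 14, namely $P(X_{t}=1\mid\text{past})=\sum_{i}|\alpha_{i}|\,\tilde{x}_{t-i}+\beta\epsilon$ with $\tilde{x}_{t-i}=x_{t-i}$ if $\alpha_{i}\geq 0$ and $\tilde{x}_{t-i}=1\oplus x_{t-i}$ otherwise. The helper names `max_transition`, `is_simtp` and `simtp_min_entropy` are hypothetical. The code checks the SIMTP property of Definition 10 by enumerating all $2^{p}$ pasts and, when it holds, evaluates the expression of Proposition 11 in bits.

```python
import itertools
import math


def max_transition(past, alpha, eps=0.5):
    """max_x P(x | past) for a gbAR(p) model; past = (x_{t-1}, ..., x_{t-p})."""
    beta = 1.0 - sum(abs(a) for a in alpha)
    # Probability of emitting a 1: copy (or flip) one lagged bit, or draw noise.
    p_one = beta * eps + sum(
        abs(a) * (x if a >= 0 else 1 - x) for a, x in zip(alpha, past)
    )
    return max(p_one, 1.0 - p_one)


def is_simtp(alpha, tol=1e-12):
    """True if the maximum transition probability is the same for every past."""
    values = [
        max_transition(past, alpha)
        for past in itertools.product((0, 1), repeat=len(alpha))
    ]
    return max(values) - min(values) < tol


def simtp_min_entropy(alpha):
    """-log2 of the (state-independent) maximum transition probability."""
    return -math.log2(max_transition((0,) * len(alpha), alpha))


if __name__ == "__main__":
    point_to_point = [0.0, 0.0, 0.6]   # only the lag-p coefficient is non-zero
    uniform = [0.2, 0.2, 0.2]          # all coefficients equal
    print(is_simtp(point_to_point), simtp_min_entropy(point_to_point))
    print(is_simtp(uniform))
```

In this sketch the point-to-point vector passes the check, while the uniform vector does not, which is why the latter is the more informative case for the experiments.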

Figure 1: $|\boldsymbol{\alpha}|$ dependence of average min-entropy compared with min-entropy and min-entropy limit for several sequence lengths $n$ , correlation scales $p$ and autocorrelation functions (uniform and point-to-point). The data points have been evaluated numerically (see Subsection IV-B).

Remark 18.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be a uniform noise and positive gbAR(p) model. Since $\{X_{t}\}_{t\in\mathbb{Z}}$ is stationary by Definition 14 and it satisfies the hypothesis of the Convergence Theorem 8 by Proposition 26, it follows that

\lim_{n\to\infty}h_{\infty}(X_{t},\ldots,X_{t+n})=h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})

by Remark 5 and

\lim_{n\to\infty}\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})

by Theorem 9. Moreover,

h_{\infty}(X_{t},\ldots,X_{t+n})\geq\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})

by Lemma 6. The three (in)equations above are illustrated in Figure 2, where we can observe that both the average min-entropy per bit of $(X_{t},\ldots,X_{t+n})$ and the min-entropy per bit of $(X_{t},\ldots,X_{t+n})$ tend to the min-entropy of $\{X_{t}\}_{t\in\mathbb{Z}}$ as $n$ goes to infinity, with the former never exceeding the latter.

Remark 19.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be an order- $p$ Markov chain. It is worth noting that although the average min-entropy of $(X_{t},\ldots,X_{t+n})$ cannot decrease with $n$ by Lemma 7, the average min-entropy per bit of $(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})$ can actually decrease (see Figure 2).
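The quantities compared in Remarks 18 and 19 can be evaluated exactly for small models. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: it uses the gbAR(p) transition probability implied by Definition 14, obtains the stationary distribution of the $2^{p}$ pasts by power iteration, and enumerates all blocks of `n_bits` target bits (so `n_bits` plays the role of $n+1$). All function names are hypothetical and logarithms are taken in base 2.

```python
import itertools

import numpy as np


def p_one(past, alpha, eps=0.5):
    """P(X_t = 1 | past), with past = (x_{t-1}, ..., x_{t-p})."""
    beta = 1.0 - sum(abs(a) for a in alpha)
    return beta * eps + sum(
        abs(a) * (x if a >= 0 else 1 - x) for a, x in zip(alpha, past)
    )


def stationary(alpha, iters=2000):
    """Stationary distribution over the 2^p pasts via power iteration."""
    states = list(itertools.product((0, 1), repeat=len(alpha)))
    index = {s: i for i, s in enumerate(states)}
    T = np.zeros((len(states), len(states)))
    for s in states:
        q = p_one(s, alpha)
        T[index[s], index[(1,) + s[:-1]]] += q        # next bit is 1
        T[index[s], index[(0,) + s[:-1]]] += 1.0 - q  # next bit is 0
    pi = np.full(len(states), 1.0 / len(states))
    for _ in range(iters):
        pi = pi @ T
    return dict(zip(states, pi))


def block_prob(block, past, alpha):
    """P(block | past), obtained by chaining one-step transitions."""
    prob, state = 1.0, tuple(past)
    for x in block:
        q = p_one(state, alpha)
        prob *= q if x == 1 else 1.0 - q
        state = (x,) + state[:-1]
    return prob


def entropies_per_bit(alpha, n_bits):
    """Return (min-entropy per bit, average min-entropy per bit)."""
    pi = stationary(alpha)
    blocks = list(itertools.product((0, 1), repeat=n_bits))
    # Min-entropy: maximise the joint (past-averaged) block probability.
    joint = max(
        sum(w * block_prob(b, s, alpha) for s, w in pi.items()) for b in blocks
    )
    # Average min-entropy: average over pasts of the best conditional guess.
    avg = sum(
        w * max(block_prob(b, s, alpha) for b in blocks) for s, w in pi.items()
    )
    return -np.log2(joint) / n_bits, -np.log2(avg) / n_bits


if __name__ == "__main__":
    for k in range(1, 9):
        print(k, entropies_per_bit([0.25, 0.25], k))
```

The exhaustive enumeration scales as $2^{p}\cdot 2^{n+1}$, so this check is only practical for short blocks and low orders.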

Figure 3: Hierarchy of the main models considered in this paper. An arrow from model $M_{1}$ to model $M_{2}$ means that every model of type $M_{1}$ is of type $M_{2}$ .

IV Experimental Methodology

In this section, we outline the experimental methodology carried out, which is primarily based on code implementations. Our main goal is to validate our theoretical findings, for which we generate correlated data using gbAR(p) models (see Definition 15).

Building upon Kelsey’s predictor concept [kelsey2015predictive], we use machine learning as a tool for min-entropy estimation. Our methodology adopts the traditional machine learning approach, marked by separate training and evaluation phases. This strategy deviates from Kelsey’s model of continuous updates, which we consider a non-essential aspect of predictor concepts for min-entropy evaluation. Thus, our methodology, termed as machine learning predictors, streamlines the process by clearly separating these stages, focusing on essential predictive capabilities without the need for constant updates.

In contrast to the approach in [truong2018machine], which examines processes that fail to be random because of large periods, we focus on processes with shorter-range, bit-level correlations. Such correlations may be closer to the realistic failure modes of physical and hardware-based RNGs, in line with the use of order- $k$ and order-2 Markov chains in [kelsey2015predictive] and [zhu2017analysis], respectively. Since gbAR(p) models provide a parsimonious parameterization that allows us to control correlations and anticorrelations, they are a suitable choice for our analysis.

The generated data serves as the training set for two distinct types of neural networks, which are tasked with predicting the next target_bits bits. As highlighted in the theory section (Remark 17 and Figure 1), order-1 Markov Chains may present trivial cases where min-entropy and average min-entropy match. Therefore, it is important to analyze the behavior of our predictors in scenarios where this equivalence does not hold, forming the basis of our experimental approach.

All the experiments are conducted on an NVIDIA GeForce RTX 3090 with 24 GiB (24,576 MiB) of memory and CUDA Version 12.3.

The experimental framework is structured around four primary components: the data generation process using the gbAR(p) model, the Monte Carlo simulation for the evaluation of min-entropies, the training and evaluation of the machine learning models, specifically GPT-2 and a variation of RCNN (a model taken from [truong2018machine]), and a pipeline that integrates all data processing steps. This pipeline encompasses the generation of gbAR(p) data, its evaluation using the NIST SP 800-90B test suite, the execution of machine learning predictions, and the compilation of relevant results.

For detailed documentation on code usage, parameter explanations, and further technical details, please refer to the README.md file in the associated code repository [github_code_repo].

IV-A Data Generation

The data generation is an implementation of the gbAR(p) model (Definition 14) in the gbAR() function. The call to this function is wrapped in the function generate_gbAR_random_bytes(), which leverages different autocorrelation functions, namely point-to-point, uniform (all the components being equal), exponential, and Gaussian, to define the autocorrelation pattern through the $\boldsymbol{\alpha}$ parameter (as defined in Definition 14).

It computes binary sequences by considering the autocorrelation defined by $\boldsymbol{\alpha}$ and the (here always uniform) random noise term (weighted by $\beta$ ), sourced from high-entropy random numbers generated by OpenSSL’s rand command. Random numbers from OpenSSL are also employed in the ossl_rand_mn_rvs() function, which generates samples from a multinomial distribution as required by the gbAR(p) model. These samples are then used to construct the final binary sequence according to the autocorrelation characteristics defined by the model parameters.

The gbAR() function includes a mechanism that discards an initial segment (here with size $10^{4}$ bytes) of the generated binary sequence. The rationale behind this is to allow the sequence to reach a state of statistical stationarity, thereby minimizing initial transient effects introduced in the generation.
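The generation procedure described above can be sketched as follows. This is a simplified illustration and not the repository's gbAR() function: the name `simulate_gbar` is hypothetical, NumPy's pseudo-random generator is used in place of OpenSSL, and the burn-in is expressed in bits rather than bytes. At each step, one multinomial draw selects either a lagged bit (copied if $\alpha_{i}\geq 0$, flipped otherwise) or the noise term, as in Definition 14.

```python
import numpy as np


def simulate_gbar(alpha, n_bits, burn_in=1000, eps=0.5, seed=0):
    """Draw n_bits from a gbAR(p) model after discarding burn_in bits."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    p = len(alpha)
    beta = 1.0 - np.abs(alpha).sum()
    probs = np.append(np.abs(alpha), beta)   # (|alpha_1|, ..., |alpha_p|, beta)
    total = burn_in + n_bits

    # Pre-draw the multinomial selections and the Bernoulli(eps) noise bits.
    choice = rng.choice(p + 1, size=total, p=probs)
    noise = (rng.random(total) < eps).astype(np.uint8)

    x = np.zeros(total + p, dtype=np.uint8)
    x[:p] = rng.integers(0, 2, size=p)        # arbitrary initial past
    for t in range(total):
        i = choice[t]
        if i == p:                            # noise term selected
            x[t + p] = noise[t]
        else:                                 # lag i + 1 selected
            prev = x[t + p - 1 - i]
            x[t + p] = prev if alpha[i] >= 0 else 1 - prev
    return x[p + burn_in:]                    # drop initial past and transient


if __name__ == "__main__":
    bits = simulate_gbar([0.8], 20_000)
    print("P(X_t == X_{t-1}) ~", np.mean(bits[1:] == bits[:-1]))
```

For a gbAR(1) model with $\alpha_{1}=0.8$ and uniform noise, consecutive bits should agree with probability $0.8+0.2\cdot 0.5=0.9$, which gives a quick sanity check on the output.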

IV-B Min-Entropies Calculation

In our approach to numerically evaluate the average min-entropy and min-entropy of the gbAR(p) processes, we employ a Monte Carlo simulation. This involves creating a program that generates 100 samples, each comprising $10^{5}$ bytes. These samples are used to empirically estimate the joint frequencies. The computed frequencies form the basis for calculating both the min-entropy and the average min-entropy (using the known transition probabilities specific to the gbAR(p) processes).
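One possible reading of this procedure, for a single target bit, is sketched below in Python. It is an assumption about how the empirical frequencies and the known transition probabilities could be combined, not a description of the authors' program; the function names are hypothetical. Each $p$-bit window is encoded as an integer with the oldest bit as the most significant one, and `max_transition` must be supplied in that same ordering.

```python
import numpy as np


def past_frequencies(bits, p):
    """Relative frequency of every p-bit window, indexed by its integer code."""
    bits = np.asarray(bits, dtype=np.int64)
    windows = np.lib.stride_tricks.sliding_window_view(bits, p)
    weights = 1 << np.arange(p - 1, -1, -1)   # oldest bit is most significant
    codes = windows @ weights
    counts = np.bincount(codes, minlength=2 ** p)
    return counts / counts.sum()


def avg_min_entropy_one_bit(freqs, max_transition):
    """-log2 of the frequency-weighted mean of max_x P(x | past)."""
    return -np.log2(np.dot(freqs, max_transition))


if __name__ == "__main__":
    sample = [0, 1, 0, 1, 0, 1]
    freqs = past_frequencies(sample, p=2)
    print(freqs)
    print(avg_min_entropy_one_bit(freqs, [0.5, 0.5, 0.5, 0.5]))
```

Averaging such estimates over many independent samples, as in the Monte Carlo setup described above, reduces the statistical fluctuation of the empirical frequencies.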

IV-C Machine Learning Predictors

In this work, we use two distinct machine learning models to tackle the task of predicting binary sequences. The first model is an adaptation of the RCNN, while the second model is based on the GPT-2 architecture.

The selection of the RCNN and GPT-2 architectures is driven by our goal to explore prediction capabilities on binary sequences generated from autoregressive models with short-range correlations. The RCNN, as used by Truong et al., has proven effective in detecting correlations in quantum random number generators under deterministic classical noise influence. Its convolutional and recurrent layers are well suited to capture local patterns and short-term dependencies. In contrast, the transformer-based GPT-2 model, with its self-attention mechanism, offers a different approach. Although originally designed for natural language processing, we adapt it to our binary sequence prediction task to examine how it captures order- $p$ Markov chain characteristics. Using these two models enables us to validate the theoretical finding that machine learning predictors tend to estimate average min-entropy independently of architecture, provided they can learn from the data’s correlations.

IV-C1 Target Space Representation and Inference Strategies

As discussed in Section II-B, various approaches exist for multi-token prediction in language models. While recent works like Gloeckle et al. [gloeckle2024better], Stern et al. [stern2018blockwise], and Cai et al. [cai2024medusa] have shown promising results in natural language processing tasks, our research focuses on a different domain. We aim to explore the relationship between model predictions and the min-entropy of the data, specifically for data with different correlations, rather than natural language.

To address the limitations of autoregressive inference strategies that rely on local decisions at each step, we propose directly predicting the entire sequence of $n$ bits simultaneously. This approach allows us to obtain the global maximum probability for the complete sequence from the model, rather than relying on step-by-step decisions. By doing so, we aim to capture long-range dependencies and avoid the potential pitfalls of greedy decoding, beam search, or other methods in finding globally optimal sequences.

Our method involves using different tokenization strategies for input and target spaces:

•

Input space: We use binary tokenization where each token represents a single bit (0 or 1).
•

Output space: We employ a tokenization where each token represents $n$ bits, resulting in $2^{n}$ unique classes.

This tokenization approach allows the model to predict groups of bits as single tokens, considering the joint probability of the entire sequence. We believe this method will lead to more accurate and globally optimal predictions, as it forces the model to consider the interdependencies between bits in the sequence.
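The two tokenizations can be illustrated with a short Python sketch. It assumes non-overlapping chunks of `seqlen` bits, with the last `target_bits` bits of each chunk used as the target; the function name `make_examples` and this chunking choice are illustrative assumptions rather than the repository's data loader. Inputs stay as single-bit tokens, while each target block is mapped to one class index in $\{0,\ldots,2^{n}-1\}$.

```python
import numpy as np


def make_examples(bits, seqlen, target_bits):
    """Split a bit stream into (bit-token inputs, integer class labels)."""
    bits = np.asarray(bits, dtype=np.int64)
    n_seq = len(bits) // seqlen
    chunks = bits[: n_seq * seqlen].reshape(n_seq, seqlen)

    inputs = chunks[:, : seqlen - target_bits]     # one token per bit
    targets = chunks[:, seqlen - target_bits:]     # next target_bits bits

    # Read each target block as a big-endian integer: one of 2**target_bits classes.
    weights = 1 << np.arange(target_bits - 1, -1, -1)
    labels = targets @ weights
    return inputs, labels


if __name__ == "__main__":
    x, y = make_examples([1, 0, 1, 1, 0, 0, 1, 0], seqlen=4, target_bits=2)
    print(x.tolist(), y.tolist())
```

Because the label space grows as $2^{n}$, the output layer and the amount of data needed per class grow quickly with the number of target bits.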

IV-C2 Model Training and Evaluation Methodology

Our training dataset consists of sequences generated from the gbAR(p) model. From these sequences, we extract subsequences of length $\texttt{seqlen}-\texttt{target\_bits}$ bits. Both models are adapted to classify over $2^{n}$ classes, corresponding to all possible sequences of $n$ bits. In this context, each possible combination of target_bits is treated as a distinct class in the classification task.

We evaluate the model prediction accuracy $P_{ML}$ as

P_{ML}=\frac{n_{correct}}{n_{total}}\ .

(2)

Here, $n_{correct}$ represents the number of correct predictions made by the model, and $n_{total}$ denotes the total number of evaluations conducted. This measure of accuracy serves as a key indicator of the model’s performance and its ability to accurately predict future bits based on the training received. The estimated min-entropy will be

h_{ML}=-\log(P_{ML})\ .

(3)

We get a basic approximation of the error using the Wald approximation for the binomial proportion confidence interval, as outlined in [meeker2017statistical]. The propagation of this error yields

\Delta h_{ML}=\frac{z}{\texttt{target\_bits}\,\ln(2)}\sqrt{\frac{\frac{1}{P_{ML}}-1}{n_{\text{evals}}}}\ ,

(4)

where $n_{\text{evals}}$ is the number of evaluation sequences.
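A minimal Python sketch of this estimate is given below. It assumes base-2 logarithms and a per-bit normalization by target_bits, so that first-order propagation of the Wald half-width $z\sqrt{P_{ML}(1-P_{ML})/n_{\text{evals}}}$ through $-\log_{2}(P_{ML})/\texttt{target\_bits}$ yields a prefactor $z/(\texttt{target\_bits}\,\ln 2)$. The function name and the default $z=1.96$ (a 95% interval) are illustrative choices.

```python
import math


def ml_min_entropy(n_correct, n_total, target_bits, z=1.96):
    """Return (P_ML, min-entropy per bit, Wald-propagated uncertainty)."""
    if n_correct <= 0:
        raise ValueError("P_ML = 0 gives an unbounded entropy estimate")
    p_ml = n_correct / n_total
    # Per-bit min-entropy estimate in bits (base-2 logarithm assumed).
    h = -math.log2(p_ml) / target_bits
    # |dh/dP| = 1 / (target_bits * ln2 * P); Wald error on P is z*sqrt(P(1-P)/N).
    dh = z / (target_bits * math.log(2)) * math.sqrt((1.0 / p_ml - 1.0) / n_total)
    return p_ml, h, dh


if __name__ == "__main__":
    print(ml_min_entropy(n_correct=5_000, n_total=10_000, target_bits=1))
```

The Wald interval is known to be unreliable when $P_{ML}$ is close to 0 or 1 or when few evaluations are available, so the resulting uncertainty should be read as a rough indication only.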

In addition to accuracy, to assess the performance of the training procedure during the development phase, we have included the evaluation of the binary entropy of the predictions, $P_{e}$ , the proportion of zeros in the prediction, $P_{c}$ , and the loss.

IV-C3 RCNN Model

We use a model based on the RCNN model from the framework presented in [truong2018machine]. The original implementation combines convolutional and LSTM layers followed by fully-connected layers. Initially, input integers are converted into one-hot vectors. These vectors are then processed through convolutional layers with max-pooling to extract features, which are subsequently handled by the LSTM layer to capture temporal dependencies.

Our adaptation transitions from byte-based input processing to bit sequence handling, accommodating classification over a fixed number of $2^{n}$ classes, where $n$ is the number of target_bits. This aligns with the original design intended for classification over fixed $2^{n}$ classes. The architecture employs an output layer with $2^{n}$ neurons and softmax activation, allowing for multi-class classification. The categorical cross-entropy loss function is used for training. We use the RMSprop optimizer with a learning rate of 0.005.

Regarding the model architecture, we have slightly modified Truong’s model to increase its size, including:

•

Convolution1D layers with 32, 64, and 128 filters, kernel sizes of 12, 6, and 3 respectively, all using ’relu’ activation and ’same’ padding.
•

LSTM layers with 256 and 128 units, featuring return sequences and dropout layers with a rate of 0.2 for regularization.
•

A final Dense layer with an output size equal to target_bits, using sigmoid activation.

This model architecture results in approximately $7.6\cdot 10^{5}$ trainable parameters.

IV-C4 GPT-2 Model

The GPT-2 model, referenced in [radford2019language], is adapted from its typical use in natural language processing as provided by the Hugging Face Transformers library [huggingface_transformers]. This adaptation restructures the model for traditional classification over the possible $2^{n}$ classes for the next $n$ target_bits.

For processing the binary sequences, we implement a custom BinaryDataset class. Each data entry in this dataset consists of a binary bit sequence with a length defined by the seqlen parameter. This setup facilitates the mapping of each bit in the input sequence to the next target_bits, aligning with the classification framework.

In terms of model architecture for the adapted GPT-2, the vocabulary size is set to $2^{target\_bits}$ , aligning with the number of classes in our classification framework. The specific configuration of the model includes parameters such as n_positions=512, n_ctx=512, n_embd=768, n_layer=3, and n_head=3. This configuration leads to the GPT-2 model having $\sim 21\cdot 10^{6}$ trainable parameters.
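As a rough cross-check of the quoted model size, the Python sketch below counts the parameters of a standard GPT-2 backbone from its configuration. It assumes the usual GPT-2 block layout (fused attention projection, four-fold MLP expansion, two layer norms per block) and tied input and output embeddings; whether the adapted model adds a separate classification head is not specified here, so the figure is only indicative. The function name is hypothetical.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Parameter count of a standard GPT-2 stack with tied embeddings."""
    d = n_embd
    token_emb = vocab_size * d
    pos_emb = n_positions * d
    # Per block: attention (3d^2 + d^2), MLP (4d^2 + 4d^2) -> 12 d^2 weights,
    # plus biases and two layer norms -> 13 d.
    per_block = 12 * d * d + 13 * d
    final_ln = 2 * d
    return token_emb + pos_emb + n_layer * per_block + final_ln


if __name__ == "__main__":
    for target_bits in (1, 8, 16):
        total = gpt2_param_count(2 ** target_bits, 512, 768, 3)
        print(target_bits, f"{total / 1e6:.1f}M")
```

Under this counting the three transformer blocks account for about $21.3\cdot 10^{6}$ parameters, while the embedding term grows as $2^{\texttt{target\_bits}}\cdot\texttt{n\_embd}$ and becomes noticeable only for the largest target sizes.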

For the training phase, we use the RMSprop optimizer with a learning rate of 0.0005 and CrossEntropyLoss as the loss function.

We incorporate gradient scaling and accumulation in our training approach to enhance memory optimization and computational efficiency, especially important under constrained GPU availability.

IV-D Pipeline

We encapsulate in a pipeline the entire data processing for this work, from generating random numbers to saving results.

For each selection of input parameters, the method generates random bytes using the previously described gbAR(p) model. These bytes are saved to a file, which is later used in the data generators within the models to train and evaluate the models. We generate new gbAR(p) sequences for each run to ensure data variability and robustness.

The pipeline runs the NIST SP 800-90B entropy assessment for non-IID data in parallel with the model execution over a sample of the generated data (here $10^{7}$ bytes).

After the analysis, we compile the results, including entropy assessments, model parameters, $P_{ML}$ values, execution time, and more, into a CSV file.

V Results

Our primary objective is to investigate the relationship between the estimated min-entropy and the number of target_bits. To facilitate this analysis, we focus on low-entropy data for several reasons. Firstly, it ensures that models can effectively learn and capture underlying patterns. Secondly, it enhances the distinction between model predictions and inherent noise, allowing for more robust statistical analysis. In high-entropy scenarios (characterized by small $\alpha$ values), the entropy per bit approaches 1, resembling uniform noise (see Figure 1). This proximity to maximum entropy poses challenges in assessing model performance due to overlapping confidence intervals. These intervals often encompass both the maximum entropy value of 1 and the expected theoretical value, which is also close to 1. Consequently, a large number of evaluations would be required to reduce measurement uncertainty and achieve statistical distinguishability. To address these challenges, we employ a generalized binary autoregressive model of order $p=10$ with a uniform $\alpha$ vector and a uniform noise term $\beta=0.5$ . This configuration provides a balance between learnable patterns and stochastic noise, enabling effective extraction of the target_bits dependence while maintaining statistical significance in our results.

Both the GPT-2 and RCNN models’ training data varied based on the number of target bits to be predicted. For tasks involving 1 to 12 target bits, 20 million bytes of raw data were used, equivalent to 125,000 training sequences of 128 bits each. This increased to 30 million bytes (187,499 sequences) for 13 to 15 target bits, and further to 42 million bytes (262,499 sequences) for 16 target bits. In each case, 80% of the available data was allocated for training, with the remaining 20% reserved for evaluation.

We then compare the min-entropy estimated by these models against the theoretical calculations and against the estimates provided by the NIST SP 800-90B predictors and its overall entropy assessment. The results of these experiments are presented in Figure 4.

Finally, for illustration purposes, we demonstrate how greedy decoding fails to accurately estimate the min-entropy in data with certain types of correlations. This experiment utilized 20 million bytes of data, equivalent to 125,000 sequences of 128 bits each. Following our standard protocol, 80% (100,000 sequences) were used for training, and 20% (25,000 sequences) for evaluation. In this case, we focused on predicting 1 to 8 target bits. For comparison, we used two gbAR(2) models with alpha vectors $\frac{1}{4}[+1,+1]$ and $\frac{1}{4}[+1,-1]$ . In the case with alternating correlation signs, the global maximum probability cannot be split as the product of the local maxima, so greedy decoding leads to suboptimal predictions compared to the inference over $2^{n}$ classes. In the first case ( $\frac{1}{4}[+1,+1]$ ), the global maximum can be reached as the product of local maxima at each bit (see Remark 23), so both approaches match. This comparison illustrates how the greedy approach may lead to suboptimal predictions, as it does not consider the joint probability of the complete sequence in all cases. The results of this analysis are illustrated in Figure 5, highlighting how the greedy decoding strategy can fall short in accurately estimating min-entropy under certain correlation conditions, while performing adequately in others.
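The effect can be reproduced on the exact transition probabilities, without any trained model. The Python sketch below is an idealized illustration with hypothetical function names: it uses the gbAR(2) transition probability implied by Definition 14 with uniform noise, compares the probability of the greedily decoded block with the maximum over all $2^{n}$ blocks for a fixed past $(x_{t-1},x_{t-2})$, and resolves ties in the greedy step toward 0, which is an arbitrary choice that can matter for other pasts.

```python
import itertools


def p_one(past, alpha, eps=0.5):
    """P(X_t = 1 | past), with past = (x_{t-1}, ..., x_{t-p})."""
    beta = 1.0 - sum(abs(a) for a in alpha)
    return beta * eps + sum(
        abs(a) * (x if a >= 0 else 1 - x) for a, x in zip(alpha, past)
    )


def block_prob(block, past, alpha):
    """P(block | past), obtained by chaining one-step transitions."""
    prob, state = 1.0, tuple(past)
    for x in block:
        q = p_one(state, alpha)
        prob *= q if x == 1 else 1.0 - q
        state = (x,) + state[:-1]
    return prob


def joint_max(past, alpha, n_bits):
    """Global maximum of P(block | past) over all 2**n_bits blocks."""
    return max(
        block_prob(b, past, alpha)
        for b in itertools.product((0, 1), repeat=n_bits)
    )


def greedy_prob(past, alpha, n_bits):
    """Probability of the block built by picking the likeliest bit each step."""
    prob, state = 1.0, tuple(past)
    for _ in range(n_bits):
        q = p_one(state, alpha)
        x = 1 if q > 0.5 else 0          # ties are resolved toward 0
        prob *= q if x == 1 else 1.0 - q
        state = (x,) + state[:-1]
    return prob


if __name__ == "__main__":
    for alpha, past in (([0.25, 0.25], (0, 0)), ([0.25, -0.25], (1, 0))):
        print(alpha, past, joint_max(past, alpha, 8), greedy_prob(past, alpha, 8))
```

For the alternating-sign vector and past $(1,0)$, the best 8-bit block alternates factors of 0.75 and 0.5, whereas the greedy path settles into a run of 0.5 factors after the first few bits and therefore underestimates the maximum probability.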

VI Discussion

Our work builds upon Kelsey’s definition of predictors [kelsey2015predictive], showing that these predictors effectively estimate the average min-entropy as long as they can harness correlations and effectively model conditional, rather than joint, probabilities. This distinction becomes significant when dealing with stochastic processes with complex correlation structures.

We show that while min-entropy varies with the number of bits considered, defining min-entropy per bit for the entire process is still possible. Lemma 6 establishes that the average min-entropy per bit is always lower than or equal to the min-entropy for order-p Markov chains. Although generally distinct, in specific cases (Theorem 9), both joint min-entropy and average min-entropy converge towards a process-wide min-entropy per bit. Interestingly, different states may exhibit varied decay laws despite this common limit (Figure 2), which is operationally significant when attackers have access to correlated data.

Figure 1 illustrates the interplay between correlation ’width’ and ’length’ in how average min-entropy approaches the min-entropy limit. This aligns with Remark 17, where average min-entropy per bit equals the min-entropy of uniform noise gbAR(p) models with point-to-point lag- $p$ correlations, regardless of target_bits, $p$ , and $|\boldsymbol{\alpha}|$ values.

As we approach high entropy limits, all entropy forms converge to the process limit, consistent across various alpha levels for the considered gbAR(p) models. This agrees with Lemma 6, by which min-entropy is never lower than average min-entropy.

Figure 4 shows that our min-entropy estimations from both models align with the average min-entropy and stay within the error interval. Interestingly, the NIST Bitstring global predictors, designed to estimate the entropy of 1 target bit, generally overestimate the average min-entropy for 1 target bit, with the notable exception of the MultiMMC predictor. For 8 target_bits, the NIST global predictors, specifically MultiMCW and Lag, tend to overestimate the average min-entropy. However, LZ78Y and MultiMMC show results that are close to the theoretical calculation. It is important to note that NIST predictors are not designed to estimate entropy in the large $n$ limit. In the particular case studied, where the min-entropy continues to decay beyond $n=8$ , it is not surprising that the GPT-2 predictor provides a lower estimate in the $n=16$ run. This estimate is compatible with the theoretically expected value and, moreover, does not overlap with the gray area representing the minimum between the local and global estimates. Consequently, the GPT-2 predictor’s estimate is closer to the min-entropy of the stochastic process, making it a better and more conservative estimation.

Local predictions consistently dominate the min-entropy estimate across all cases. As a result, the overall outcome of the predictor is determined by the local estimate, since the final entropy estimate is the lesser of local and global predictions. The local predictions fall within the theoretical min-entropy for the 9-14 target_bits range and are significantly higher than the min-entropy limit of the process. The overall result of the NIST’s entropy non-IID test, which is the minimum of all tests in the suite, including both predictors and non-predictors, is closer to the min-entropy limit. In this specific scenario, predictors do not significantly contribute to the overall test.

Our analysis of greedy decoding versus direct prediction (Figure 5) reveals limitations in autoregressive inference approaches. For certain correlation structures (e.g., gbAR(2) with $\boldsymbol{\alpha}=\frac{1}{4}[+1,-1]$ ), greedy decoding fails to capture the global maximum probability. This underscores the importance of multi-token prediction for accurately estimating min-entropy in complex correlation structures. Single-token or greedy approaches may lead to suboptimal predictions and min-entropy overestimation, emphasizing the need for methods capturing joint probabilities over multiple tokens.

In conclusion, our machine learning predictors demonstrate more consistent performance compared to the NIST SP 800-90B predictors in estimating average min-entropy for both 1 and 8 target_bits in low entropy scenarios, while also providing robust estimates for larger values of $n$ . This superior performance can be attributed, in part, to the non-parametric nature of ML min-entropy estimation. Unlike traditional methods that often assume specific underlying distributions, ML approaches allow for flexible modeling, making them particularly powerful in capturing complex, non-linear dependencies in the data. This flexibility is especially valuable when dealing with stochastic processes exhibiting intricate correlation structures, as it enables the model to adapt to the data’s inherent patterns without being constrained by predetermined statistical assumptions.

High entropy scenarios present additional challenges, potentially requiring larger training runs and models. Augmenting target bits offers improved capture of long-range correlations but at a significant computational cost. As Equation 4 indicates, maintaining constant error rates with increasing target bits requires exponential growth in evaluations, as $P_{ML}\sim 2^{-\texttt{target\_bits}}$ in high entropy limits. This accuracy-computation trade-off necessitates careful balancing of long-range correlation capture and practical computational requirements. Future research should focus on developing efficient algorithms or approximation methods to handle larger target bits without prohibitive computational costs, addressing the challenges of multi-token prediction, autoregressive inference limitations, and target bit scaling in entropy estimation.
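The scaling can be made concrete with a short Python sketch. It assumes base-2 logarithms, so that the Wald-propagated uncertainty has prefactor $z/(\texttt{target\_bits}\,\ln 2)$, and solves it for the number of evaluations needed to reach a chosen $\Delta h_{ML}$. In the high-entropy limit $P_{ML}=2^{-\texttt{target\_bits}}$ the result is proportional to $(2^{\texttt{target\_bits}}-1)/\texttt{target\_bits}^{2}$. The function name and the default $z=1.96$ are illustrative.

```python
import math


def required_evals(target_bits, delta_h, p_ml=None, z=1.96):
    """Evaluations needed for a per-bit entropy uncertainty of delta_h."""
    if p_ml is None:
        p_ml = 2.0 ** (-target_bits)     # high-entropy limit
    prefactor = z / (target_bits * math.log(2) * delta_h)
    return math.ceil(prefactor ** 2 * (1.0 / p_ml - 1.0))


if __name__ == "__main__":
    for k in (8, 12, 16):
        print(k, required_evals(k, delta_h=0.01))
```

The division by target_bits squared softens the growth for small blocks, but the $2^{\texttt{target\_bits}}$ factor dominates from moderate block sizes onward.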

VII Conclusions

Our research has revealed several key insights into the estimation of min-entropy. We have shown that machine learning predictors are good at estimating average min-entropy, as long as they effectively harness correlations by estimating conditional probabilities. This becomes particularly significant in stochastic processes with complex correlation structures, where the difference between average min-entropy and min-entropy is relevant.

Our results highlight that both these entropies depend on the number of target_bits considered. Given this important role of target_bits, especially in scenarios with complex correlation structures, it may be operationally significant to include assessments of both average min-entropy and min-entropy for specific target_bits values relevant to cryptographic (or other) scenarios. Importantly, in the examples studied, we observed that as average min-entropy decays with increasing target_bits, targeting only a few bits could lead to an overestimate of the min-entropy. This finding underscores the potential risks of relying on limited-scope entropy estimates in cryptographic applications. While defining min-entropy per bit for the entire process is feasible, and the entropies studied here converge towards this limit (suggesting a lower bound or worst-case scenario), this bound has not been explicitly demonstrated in this work. Therefore, developing effective methods to estimate this limit remains essential.

We have also found that machine learning predictors can outperform the estimates of the NIST SP 800-90B predictors in some cases, making them suitable tools to be included in entropy assessment suites.

Our research leaves several avenues open for exploration that may be of particular interest in further studies:

•

Development of methods for estimating min-entropy in large target_bits scenarios.
•

Exploration of the relationship between training data min-entropy, model size, and necessary training data size for accurate min-entropy estimation. This investigation goes beyond aligning model prediction accuracy with theoretical curves; it aims to provide a deeper understanding of model learning capacity at various entropy levels. Such knowledge could inform the appropriate scaling of computational resources and potentially offer improved estimates by considering theoretical bounds on min-entropy estimation.

In conclusion, while our work has advanced the understanding of min-entropy estimation through machine learning, it also highlights the practical complexity of this method and the need for more research to address its challenges.

VIII Acknowledgments

This work is part of the R&D project TED2021-130369B-C32, funded by MCIN/AEI/10.13039/501100011033 and by the “European Union NextGenerationEU/PRTR”, and is part of the project COMPROMISE PID2020-113795RB-C32/AEI/10.13039/501100011033. In addition, it was partially supported by project i-SHAPER PRTR-INCIBE - 2023/00623/001, which is being carried out within the framework of the Recovery, Transformation, and Resilience Plan funds, funded by the European Union (Next Generation). The authors want to thank Miguel Angel Hombrados Herrera and Gonzalo Martínez Ruiz de Arcaute for their help and fruitful comments.

The purpose of this Appendix is to provide the proofs for the results stated in Section III and to include some additional auxiliary results needed to prove them.

.1 Order-p Markov Chains

We start by revisiting the results presented in Section III-B.

Proof of Lemma 6.

On the one hand, note that:

$\begin{aligned} \left\langle\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right\rangle_{x_{t-1},\ldots,x_{t-p}}=\\ \sum_{x_{t-p},\ldots,x_{t-1}}P(x_{t-p},\ldots,x_{t-1})\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\leq\\ \sum_{x_{t-p},\ldots,x_{t-1}}P(x_{t-p},\ldots,x_{t-1})\max_{x_{t-p},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})=\\ \max_{x_{t-p},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).\end{aligned}$

(5)

The inequality

$\begin{aligned} \tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})\geq h_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})\ .\end{aligned}$

follows from (5) taking logarithms, dividing by $n+1$ and changing the sign.

On the other hand, the inequality

h_{\infty}(X_{t},\ldots,X_{t+n})\geq\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}% \mid X_{t-1},\ldots,X_{t-p})

easily follows writing:

$\begin{aligned} H_{\infty}(X_{t},\ldots,X_{t+n})=\\ -\log\left[\max_{x_{t},\ldots,x_{t+n}}\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right]\end{aligned}$

and taking into account that:

$\begin{aligned} \max_{x_{t},\ldots,x_{t+n}}\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\leq\\ \sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).\end{aligned}$

∎
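The two inequalities of Lemma 6 can be checked numerically on a small example. The following sketch is only an illustration under stated assumptions, not part of the proof: it uses a hypothetical order-1 binary chain (the transition matrix `T` is chosen arbitrarily), base-2 logarithms, and brute-force enumeration of all target sequences. The names `cond_prob` and `rates` are introduced here for illustration.

```python
import itertools
import math

# Hypothetical order-1 binary chain: T[s][x] = P(X_t = x | X_{t-1} = s).
T = [[0.9, 0.1],
     [0.3, 0.7]]
# Stationary distribution of a two-state chain, solved in closed form.
pi = [T[1][0] / (T[0][1] + T[1][0]), T[0][1] / (T[0][1] + T[1][0])]

def cond_prob(seq, s):
    """P(x_t, ..., x_{t+n} | x_{t-1} = s) as a product of transitions."""
    prob = 1.0
    for x in seq:
        prob *= T[s][x]
        s = x
    return prob

def rates(n):
    """Per-bit (min-entropy, average min-entropy, worst-case conditional min-entropy)."""
    seqs = list(itertools.product((0, 1), repeat=n + 1))
    joint_max = max(sum(pi[s] * cond_prob(q, s) for s in (0, 1)) for q in seqs)
    avg_max = sum(pi[s] * max(cond_prob(q, s) for q in seqs) for s in (0, 1))
    worst_max = max(cond_prob(q, s) for q in seqs for s in (0, 1))
    return tuple(-math.log2(v) / (n + 1) for v in (joint_max, avg_max, worst_max))

for n in range(5):
    h, h_avg, h_worst = rates(n)
    print(n, round(h, 4), round(h_avg, 4), round(h_worst, 4))
```

For every `n` the three printed values should appear in non-increasing order, matching the chain of inequalities in Lemma 6.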

Proof of Lemma 7.

We have that:

$\begin{aligned} P(x_{t},\ldots,x_{t+n+m}\mid x_{t-1},\ldots,x_{t-p})=\\ P(x_{t+n+1},\ldots,x_{t+n+m}\mid x_{t+n},\ldots,x_{t-p})P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\leq\\ P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).\end{aligned}$

Since this inequality is independent of the realizations

$x_{t+n+1},\ldots,x_{t+n+m}$

it follows that:

	$\displaystyle\max_{x_{t},\ldots,x_{t+n+m}}P(x_{t},\ldots,x_{t+n+m}\mid x_{t-1},\ldots,x_{t-p})\leq$
	$\displaystyle\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).$

Since logarithms are monotonically increasing functions, we conclude that:

$\begin{aligned} \tilde{H}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=\\ -\log\left[\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right]\leq\\ -\log\left[\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\max_{x_{t},\ldots,x_{t+n+m}}P(x_{t},\ldots,x_{t+n+m}\mid x_{t-1},\ldots,x_{t-p})\right]=\\ \tilde{H}_{\infty}(X_{t},\ldots,X_{t+n+m}\mid X_{t-1},\ldots,X_{t-p}).\end{aligned}$

∎
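Lemma 7 states that the total average min-entropy cannot decrease when more target bits are added. A minimal numerical sketch, assuming a hypothetical order-1 binary chain with arbitrarily chosen transition matrix `T` and base-2 logarithms (the function name `avg_min_entropy` is illustrative):

```python
import itertools
import math

# Hypothetical order-1 binary chain: T[s][x] = P(X_t = x | X_{t-1} = s).
T = [[0.9, 0.1],
     [0.3, 0.7]]
pi = [0.75, 0.25]  # stationary distribution of this particular T

def avg_min_entropy(n):
    """Total average min-entropy of n + 1 target bits given the previous bit."""
    total = 0.0
    for s in (0, 1):
        best = 0.0
        # Brute-force search for the most likely continuation from state s.
        for seq in itertools.product((0, 1), repeat=n + 1):
            prob, state = 1.0, s
            for x in seq:
                prob *= T[state][x]
                state = x
            best = max(best, prob)
        total += pi[s] * best
    return -math.log2(total)

values = [avg_min_entropy(n) for n in range(6)]
print([round(v, 4) for v in values])
```

The printed list should be non-decreasing, as the lemma requires; note that this concerns the total quantity $\tilde{H}_{\infty}$, not the per-bit rate $\tilde{h}_{\infty}$.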

Proof of Theorem 8.

Since this result is a restatement of [bozorgmanesh2016convergence, Theorem 8], we believe some explanation is necessary. First of all, what [bozorgmanesh2016convergence, Theorem 8] states is that for every $x,x_{t-1},\ldots,x_{t-p}\in S$ :

$\lim_{n\to\infty}P(X_{t+n}=x\mid X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p})=\pi(x)$

where $\pi$ is a stationary distribution whose existence is required as a hypothesis. In our case, the existence of a stationary distribution $\pi$ follows from [bozorgmanesh2016convergence, Theorem 7] because we are assuming that $\{X_{t}\}_{t\in\mathbb{Z}}$ is irreducible. Having said that, taking the stationarity into account:

$\begin{aligned} P(X_{t}=x)=\lim_{n\to\infty}P(X_{t}=x)=\lim_{n\to\infty}P(X_{t+n}=x)=\\ \lim_{n\to\infty}\sum_{x_{t-1},\ldots,x_{t-p}}P(X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p})P(X_{t+n}=x\mid X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p})=\\ \sum_{x_{t-1},\ldots,x_{t-p}}P(X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p})\lim_{n\to\infty}P(X_{t+n}=x\mid X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p})=\\ \sum_{x_{t-1},\ldots,x_{t-p}}P(X_{t-1}=x_{t-1},\ldots,X_{t-p}=x_{t-p})\pi(x)=\pi(x)\end{aligned}$

and our restatement of [bozorgmanesh2016convergence, Theorem 8] follows. ∎
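The convergence described by the Convergence Theorem can be observed numerically. The sketch below assumes a hypothetical irreducible and aperiodic order-1 binary chain (the matrix `T` is arbitrary, and `step` and `n_step` are illustrative names); it propagates a point-mass starting distribution and compares the result with the stationary distribution.

```python
# Hypothetical order-1 binary chain: T[s][x] = P(X_t = x | X_{t-1} = s).
T = [[0.9, 0.1],
     [0.3, 0.7]]
pi = [0.75, 0.25]  # stationary distribution of this particular T

def step(dist):
    """Apply one transition to a distribution over the two states."""
    return [sum(dist[s] * T[s][x] for s in (0, 1)) for x in (0, 1)]

def n_step(start, steps):
    """Distribution of the bit `steps` transitions after conditioning on `start`."""
    dist = [1.0 if s == start else 0.0 for s in (0, 1)]
    for _ in range(steps):
        dist = step(dist)
    return dist

for steps in (1, 5, 20, 60):
    print(steps, n_step(0, steps), n_step(1, steps))
```

Both starting states should approach `pi` as `steps` grows, which is the sense in which the conditioning on the past is forgotten.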

Proof of Theorem 9.

On the one hand, denoting $\tau=\left\lfloor\sqrt{n}\right\rfloor$ , we have that:

$\begin{aligned} P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})=\frac{P(x_{t-p},\ldots,x_{t+n})}{P(x_{t-1},\ldots,x_{t-p})}\leq\\ \frac{P(x_{t-p},\ldots,x_{t-1},x_{t+\tau},\ldots,x_{t+n})}{P(x_{t-1},\ldots,x_{t-p})}=P(x_{t+\tau},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).\end{aligned}$

(6)

Therefore:

$\begin{aligned} \sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\leq\\ \sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\max_{x_{t+\tau},\ldots,x_{t+n}}P(x_{t+\tau},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).\end{aligned}$

In particular,

$\begin{aligned} -\frac{1}{n+1}\log\left(\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right)\geq\\ -\frac{1}{n+1}\log\left(\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\max_{x_{t+\tau},\ldots,x_{t+n}}P(x_{t+\tau},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right).\end{aligned}$

(7)

Now, the left-hand side of the inequality (7) is

$\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})$

and the limit of the right-hand side of the inequality (7) as $n$ goes to infinity is $h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})$ because of the Convergence Theorem and the stationarity. Using Lemma 6 we conclude that:

$\begin{aligned} h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})\geq\lim_{n\to\infty}\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})\geq h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})\end{aligned}$

and the equality

$h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=\lim_{n\to\infty}\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})$

follows. On the other hand:

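$\begin{aligned} h_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=-\frac{1}{n+1}\log\left(\max_{x_{t-p},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right)\geq\\ -\frac{1}{n+1}\log\left(\max_{x_{t-p},\ldots,x_{t-1},x_{t+\tau},\ldots,x_{t+n}}P(x_{t+\tau},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right)\xrightarrow[n\to\infty]{}h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}}).\end{aligned}$

(8)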

Note that the inequality of (8) holds because of (6) and the limit of (8) holds because of the Convergence Theorem and the stationarity. Using Lemma 6 we conclude that:

$\begin{aligned} h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})\geq\lim_{n\to\infty}h_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})\geq h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})\end{aligned}$

and the result follows. ∎

.2 State-Independent Maximum Transition Probability and Bitflip Symmetric Order- $p$ Markov Chains

The following results are targeted at proving that the average min-entropy of SIMTP models coincides with their min-entropy, as claimed in Proposition 11.

Lemma 20.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be a SIMTP order- $p$ Markov chain. Then:

	$\displaystyle\left\langle\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right\rangle_{x_{t-1},\ldots,x_{t-p}}=$
	$\displaystyle\max_{x_{t-p},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\ .$

Proof.

Let us write:

	$\displaystyle P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})=$
	$\displaystyle\prod_{i=0}^{n}P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p}).$		(9)

Since, by the SIMTP property, $\max_{x_{t+i}}P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p})$ is independent of $x_{t+i-1},\ldots,x_{t+i-p}$ for every $i\in\{0,\ldots,n\}$ , each factor in (9) can be maximized in turn regardless of the preceding state, and it follows that:

	$\displaystyle\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})=$
	$\displaystyle\prod_{i=0}^{n}\max_{x_{t+i}}P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p}).$		(10)

Using (10), the independence of $\max_{x_{t+i}}P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p})$ with respect to $x_{t+i-1},\ldots,x_{t+i-p}$ for every $i\in\{0,\ldots,n\}$ and the fact that the probabilities of all the outcomes within a sample space sum to $1$ , we get:

$\begin{aligned} \left\langle\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right\rangle_{x_{t-1},\ldots,x_{t-p}}=\\ \sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})=\\ \sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\prod_{i=0}^{n}\max_{x_{t+i}}P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p})=\\ \prod_{i=0}^{n}\max_{x_{t+i}}P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p})=\\ \max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})=\\ \max_{x_{t-p},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p}).\end{aligned}$

∎

Proposition 21.

Any order- $p$ SIMTP satisfies the following decomposition:

$\begin{aligned} H_{\infty}(X_{t},\ldots,X_{t+n})=\\ H_{\infty}(X_{t},\ldots,X_{t+p-1})+H_{\infty}(X_{t+p},\ldots,X_{t+n}\mid X_{t+p-1},\ldots,X_{t})\end{aligned}$

Proof.

Let us write:

$\begin{aligned} H_{\infty}(X_{t},\ldots,X_{t+n})=-\log\left[\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n})\right]=\\ -\log\left[\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+p-1})P(x_{t+p},\ldots,x_{t+n}\mid x_{t+p-1},\ldots,x_{t})\right].\end{aligned}$

If the process is SIMTP, then the maximum of $P(x_{t+p},\ldots,x_{t+n}\mid x_{t+p-1},\ldots,x_{t})$ can be reached independently of the values that maximize $P(x_{t},\ldots,x_{t+p-1})$ . In other words, we can maximize both probabilities independently:

$\begin{aligned} H_{\infty}(X_{t},\ldots,X_{t+n})=\\ -\log\left[\max_{x_{t},\ldots,x_{t+p-1}}P(x_{t},\ldots,x_{t+p-1})\right]-\log\left[\max_{x_{t},\ldots,x_{t+n}}P(x_{t+p},\ldots,x_{t+n}\mid x_{t+p-1},\ldots,x_{t})\right]=\\ H_{\infty}(X_{t},\ldots,X_{t+p-1})+H_{\infty}(X_{t+p},\ldots,X_{t+n}\mid X_{t+p-1},\ldots,X_{t}).\end{aligned}$

∎

Proof of Proposition 11.

On the one hand, by Proposition 21 we have:

$\begin{aligned} H_{\infty}(X_{t},\ldots,X_{t+n})=\\ H_{\infty}(X_{t},\ldots,X_{t+p-1})+H_{\infty}(X_{t+p},\ldots,X_{t+n}\mid X_{t+p-1},\ldots,X_{t})=\\ H_{\infty}(X_{t},\ldots,X_{t+p-1})-\log\left[\left(\max_{x_{t}}P(x_{t}\mid x_{t-1},\ldots,x_{t-p})\right)^{n-p+1}\right].\end{aligned}$

Then, for finite $p$ , as the first term is bounded and the second term contains one factor for each of the $n-p+1$ variables $X_{t+p},\ldots,X_{t+n}$ :

	$\displaystyle h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=\lim_{n\to\infty}\frac{1}{n+1}H_{\infty}(X_{t},\ldots,X_{t+n})=$
	$\displaystyle-\log\left[\max_{x_{t}}P(x_{t}\mid x_{t-1},\ldots,x_{t-p})\right].$		(11)

On the other hand:

$\begin{aligned} \tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=\\ -\log\left[\left(\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\max_{x_{t},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right)^{\frac{1}{n+1}}\right]=\\ -\log\left[\left(\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\left(\max_{x_{t}}P(x_{t}\mid x_{t-1},\ldots,x_{t-p})\right)^{n+1}\right)^{\frac{1}{n+1}}\right]=\\ -\log\left[\left(\sum_{x_{t-1},\ldots,x_{t-p}}P(x_{t-1},\ldots,x_{t-p})\right)^{\frac{1}{n+1}}\max_{x_{t}}P(x_{t}\mid x_{t-1},\ldots,x_{t-p})\right]=\\ -\log\left[\max_{x_{t}}P(x_{t}\mid x_{t-1},\ldots,x_{t-p})\right].\end{aligned}$

(12)

The result follows putting (11) and (12) together.

∎
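As a sanity check of Proposition 11, the per-bit quantities can be computed by brute force for a small SIMTP chain. The sketch assumes a hypothetical order-1 binary chain whose maximum transition probability is `M = 0.8` in both states, together with base-2 logarithms; `cond_prob` and `per_bit_rates` are illustrative names.

```python
import itertools
import math

M = 0.8  # hypothetical state-independent maximum transition probability
T = [[M, 1 - M],
     [1 - M, M]]      # T[s][x] = P(X_t = x | X_{t-1} = s)
pi = [0.5, 0.5]       # stationary distribution of this symmetric chain

def cond_prob(seq, s):
    """P(x_t, ..., x_{t+n} | x_{t-1} = s) as a product of transitions."""
    prob = 1.0
    for x in seq:
        prob *= T[s][x]
        s = x
    return prob

def per_bit_rates(n):
    """Per-bit average min-entropy and per-bit min-entropy of n + 1 target bits."""
    seqs = list(itertools.product((0, 1), repeat=n + 1))
    avg_max = sum(pi[s] * max(cond_prob(q, s) for q in seqs) for s in (0, 1))
    joint_max = max(sum(pi[s] * cond_prob(q, s) for s in (0, 1)) for q in seqs)
    return -math.log2(avg_max) / (n + 1), -math.log2(joint_max) / (n + 1)

target = -math.log2(M)
for n in (0, 1, 4, 8):
    avg_rate, joint_rate = per_bit_rates(n)
    print(n, round(avg_rate, 4), round(joint_rate, 4), round(target, 4))
```

In this example the per-bit average min-entropy equals $-\log_2 M$ for every `n`, while the per-bit min-entropy of the unconditioned block only approaches that value as `n` grows, because of the bounded first term in the decomposition of Proposition 21.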

We end this subsection of the Appendix by proving that Bitflip-Symmetric Markov Chains with lag-p point-to-point correlations are SIMTP, as stated in Lemma 13.

Proof of Lemma 13.

Let $x_{t-p}\in\{0,1\}$ . Then, by the definition of bitflip symmetry:

$\begin{aligned} \max_{x_{t}}P(x_{t}\mid x_{t-p})=\max\{P(X_{t}=0\mid x_{t-p}),P(X_{t}=1\mid x_{t-p})\}=\\ \max\{P(X_{t}=1\mid 1\oplus x_{t-p}),P(X_{t}=0\mid 1\oplus x_{t-p})\}=\\ \max_{x_{t}}P(x_{t}\mid 1\oplus x_{t-p})\end{aligned}$

and the result follows. ∎

.3 Generalized Binary Autoregressive Models

The last subsection of the Appendix gathers some useful properties of gbAR(p) models. In particular, these properties allow us to prove the formula for the min-entropy of uniform noise and positive gbAR(p) models stated in Proposition 16.

Remark 22.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be a uniform noise gbAR(p) model. Then $P(X_{t}=0)=P(X_{t}=1)=\frac{1}{2}$ for every $t\in\mathbb{Z}$ by [jentsch2019generalized, Lemma 1].

Remark 23.

By [jentsch2019generalized, Lemma 1] the transition probabilities of a gbAR(p) model $\{X_{t}\}_{t\in\mathbb{Z}}$ can be written as:

$\begin{aligned} P(x_{t}\mid x_{t-1},\ldots,x_{t-p})=\\ \sum_{i=1}^{p}[\mathds{1}_{\{\alpha_{i}\geq 0\}}|\alpha_{i}|\cdot(1\oplus x_{t}\oplus x_{t-i})+\mathds{1}_{\{\alpha_{i}<0\}}|\alpha_{i}|\cdot(x_{t}\oplus x_{t-i})]+\beta\cdot P(e_{t}=x_{t})\ .\end{aligned}$

If $\{X_{t}\}_{t\in\mathbb{Z}}$ is a uniform noise and positive gbAR(p) model, this reads

$\displaystyle P(x_{t}\mid x_{t-1},\ldots,x_{t-p})=\sum_{i=1}^{p}\alpha_{i}\cdot(1\oplus x_{t}\oplus x_{t-i})+\frac{\beta}{2}$

and this conditional probability reaches the maximum value

$\max_{x_{t-1},\ldots,x_{t-p}}P(x_{t}\mid x_{t-1},\ldots,x_{t-p})=\sum_{i=1}^{p}\alpha_{i}+\frac{\beta}{2}=1-\frac{\beta}{2}$

when $x_{t}=x_{t-1}=\cdots=x_{t-p}$. In particular, a realization $x_{t-p},\ldots,x_{t+n}$ of $X_{t-p},\ldots,X_{t+n}$ such that

$x_{t-p}=\cdots=x_{t+n}$

maximizes the conditional probability $P(x_{k}\mid x_{k-1},\ldots,x_{k-p})$ for every $k\in\{t,\ldots,t+n\}$ and it follows that:

	$\displaystyle\max_{x_{t-p},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})=$
	$\displaystyle\max_{x_{t-p},\ldots,x_{t+n}}\prod_{k=t}^{t+n}P(x_{k}\mid x_{k-1},\ldots,x_{k-p})=$
	$\displaystyle\prod_{k=t}^{t+n}\max_{x_{k},\ldots,x_{k-p}}P(x_{k}\mid x_{k-1},\ldots,x_{k-p})=\left(1-\frac{\beta}{2}\right)^{n+1}.$
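The closed form at the end of Remark 23 can be cross-checked by exhaustive search. The sketch below assumes a hypothetical uniform noise and positive gbAR(2) model with illustrative coefficients `alphas = [0.3, 0.2]` and `beta = 1 - sum(alphas)`; the function names are introduced here only for the example.

```python
import itertools

alphas = [0.3, 0.2]        # hypothetical positive coefficients alpha_1, alpha_2
beta = 1 - sum(alphas)     # noise weight, so that the coefficients sum to 1

def transition(x, past):
    """P(x_t = x | past), where past[i] holds x_{t-1-i}; uniform noise assumed."""
    return sum(a * (1 ^ x ^ past[i]) for i, a in enumerate(alphas)) + beta / 2

def max_block_prob(n):
    """Brute-force maximum of P(x_t..x_{t+n} | past) over targets and pasts."""
    best = 0.0
    for past in itertools.product((0, 1), repeat=len(alphas)):
        for seq in itertools.product((0, 1), repeat=n + 1):
            window, prob = list(past), 1.0
            for x in seq:
                prob *= transition(x, window)
                window = [x] + window[:-1]  # slide the conditioning window
            best = max(best, prob)
    return best

for n in range(4):
    print(n, max_block_prob(n), (1 - beta / 2) ** (n + 1))
```

The two printed columns should agree up to floating-point error, and the maximizing realizations are the constant sequences, as the remark indicates.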

Remark 24.

Note that a gbAR(p) model has point-to-point lag-p correlations if $a_{t}^{(i)}=\delta_{i,p}$ for every $i\in\{1,\ldots,p\}$ and every $t\in\mathbb{Z}$ . This is achieved when $\alpha_{i}=0$ for every $i\in\{1,\ldots,p-1\}$ .

Lemma 25.

A uniform noise gbAR(p) model with point-to-point lag-p correlations is bitflip-symmetric.

Proof.

By the definition of conditional probability and the definition of gbAR(p) model with point-to-point lag-p correlations:

$P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})=\prod_{i=0}^{n}P(x_{t+i}\mid x_{t+i-p}).$

(13)

Then, by Remark 22 and Remark 23:

	$\displaystyle P(x_{t+i}\mid x_{t+i-p})=$
	$\displaystyle[\mathds{1}_{\{\alpha_{p}\geq 0\}}|\alpha_{p}|\cdot(1\oplus x_{t+i}\oplus x_{t+i-p})+$
	$\displaystyle\mathds{1}_{\{\alpha_{p}<0\}}|\alpha_{p}|\cdot(x_{t+i}\oplus x_{t+i-p})]+\frac{\beta}{2}=$
	$\displaystyle[\mathds{1}_{\{\alpha_{p}\geq 0\}}|\alpha_{p}|\cdot(1\oplus 1\oplus x_{t+i}\oplus 1\oplus x_{t+i-p})+$
	$\displaystyle\mathds{1}_{\{\alpha_{p}<0\}}|\alpha_{p}|\cdot(1\oplus x_{t+i}\oplus 1\oplus x_{t+i-p})]+\frac{\beta}{2}=$
	$\displaystyle P(1\oplus x_{t+i}\mid 1\oplus x_{t+i-p}).$		(14)

Putting (13) and (14) together:

	$\displaystyle P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})=$
	$\displaystyle\prod_{i=0}^{n}P(x_{t+i}\mid x_{t+i-p})=\prod_{i=0}^{n}P(1\oplus x_{t+i}\mid 1\oplus x_{t+i-p})=$
	$\displaystyle P(1\oplus x_{t},\ldots,1\oplus x_{t+n}\mid 1\oplus x_{t-1},\ldots,1\oplus x_{t-p}).$

∎
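The symmetry established in (14) is easy to verify by enumeration. The sketch below assumes uniform noise and the normalization $\beta=1-|\alpha_{p}|$ (an assumption made for this example so that the two transition probabilities sum to one); the function name `lag_p_transition` and the coefficient values are hypothetical.

```python
def lag_p_transition(x, x_lag, alpha_p):
    """P(x_t = x | x_{t-p} = x_lag) for a lag-p gbAR model with uniform noise."""
    beta = 1 - abs(alpha_p)          # assumed normalization of the noise weight
    differ = x ^ x_lag               # 1 when the two bits differ
    # Positive alpha_p rewards agreement, negative alpha_p rewards disagreement.
    weight = (1 ^ differ) if alpha_p >= 0 else differ
    return abs(alpha_p) * weight + beta / 2

for alpha_p in (0.4, -0.4):
    for x in (0, 1):
        for x_lag in (0, 1):
            original = lag_p_transition(x, x_lag, alpha_p)
            flipped = lag_p_transition(1 ^ x, 1 ^ x_lag, alpha_p)
            print(alpha_p, x, x_lag, original, flipped)
```

Each row should show the same probability before and after flipping both bits, for positive and negative `alpha_p` alike.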

Proposition 26.

Let $\{X_{t}\}_{t\in\mathbb{Z}}$ be a uniform noise and positive gbAR(p) model. Then $\{X_{t}\}_{t\in\mathbb{Z}}$ satisfies the hypotheses of the Convergence Theorem, i.e. $\{X_{t}\}_{t\in\mathbb{Z}}$ is an irreducible and aperiodic stationary order- $p$ Markov chain with finite state-space.

Proof.

$\{X_{t}\}_{t\in\mathbb{Z}}$ is a stationary order- $p$ Markov chain with finite state-space $S=\{0,1\}$ by Definition 14. Now, $\{X_{t}\}_{t\in\mathbb{Z}}$ is irreducible because for every $x_{t-p},\ldots,x_{t}\in S$ we have:

$\displaystyle P(x_{t}\mid x_{t-1},\ldots,x_{t-p})=\sum_{i=1}^{p}\alpha_{i}\cdot(1\oplus x_{t}\oplus x_{t-i})+\frac{\beta}{2}>0.$

(15)

Finally, since

$\sum_{x_{t-2},\ldots,x_{t-p}}P(x_{t-2},\ldots,x_{t-p})=1,$

there exist $y_{t-2},\ldots,y_{t-p}\in S$ such that $P(y_{t-2},\ldots,y_{t-p})>0$ . Then, by equation (15):

$\begin{aligned} P(x_{t}\mid x_{t-1})=\sum_{x_{t-2},\ldots,x_{t-p}}P(x_{t-2},\ldots,x_{t-p})P(x_{t}\mid x_{t-1},x_{t-2},\ldots,x_{t-p})\geq\\ P(y_{t-2},\ldots,y_{t-p})P(x_{t}\mid x_{t-1},y_{t-2},\ldots,y_{t-p})>0\end{aligned}$

and we conclude that $\{X_{t}\}_{t\in\mathbb{Z}}$ is aperiodic. ∎

Proof of Proposition 16.

Since $\{X_{t}\}_{t\in\mathbb{Z}}$ satisfies the hypotheses of the Convergence Theorem (Theorem 8) by Proposition 26, the equalities

$h_{\infty}(\{X_{t}\}_{t\in\mathbb{Z}})=\lim_{n\to\infty}\tilde{h}_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=\lim_{n\to\infty}h_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})$

follow from Theorem 9. Now, by Remark 23,

$\begin{aligned} h_{\infty}(X_{t},\ldots,X_{t+n}\mid X_{t-1},\ldots,X_{t-p})=\\ -\frac{1}{n+1}\log\left[\max_{x_{t-p},\ldots,x_{t+n}}P(x_{t},\ldots,x_{t+n}\mid x_{t-1},\ldots,x_{t-p})\right]=\\ -\frac{1}{n+1}\log\left[\max_{x_{t-p},\ldots,x_{t+n}}\prod_{i=0}^{n}P(x_{t+i}\mid x_{t+i-1},\ldots,x_{t+i-p})\right]=\\ -\frac{1}{n+1}\log\left[\prod_{i=0}^{n}\left(\sum_{j=1}^{p}\alpha_{j}+\frac{\beta}{2}\right)\right]=-\frac{1}{n+1}\log\left[\left(\sum_{j=1}^{p}\alpha_{j}+\frac{\beta}{2}\right)^{n+1}\right]=\\ -\frac{1}{n+1}\log\left[\left(1-\frac{\beta}{2}\right)^{n+1}\right]=-\log\left(1-\frac{\beta}{2}\right).\end{aligned}$

∎
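The closed form just derived is straightforward to evaluate. The helper below is a minimal sketch assuming base-2 logarithms, so that the result is expressed in bits per output bit, and assuming $\beta\in(0,1]$; the function name `gbar_min_entropy_rate` is hypothetical.

```python
import math

def gbar_min_entropy_rate(beta):
    """Per-bit min-entropy -log2(1 - beta/2) of a uniform noise, positive gbAR(p) model."""
    if not 0 < beta <= 1:
        raise ValueError("beta is assumed to lie in (0, 1]")
    return -math.log2(1 - beta / 2)

for beta in (0.1, 0.5, 1.0):
    print(beta, round(gbar_min_entropy_rate(beta), 4))
```

With `beta = 1.0` all autoregressive coefficients vanish and the model reduces to independent fair bits, so the rate is exactly one bit; smaller values of `beta` mean stronger correlations and a lower rate.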

