HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: filecontents
  • failed: contour
  • failed: nicematrix

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: CC BY 4.0
arXiv:2401.12250v1 [quant-ph] 20 Jan 2024

Does True Randomness Exist? - Efficacy Testing IBM Quantum Computers via Statistical Randomness

Owen Root11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT [email protected]    Maria Becker11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT 11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPTDepartment of Physics, Nebraska Wesleyan University, Lincoln, NE 68504, USA
(January 20, 2024)
Abstract

The fundamental principles of quantum mechanics, such as its probabilistic nature, allow for the theoretical ability of quantum computers to generate statistically random numbers, as opposed to classical computers which are only able to generate pseudo-random numbers. This ability of quantum computers has a variety of applications, one of which provides the basis for a method of efficacy testing Quantum Computers themselves. We introduce this testing method and utilize it to investigate the efficacy of nine IBM Quantum Computer systems. The testing method utilized four different quantum random number generator algorithms and a battery of eighteen statistical tests. Only a single quantum computer-algorithm combination was found to be statistically random, demonstrating the power of the testing method as well as indicating that further work is needed for these computers to reach their theoretical potential.

I Introduction

Quantum Computers (QCs) exploit the principles of quantum mechanics to enable the existence of quantum bit (qubit) based computers which possess advantages over transistor bit-based Classical Computers (CCs). These quantum mechanical properties give QCs the ability to perform certain tasks on time scales that would be impossible for CCs to accomplish in a similar time. A famous example of this is Shor’s Algorithm, which lets QCs factor numbers in polynomial time, something that Classical computers can only do in exponential time shor's . Recently, Shor’s algorithm has been experimentally realized google_shor . This particular development of QCs has drawn attention to the field for its relevance to cryptography and cybersecurity. A primary form of computer security, RSA (Rivest–Shamir–Adleman) encryption, relies on the inability of CCs to factor large numbers quickly. Shor’s Algorithm could allow a QC to factor the large numbers necessary to break RSA encryption quantum_curious . For this reason and others, many entities, both private and public, are actively researching and develo** QCs. Additionally, QCs have become accessible to the general public and third-party researchers through applications such as IBM Quantum and its Quantum Educators programIBM_Quantum ; quantum_educators .

(a)

Refer to caption

(b)

Refer to caption

(c)

Refer to caption

(d)

Refer to caption
Figure 1: The QRNG algorithms used in this work in standard quantum circuit notation. Blue circles with ’+’ are NOT gates. (a) the Basic QRNG, (b) QRNG Type 1, (c) QRNG Type 2, (d) and QRNG Type 3. Created using IBM_Quantum .

With such widespread development of QCs, methods must be developed to assess the functional capabilities of QCs to ensure that their performance matches expectations. Here we present such a method, which operates by testing a QC’s ability to act as a Quantum Random Number Generator (QRNG).

QRNGs stand in contrast to the Random Number Generators (RNGs) operated by CCs, as, due to CC’s deterministic nature, RNGs are only pseudo-random. RNGs have a significant role in classical computing, where they are used in various applications, such as cryptography, gambling, statistical sampling, computer simulation, and computer-based games RNG_uses . This is another area in which QCs are theoretically superior to CCs, as the probabilistic nature of quantum mechanics allows for a QC to operate as a Quantum Random Number Generator (QRNG). A QRNG would produce truly random numbers.

QRNGs have been widely investigated and discussed in scientific literature. A comprehensive review and classification of QRNGs was recently performed in QRNG_review . Much work has been done to investigate the efficacy of optical-based QRNG apparatus and found promising results, such as in Bayesian_Criteria ; optical_QRNG_2 . Much less work has been done to explore QRNGs in more complex systems, such as existing large-scale quantum computers.

In this work, We discuss the development of a QC efficacy testing method based on QRNGs and the application of the method to several of IBM’s Quantum Computers. The rest of the paper will be structured as follows. In Section II, We discuss the theory behind the testing method, including the QRNG algorithms and statistical tests utilized. In Section III, We discuss the interpretation of the statistical tests and report their results. In Section IV, we discuss the results of the testing method and their implications.

II Theory and application of the Method

II.1 The Basic QRNG

Many different QRNGs may be conceived, but the simplest one requires only a single qubit. At the start of the algorithm, the qubit is initialized in either the 1 or the 0 states (IBM’s QCs initialize in the 0 state). Next, the qubit is operated on by a Hadamard Gate, whose function is to put the qubit into a superposition, such that each basis state of the qubit (a 1 or a 0) is equally likely to be the result of measurement. Then the qubit is operated on by a measurement gate, which makes a measurement of the state of the qubit, collapsing the superposition. The result of the measurement is then stored classically. In this work, we shall refer to this algorithm as the Basic QRNG (see Fig. 1 a). Repeating this algorithm results in a random binary sequence. This ability of QRNGs to be truly random is a direct consequence of the probabilistic nature of the measurement collapse principle. After the operation of the Hadamard Gate, the qubit is in the state.

|Ψ=12|0+12|1,ketΨ12ket012ket1|\Psi\rangle=\frac{1}{\sqrt{2}}|0\rangle+\frac{1}{\sqrt{2}}|1\rangle,| roman_Ψ ⟩ = divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG | 0 ⟩ + divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG | 1 ⟩ , (1)

where each state carries 1212\frac{1}{2}divide start_ARG 1 end_ARG start_ARG 2 end_ARG of the probability. In an ideal system, measurement of the qubit’s state has exactly 50%percent5050\%50 % probability of resulting in either basis state.

II.2 Efficacy Testing Method

However, any physical system, such as real QC, is non-ideal and introduces biases such that the probability is not exactly 50%percent5050\%50 % for each state Bayesian_Criteria . A high-efficacy QC, if properly implemented and constructed, should be sufficiently devoid of defects and biases so that they have a negligible impact on the output of a QRNG. Whereas a low-efficacy QC could have significant biases which would cause a skew in its outputs. Therefore, one can investigate the efficacy of a QC by analyzing the randomness of QRNGs run on the QC to detect any bias or patterns. This is the premise of the QC efficacy testing method presented here.

The method is accomplished by means of statistical tests, which are algorithms developed to examine a data sequence for certain properties, in particular, detectable statistical patternsBayesian_Criteria ; NIST ; TBT . In line with standard terminology from related works, we will use the following terms: "statistically random" (SR) and "statistically non-random" (SNR). SR, in this context, means that no statistical patterns have been found in the data, whereas SNR indicates that a pattern or a predictable element has been observed. Each statistical test looks for a particular pattern or property and reports if the sequence passes and should be considered SR, or fails and should be considered SNR. The degree to which it passes or fails is also reported by the tests.

Different statistical tests are developed from different formulations of probability theory. Even tests developed from similar theories look for different patterns and properties and may be considered more or less "strict" than other tests. Recognizing these variations, our method applies a battery of different statistical tests. Additionally, given that a truly random source could produce a sequence that would fail statistical tests (i.e. a sequence of all ones is just as likely as any other sequence but would be found as SNR), our methods performs statistical tests on multiple QRNG outputs before a conclusion can be made about a particular source.

II.3 Additional QRNGs

While the simplest QRNG is the Basic QRNG, other QRNG algorithms may be used to investigate certain abilities of QCs or come closer to true randomness. We outline three additional QRNGs that will be utilized in this work.

Preliminary results suggested that the QCs (running the Basic QRNG) are biased towards 0s in the output. We hypothesize that this bias is an artifact of the qubits being initialized in the 0 states and that the Hadamard Gate does an imperfect job of placing the qubit into an even 50-50 superposition from that state. This can be investigated by using the following algorithm in the efficacy testing method.

First, the qubit is initialized. Then it is operated on by a NOT Gate (whose function is to flip the qubit from a 1 to a 0 or from a 0 to a 1) before passing it to the Hadamard Gate, and then to the Measurement Gate. Now all the qubits will be in the 1 state when they are operated on by the Hadamard Gate. We will refer to this algorithm as QRNG Type 1 (see Fig. 1b). If the bias towards 0s is due to the initialization of the qubit, then we should see a bias towards 1s from this algorithm.

Because the functionalities of a QC rely on superposition, it is natural to investigate various aspects of the performance of the Hadamard gate (as it is the mechanism by which superposition is obtained in a QC). One such aspect is the ability of a Hadamard gate to return a qubit to a superposition after it has already been measured. This can be done by utilizing the following algorithm. After initialization, the qubit is operated on by a Hadamard gate, then a Measurement gate (whose results are not stored), then another Hadamard gate, and then a final Measurement Gate (storing the result this time). We will refer to this algorithm as QRNG Type 2 (see Fig. 1c).

The previous algorithms have utilized only a single qubit, whereas the QCs used for this study have a total of either 5 or 7 qubits (more on these QC systems in Section 3.). Utilizing more qubits may bring us closer to realizing true randomness. This can be investigated using a four-qubit algorithm where two of the qubits follow the process of the Basic QRNG and two of the qubits follow the process of QRNG Type 1 (see Fig. 1d). We will refer to this algorithm as QRNG Type 3. This construction may counteract some of the previously observed bias, as well as create more data to enable a more thorough analysis (as now each execution of the algorithm returns four bits of data instead of one).

II.4 Application of the Method

We applied the aforementioned QC efficacy test to nine IBM Quantum systems, utilizing a battery of 18 statistical tests detailed in Bayesian_Criteria ; NIST ; TBT . See the Table 4 for details on each statistical test. These IBM systems are different physical QCs, using 3 different processor architectures. Each "trial" represents one execution of the given QRNG, which repeats the circuit 8,192 times (the number of "shots" allowed at our access level) and returns a data sequence consisting of the results of each run, chronologically. This yields each trial as a binary sequence of 8,192 digits, or 32,768 digits in the case of QRNG Type 3. By analyzing a large number of trials, we can determine how SR the systems are, and therefore the efficacy of each system. See Table 1 for the different QCs, their processors, the number of qubits, and the number of trials for each QRNG for that QC. Note that some QCs have more trials for a given QRNG, this is due to extra trials having been run before the decision to generally use 128 trials was made.

The battery of statistical tests was applied to each individual trial. Additionally, trials were strung together (to form sequences longer than 8,192 bits) and tested, to fit the minimum sequence lengths required by certain tests. We refer to these longer sequences as "combined," and are explained further in section III.A.

Table 1: The QCs utilized in this work and supplemental information about them and the number of trials used.

QC

Qubits

Processor

No. of Trials for Basic QRNG

No. of Trials for QRNG Type 1

No. of Trials for QRNG Type 2

No. of Trials for QRNG Type 3

Perth

7

Falcon r5.11H

128

128

128

128

Lagos

7

Falcon r5.11H

128

128

128

128

Nairobi

7

Falcon r5.11H

206

128

128

128

Oslo

7

Falcon r5.11H

256

128

128

255

Jakarta

7

Falcon r5.11H

128

128

128

128

Manila

5

Falcon r5.11L

325

128

128

131

Quito

5

Falcon r4T

128

128

128

128

Belem

5

Falcon r4T

128

128

128

128

Lima

5

Falcon r4T

319

128

128

133

.

Table 1: The QCs utilized in this work and supplemental information about them and the number of trials used.
Refer to caption
Refer to caption
Figure 2: Results from tests 1-15 for the Basic QRNG from the Perth IBM QC system at the trial (top) and combined (bottom) levels. The box plots represent the results of individual trials for each statistical test. The horizontal axis labels for the statistical tests reference Table 4. The red line is at a P-value of 0.01.
Refer to caption
Refer to caption
Figure 3: Results from test 16 for QRNG Type 1 from the Jakarta IBM QC System at the trial level (top) and the combined level (bottom). The bars represent the LHS values for each subsequence and are color-coded by each subsequence length. The RHS bound is represented by the red lines.

III Results

III.1 Interpretation and Usage of Statistical tests

Many times in this section, the results from a multitude of trials are presented together. For these cases, a system is considered to have failed a statistical test when the median result for its trials would be found as SNR. The results have been summarized in Table 2. In this section, we consider the interpretation of the statistical tests that led the summarized results.

A detailed example of the results of statistical tests 1-15, from NIST , is given in Figure 2 for the Basic QRNG from Perth QC. Note that the statistical tests included in the top of this Figure 2 are the ones that function for sequence lengths of 8,192, or 32,768 digits for QRNG Type 3. Here each test reports a "P-value" that indicates how SR a test is. The "decision point" is a P-value of 0.01; scoring below this threshold represents a failure and indicates the sequence as SNR. The P-value is a statement about how likely the sequence was to be generated randomly.

There are several items to note regarding tests 1-15. First, tests 7, 14, and 15 return multiple P-values (see NIST for further details). We have chosen to include the strict interpretation of these P-values, meaning if at least one of the P-values from these tests is below the decision point, then the sequence is found to be SNR by the given test. The lowest P-value for tests 7, 14, and 15 is shown in the figures in this work.

Second, tests 2, 10, 11, and 12 operate by examining all the sections of the sequence that are of a certain size, which is an input parameter for each test. Here we write the parameter generally as M𝑀Mitalic_M, despite there being a different parameter for each of these tests. For example, if M=5,000𝑀5000M=5,000italic_M = 5 , 000 for test 10, the Linear Complexity Test, the test would look at each section of up to 5,000 digits in the sequence. Each of these tests has a range of allowed values for M𝑀Mitalic_M (specified in their source material documentation) which the test is valid for. Using larger values of M𝑀Mitalic_M increases the strictness of each test. As before, We have chosen to include the strict interpretation for these tests and utilize the largest allowed M𝑀Mitalic_M values.

Refer to caption
Refer to caption
Figure 4: Results of test 17 for the combined QRNG Type 3 trials for the Lagos IBM QC system at the trial level (top) and the combined level (bottom). The bars represent the LHS values for each i-value. The RHS values are represented by the red line.

A detailed example of the results of test 16, the Borel Normality Criteria from Bayesian_Criteria , are reported in Figure 3 for the Type 1 QRNG from the Jakarta QC. The Borel Normality Criteria functions by calculating a value based on the frequency for each possible i-lengthed subsequence. Only i-values up to a certain limit are required, based on how long the sequence is, with longer sequences requiring the use of larger i values. We refer to this first value as the "LHS value." The LHS value is compared with the "RHS value," calculated from expectations about the subsequence’s frequency for a random case and the length of the sequence. A larger positive LHS value indicates an overabundance of the respective subsequence and corresponds to a larger negative LHS value for another subsequence (which appears less). If the LHS value is within the LHS value bounds, the sequence is considered to be SR.

A detailed example of the results of test 17, the Bayesian Criteria from Bayesian_Criteria , are shown in Figure 4 for the Type 3 QRNG from the Lagos QC. The Bayesian Criteria test operates similarly to the Borel Normality Criteria except that instead of considering the frequency of individual subsequences for its LHS value, it considers the relative frequency of all subsequences of length i (again, up to a certain appropriate length). It also calculates a different RHS value for each i-value. If the LHS value is less than the RHS value, the sequence is considered SR. The Bayesian Criteria is considered stricter than the Borel Normality Criteria.

A detailed example for the results of test 18, the Topological Binary Test (TBT) from TBT , are reported in Figures 5 for the Belem QC. TBT operates by comparing the number of unique subsequences of a certain length m (determined by the length of the sequence) and comparing it to a certain critical value. This threshold is found from an expectation of how many unique subsequences would be present in a randomly generated sequence of the same length. If there are more unique subsequences than the critical value, the sequence is considered SR. Note that the TBT test only operates on sequences of exactly certain lengths for different values of m𝑚mitalic_m (i.e. for m=8𝑚8m=8italic_m = 8 the sequence must be 2,048 digits, or for m=15𝑚15m=15italic_m = 15 the sequence must be 491,520 digits).

For the case of trials that are 8,192 digits long, each trial was partitioned into four sequences of 2,048 digits, or for QRNG Type 3 trials (which are 32,768 digits long) 16 sequences of 2,048 digits. This works because 8,192 happens to be an exact multiple of 2,048, which, as far as the authors are aware, is purely coincidental (none of the other sequence lengths allowed by the TBT test are exact factors of the lengths of the trials). These partitioned sequences were then tested by the TBT test using m=8. The number of unique subsequences for each of the partitioned sequences was then averaged and compared to the m=8 critical value of 150 unique subsequences. This was done to most efficiently utilize the information in all of the trials.

As mentioned earlier, in addition to analyzing all single trials from a QC-QRNG combination via statistical tests, longer "combined" sequences were constructed by stringing together (chronologically) all the trials from each of the QC-QRNG pairings. These combined sequences allow for the use of statistical tests that require longer sequences to function. Additionally, some tests become more effective at diagnosing statistical non-randomness when applied to longer sequences. These are also shown in Figures 2 - 5 for each of the statistical tests, respectively.

In regards to how to interpret the results from the statistical tests, there are several things that I will note. Because the tests tell us how likely it is that a sequence was generated randomly, failing a small portion of the tests does not necessarily mean that the sequence was not generated randomly. In particular, failing only one or two of tests 2, 7, 10, 11, 12, 14, and 15 (as we are representing the strict interpretation of these tests) does not necessarily exclude a sequence from being considered SR. It is only if a sequence fails many tests, or if many sequences from a given source repeatedly fail a test, that we can conclude them to be SNR. Ideally, a truly random sequence will pass all the statistical tests, but given that a truly random source is just as likely to generate a sequence that would fail the tests (say a sequence of all 1s), then we can only conclude that a source is SR after observing that it is rare for it to fail a test.

The results of all the statistical tests for every QC-QRNG combination have been compiled in Table 2. Note that Table 2 only shows whether a combination passed or not and does not include information on "how much" a combination passed or failed by.

Refer to caption

System

QRNG

Unique Sub-sequences

SR or SNR?

Belem

Basic

36018

SNR

1

41155

SNR

2

41189

SNR

3

41040.25

SNR

Figure 5: Results of test 18 for all of the QRNGs from the Belem QC at the trial level (top) and the combined level (bottom). For the trial level, the box plots represent the results of individual trials for each statistical test. The red line represents the threshold of 150 unique subsequences.
[Uncaptioned image]
Table 2: The results of the statistical test battery for each of the QC-QRNG pairings. The color coding indicates as follows: red, failed a given test at both the trial and combined sequence levels; yellow, passed at the trial level but failed at the combined; green, passed at both levels; blue, failed at the trial level but passed at the combined. Tests 5, 7-10, 14, and 15 only operate at the combined level and therefore are only yellow and green if the QC-QRNG failed or passed, respectively. Oslo Type 2 has been highlighted for its significance.
[Uncaptioned image]
Table 3: The results of the statistical test battery for each of the Comparison systems. Again, the color coding indicates as follows: red, failed a given test at both the trial and combined sequence levels; yellow, passed at the trial level but failed at the combined; green, passed at both levels; magenta, failed at the trial level but passed at the combined. Tests 5, 7-10, 14, and 15 only operate at the combined level and therefore are only yellow and green if the comparison system failed or passed, respectively.

III.2 QC Efficay Rankings

It can be seen that the different QC systems demonstrate varying levels of statistical randomness, which depends further upon the QRNG algorithm. The Basic QRNG yields the most SR outputs for Oslo and Lagos. QRNG Type 1 yields the most SR outputs from Lagos, Nairobi, Oslo, and Lima. QRNG Type 2 yields the most random outputs from Oslo. QRNG Type 3 yields the most SR outputs from Lagos and Oslo (and yields highly SNR outputs for all the other systems). Despite this variation, we can still generally group the systems by efficacy; with Lagos and Oslo being of high efficacy; Lima, Nairobi, Manila, and Belem being of intermediate efficacy; and Quito, Perth, and Jakarta being of low efficacy. Only a single QC-QRNG combination is found as SR, QRNG Type 2 from Oslo. Additionally, Oslo is the only QC that always passes at the trial level.

IV Discussion

IV.1 Ranking and Performance of the QCs

While a truly exhaustive discussion of the implications of the results of the statistical tests for each QC - QRNG combination is a significant task, suitable for its own work, here we summarise several key conclusions. Some of the higher efficacy systems perform very well at the trial level for some of the QRNGs, some well enough that it suggests the QC could be capable of true randomness. However, when considering the combined trial sequences for each QC - QRNG pairings, all but one are overwhelmingly found to be SNR. The only one that is found to be SR is QRNG Type 2 from Oslo.

QRNG Type 2 from Oslo passes every test at the trial level and only fails two tests at the combined level, which is not enough to conclude it to be SNR. The tests that it fails are tests 3 and 7. This is momentous and suggests that QRNG Type 2 ran on the Oslo QC may indeed be a realization of a truly random physical system. Further testing should be done to determine whether this result is replicable.

IV.2 Comparison Systems

In order to provide a point of comparison for the QCs, We applied the statistical tests to a series of systems, which we refer to as the "comparison systems". The first comparison system is the binary expansion of the digits of pi. Given that there are no known patterns among the digits of pi, it is used as a "gold standard" of randomness NIST . It is important to note, that while we take the decimal digits of pi to be statistically random, the conversion from decimal digits to binary results in repetitions of subsequences which are the binary forms of respective decimal digits. This causes the binary expansion of pi to be found as SNR by test 12, despite it being SR in nature.

The second comparison system is Python’s built-in PRNG, which operates on the Mersenne Twister Algorithm and is known to be difficult to show as SNR, but not SR enough to be considered cryptographically secure python_docs_random . This system demonstrates the performance of a high-quality, but not state-of-the-art, PRNG.

The third comparison system is the PRNG found in Python’s Secrets module, which utilizes a computer’s low-level operating system security protocols to operate PRNGs (so on a Windows computer, as was used here, the module uses Window’s cryptographically secure PRNG) python_docs_secrets . This system illustrates the performance of a PRNG of the highest class.

The fourth comparison system is an intentionally SNR PRNG, generated to have a bias of 52% towards 0s and 48% against 1s. This was done utilizing the Python module Numpy numpy and will be referred to as "PRNG Biased." This system serves as an example of the results that would be seen from a non-random system.

The fifth comparison system is similar in nature to PRNG Biased but adds a second step. First, a "precursor" bit is generated as 1 or 0 with a 0.52 bias towards 0 and 0.48 bias against 1s. Then, if the precursor bit is a 0, the "final" bit is generated with the same bias as the precursor bit, but if the precursor bit is a 1, then the final bit is generated with a 0.48 bias away from 0s and 0.52 towards 1s. The final bit is then included in the binary sequence. We will refer to this system as "PRNG Biased 2 Step." This system illustrates what happens when successive non-random steps counteract each other.

The data from each comparison system was generated in 128 trials of 8,192 digits. These trials were individually analyzed by the statistical tests, as well as in a combined form. This was done to mimic the format of the data from the QCs. The results of the statistical test battery applied to these comparison systems are shown in Table 3.

Refer to caption
Refer to caption
Figure 6: Results from tests 1-15 of the QRNG Type 2 from the Oslo IBM QC system (top) and Python’s PRNG (bottom) at the combined level. The box plots represent the results of individual trials for each statistical test. The indexes for the statistical tests reference Table 1. The red line is at a P-value of 0.01. Notice that while both systems pass all but 2 tests, Python’s PRNG tends to have higher P-values

As expected, the binary expansion of pi, the Secrets module, and Python’s PRNG all perform very well, passing all the tests at the trial level. At the combined level, Pi and the Secrets module fail tests 7 and 12, but, as discussed earlier, because of the strict interpretation of these tests used in this work, this result is not unusual. Python’s PRNG fails tests 6 and 7 at the combined level, which is consistent with it being less random than the previous two. PRNG Biased fails four tests at the trial level, and nearly all of them at the combined level. Clearly, even 2% deviation from an evenly split probability of bits being 1s or 0s is easily detected by the statistical tests.

Curiously, PRNG Biased 2 Step is found to be more SR than the other comparison systems, failing only a single test, test 12, at the combined level, even though we know it to have been generated by a nonrandom algorithm. This demonstrates an interesting phenomenon: successive non-random processes can effectively "cancel out" each other’s non-randomness, and result in statistically random outputs. We refer to this phenomenon as "bias counteraction."

IV.3 Inferences about QCs

By analyzing the specific ways that the QCs are found to be SR or SNR, and by comparing them with the comparison systems, several things are inferred about the QCs

Inference 1. The fact that many QC-QRNG pairings perform well at the trial level, but very poorly at the combined level suggests that whatever the form of the bias in the outputs is, it is not impactful within each trial. That is, the bias does not create any patterns within a given trial, but the bias repeats itself for each trial.

Inference 2. The comparison systems (excluding PRNG Biased), were all found to be SR to a high degree, despite the fact that (aside from Pi) it is known that they were generated from non-random sources. QRNG Type 2 from Oslo, despite being the only system to be found as SR, did not display as high a degree of randomness as the comparison systems. This can be seen in Figure 6 and demonstrates that, at least for these QCs, operating these QRNGs, "quantum supremacy" has yet to be realized for the purpose of random number generation.

Inference 3. Every QC-QRNG combination except the Basic QRNG and QRNG Type 2 from Oslo, and the Basic QRNG and QRNG Types 2 and 3 from Lagos shows an overabundance of 0s (which can be seen from the test 16 LHS values at the combined level). This suggests that for the rest of the QCs, there is a significant source of bias towards 0s. This could potentially be caused by either the Hadamard gate being faulty and creating a superposition state with a bias towards 0s, the presence of noise is the system manifesting as 0s, both, or something else. Notice that the results from the PRNG Biased system closely resemble the results for many of the QC-QRNG pairings. This provides further evidence that the bias for many of the systems manifests as each qubit measurement having a bias towards 0s.

Inference 4. Several QCs systems, Nairobi, Manila, and Lima, show an improvement in performance from QRNG Type 1 over the Basic QRNG. This suggests that, for these systems, the bias towards 0s does in fact come from the Hadamard gate creating a superposition state with a bias toward the initial state. Furthermore, several systems, Nairobi, Oslo, Lagos, Lima, and Belem, all exhibit more random results from QRNG Type 2 than Type 1 or the Basic QRNG, while the other systems perform similarly for Type 1 and 2. These systems which perform better with QRNG Type 2 may be due to the effect of bias counteraction, resulting from the successive Hadamard gates, similar to what is observed in the Biased 2 Step comparison system.

It should be noted that while the effect of bias counteraction that is observed in QRNG Type 2 and PRNG Biased 2 Step is similar, the underlying mechanism is different. In PRNG Biased 2 Step, the bias counteraction works by making the system more chaotic, obfuscating the fundamentally deterministic nature of the algorithm, and making it appear SR. However, in the case of QRNG Type 2, the bias counteraction works to "filter out" the noise present in the system, or whatever the source of the bias may be, distilling the system to its fundamentally statistical nature.

Inference 5. Almost every QC is found as heavily SNR for QRNG Type 3 with only Lagos and Oslo performing reasonably well at either the trial or combined level. Considering that several systems, Lagos, Lima, Oslo, and Perth, fail Tests 16 and 17 at the i=4𝑖4i=4italic_i = 4 by a much larger margin than any other i𝑖iitalic_i levels, this indicates that there are repetitions of 4 digit patterns in the resulting data. This suggests that the state of the qubits is not sufficiently independent of each other and that the qubits are becoming unintentionally entangled or otherwise correlated to some degree. For the other systems, either this entanglement is not occurring, or perhaps, considering that they perform worse for QRNG Type 3 than the other QRNGs, the source of the bias is significant enough to prevent the 4-digit patterns from being apparent. Another possibility is that the bias increases with the number of qubits in use.

IV.4 Conclusion

In this work, we proposed a method of investigating the efficacy of QCs, by utilizing their theoretical potential to demonstrate true randomness. The method involved repeatedly executing a QRNG on the computer and analyzing the statistical randomness of its results, with the analysis being done using a battery of 18 statistical tests.

We then applied this efficacy testing method to nine IBM quantum computers, Perth, Lagos, Nairobi, Oslo, Jakarta, Manila, Quito, Belem, and Lima, using four different QRNG algorithms, each designed to investigate certain properties of the QCs. All but one of the QC-QRNG pairings were found to be statistically non-random, with the output from the QRNG Type 2 algorithm ran on the Oslo QC being found as statistically random. This suggests that this QC-QRNG combination may indeed be a physical realization of true randomness.

We compared the statistical randomness of the QCs to five comparison systems, the binary expansion of pi, a PRNG using Python’s Secrets module, a PRNG using Python’s built-in PRNG, a biased "PRNG," and a PRNG whose biases cancel out. None of the quantum computers performed better than these sources (aside from the biased PRNG), QRNG Type 2 from Oslo being found as random and coming close in terms of statistical randomness. This indicates that further work is necessary in order for these quantum computers to fully realize their theoretical potential to be truly random, either by addressing biases present in the hardware of the QCs or by utilizing quantum algorithms that are capable of extracting the true randomness through the bias. QRNGs utilizing repeated Hadamard gates, such as QRNG Type 2, could be a promising direction for achieving such algorithms.

Acknowledgements

We acknowledge the use of IBM Quantum services for this work. The views expressed are those of the authors, and do not reflect the official policy or position of IBM or the IBM Quantum team.

Statistical Test Function

1. Monobit Test

Tests for even distribution of 1’s and 0’s NIST

2. Frequency Test Within a Block

Tests for even distribution of 1’s and 0’s within subsequencesNIST

3. Runs Test

Tests for uninterrupted sequences of 1’s or 0’s NIST

4. Test for the Longest Run of Ones in a Block

Tests for the longest run of ones in a block of given sizes matches expectations of a random case NIST

5. Binary Matrix Test

Tests the rank property of matrices made from subsequences. NIST

6. Discrete Fourier Transform Test

Analyzes the peak heights in the Discrete Fourier Transform of the sequence.

7. Non-overlap** Template Matching Test

Tests for the number of occurrences of pre-specified target strings within non-overlap** blocks of the sequence NIST

8. Overlap** Template Matching Test

Test for the number of occurrences of pre-specified target strings within overlap** blocks of the sequence NIST

9. Maurer’s "Universal Statistical" Test

Tests for the number of bits between matching patterns.

10. Linear Complexity Test

Tests for the length of a linear feedback shift register.

11. Serial Test

Tests for the frequency of all possible overlap** patterns of a specified length across the entire sequence NIST

12. Approximate Entropy Test

Tests for the frequency of overlap** blocks of two adjacent lengths against the expected result for a random sequence NIST

13 Cumulative Sums Test

Tests for the maximal excursion (from zero) of the random walk defined by the cumulative sum of adjusted digits in the sequence NIST . 13.1 and 13.2 designate the forward and backward versions, respectively.

14. Random Excursions Test

Tests for the number of cycles having exactly K visits in a cumulative sum random walk NIST

15. Random Excursions Variant Test

Tests for the total number of times that a particular state is visited in a cumulative sum random walkNIST

16. Borel Normality Criterion

Tests for the distribution of individual subsequences.Bayesian_Criteria .

17. Bayesian Criteria

Tests for the relative distribution of all subsequences (up to a certain length). Bayesian_Criteria

18. Topological Binary Test

Tests whether the frequency of unique subsequences is higher enough.TBT

Table 4: The statistical test battery and respective descriptions.

References

  • (1) P. W. Shor, "Algorithms for quantum computation: Discrete logarithms and factoring," in Proceedings 35th Annual Symposium on Foundations of Computer Science, IEEE Comput. Soc. Press, 1994, pp. 124–134.
  • (2) F. Arute, K. Arya, R. Babbush, et al., "Quantum supremacy using a programmable superconducting processor," Nature, vol. 574, pp. 505–510, 2019. [Online]. Available: https://doi.org/10.1038/s41586-019-1666-5
  • (3) C. Hughes, J. Isaacson, A. Perry, R. F. Sun, and J. Turner, Quantum Computing for the Quantum Curious, Springer, 2021.
  • (4) IBM Quantum. [Online]. Available: https://quantum-computing.ibm.com, accessed 16 October 2022.
  • (5) "Quantum for Educators," IBM. [Online]. Available: https://www.ibm.com/quantum/educators, accessed 16 October 2022.
  • (6) M. Mulko, "Random number generation: What are its functions and the fields of usage?," Interesting Engineering, 13 May 2022. [Online]. Available: https://interestingengineering.com/innovation/random-number-generation, accessed 15 October 2022.
  • (7) V. Mannalath, S. Mishra, and A. Pathak, "A comprehensive review of quantum random number generators: Concepts, classification and the origin of randomness," arXiv preprint arXiv:2203.00261, 2022.
  • (8) A. C. Martinez, A. Solis, R. Diaz Hernandez Rojas, A. B. U’Ren, J. G. Hirsch, and I. Perez Castillo, "Advanced statistical testing of quantum random number generators," Entropy, vol. 20, p. 886, 2018. [Online]. Available: https://doi.org/10.3390/e2011088
  • (9) M. Eaton, A. Hossameldin, R. J. Birrittella, et al., "Resolution of 100 photons and quantum generation of unbiased random numbers," Nat. Photon., vol. 17, pp. 106–111, 2023. [Online]. Available: https://doi.org/10.1038/s41566-022-01105-9
  • (10) A. Rukhin, J. Soto, J. Nechvatal, M. Smid, and E. Barker, "A statistical test suite for random and pseudorandom number generators for cryptographic applications," Technical report, Booz Allen and Hamilton Inc Mclean Va, 2001.
  • (11) P. M. Alcover, A. Guillamon, and M. d. C. A. Ruiz, "New randomness test for bit sequences," Informatica, January 2013.
  • (12) The Python Foundation, "random — Generate pseudo-random," Python Documentation. [Online]. Available: https://docs.python.org/3/library/random.html, accessed 9 December 2022.
  • (13) Python Documentation, "secrets — Generate secure random numbers for managing secrets". [Online]. Available: https://docs.python.org/3/library/secrets.html, accessed 9 December 2022
  • (14) Numpy Documentation. [Online]. Available: https://numpy.org/doc/stable/, accessed 9 December 2022