License: CC BY 4.0
arXiv:2308.14993v2 [cs.IT] 27 Jan 2024

On $k$-Mer-Based and Maximum Likelihood Estimation Algorithms for Trace Reconstruction

Kuan Cheng: Peking University, Haidian, Beijing, China. Email: [email protected].

Elena Grigorescu: Purdue University, West Lafayette, IN, USA. Supported in part by NSF CCF-1910411, and NSF CCF-2228814. Email: [email protected].

Xin Li: Johns Hopkins University, Baltimore, MD, USA. Supported in part by NSF CAREER Award CCF-1845349 and NSF Award CCF-2127575. Email: [email protected].

Madhu Sudan: School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, USA. Supported in part by a Simons Investigator Award and NSF Award CCF 2152413. Email: [email protected].

Minshen Zhu: Most of the work was done as a PhD student at Purdue University. Supported in part by NSF CCF-1910411, and NSF CCF-2228814. Email: [email protected].

Abstract

The goal of the trace reconstruction problem is to recover a string $\mathbf{x}\in\{0,1\}^{n}$ given many independent traces of $\mathbf{x}$, where a trace is a subsequence obtained from deleting bits of $\mathbf{x}$ independently with some given probability $p\in[0,1)$. A recent result of Chase (STOC 2021) shows how $\mathbf{x}$ can be determined (in exponential time) from $\exp(O(n^{1/5})\log^{5}n)$ traces. This is the state-of-the-art result on the sample complexity of trace reconstruction.

In this paper we consider two kinds of algorithms for the trace reconstruction problem.

We first observe that the bound of Chase, which is based on statistics of arbitrary length-$k$ subsequences, can also be obtained by considering the “$k$-mer statistics”, i.e., statistics regarding occurrences of contiguous $k$-bit strings (a.k.a. $k$-mers) in the initial string $\mathbf{x}$, for $k=2n^{1/5}$. Mazooji and Shomorony (arXiv.2210.10917) show that such statistics (called the $k$-mer density map) can be estimated within $\varepsilon$ accuracy from $\mathrm{poly}(n,2^{k},1/\varepsilon)$ traces. We call an algorithm $k$-mer-based if it reconstructs $\mathbf{x}$ given estimates of the $k$-mer density map. Such algorithms essentially capture all the analyses in the worst-case and smoothed-complexity models of the trace reconstruction problem we know of so far.

Our first, and technically more involved, result shows that any $k$-mer-based algorithm for trace reconstruction must use $\exp(\Omega(n^{1/5}\sqrt{\log n}))$ traces, under the assumption that the estimator requires $\mathrm{poly}(2^{k},1/\varepsilon)$ traces, thus establishing the optimality of this number of traces. The analysis of this result also shows that the analysis technique used by Chase (STOC 2021) is essentially tight, and hence new techniques are needed in order to improve the worst-case upper bound.

This result is shown by considering an appropriate class of real polynomials, that have been previously studied in the context of trace estimation (De, O’Donnell, Servedio. Annals of Probability 2019; Nazarov, Peres. STOC 2017), and proving that two of these polynomials are very close to each other on an arc in the complex plane. Our proof of the proximity of such polynomials uses new technical ingredients that allow us to focus on just a few coefficients of these polynomials.

Our second, simple, result considers the performance of the Maximum Likelihood Estimator (MLE), which specifically picks the source string that has the maximum likelihood to generate the samples (traces). We show that the MLE algorithm uses a nearly optimal number of traces, i.e., up to a factor of $n$ in the number of samples needed for an optimal algorithm, and show that this factor of $n$ loss may be necessary under general “model estimation” settings.

1 Introduction

The trace reconstruction problem is an infamous question introduced by Batu, Kannan, Khanna and McGregor [BKKM04] in the context of computational biology. It asks to design algorithms that recover a string $\mathbf{x}\in\{0,1\}^{n}$ given access to traces $\tilde{\mathbf{x}}$ of $\mathbf{x}$, obtained by deleting each bit independently with some given probability $p\in[0,1)$. The best current upper and lower bounds are exponentially apart, namely $\exp(\widetilde{O}(n^{1/5}))$ traces are sufficient for reconstruction [Cha21b] (improving upon the $\exp(O(n^{1/3}))$ of [NP17, DOS19]) and $\widetilde{\Omega}(n^{3/2})$ [HL20, Cha21a] are necessary.
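To make the deletion channel concrete, here is a minimal Python sketch of how a single trace could be sampled. The function name `sample_trace`, the example string, and the fixed seed are illustrative choices, not part of the paper.

```python
import random

def sample_trace(x, p, rng):
    """Pass the bit string x through the deletion channel: each bit is
    deleted independently with probability p, survivors keep their order."""
    return [b for b in x if rng.random() >= p]

rng = random.Random(0)                 # fixed seed, only for reproducibility
x = [1, 0, 1, 1, 0, 0, 1, 0]           # hypothetical source string
traces = [sample_trace(x, 0.3, rng) for _ in range(5)]
```

Every sampled trace is a subsequence of `x`, and its length is random (binomially distributed with parameters $n$ and $1-p$).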

The problem has been studied in several variants [BKKM04, KM05, VS08, HMPW08, MPV14, PZ17, NP17, DOS19, GM17, HPP18, HL20, HHP18, GM19, CGMR20, KMMP21, BLS20, CDL+21b, Cha21b, CP21, NR21, SB21, GSZ22, Rub23], and it continues to elicit interest due to its deceptively simple formulation, as well as its motivating applications to DNA computing [YGM17].

In this paper, we focus on the worst-case formulation of the problem, which is equivalent from an information-theoretic point of view to the distinguishing variant. In this variant, the goal is to distinguish whether the received traces come from string $\mathbf{x}\in\{0,1\}^{n}$ or from $\mathbf{y}\in\{0,1\}^{n}$, for some known $\mathbf{x}\neq\mathbf{y}$.

Algorithms based on $k$-bit statistics

A very natural kind of algorithm [HMPW08, NP17, DOS19] operates using the mean of the received traces at each location $i\in[n]$ (one may assume that traces of length smaller than $n$ are padded with $0$’s at the end). Indeed, let $\mathcal{D}_{\mathbf{x}}$ be the distribution of the traces induced by the deletion channel on input $\mathbf{x}$. A mean-based (i.e., $1$-bit-statistics-based) algorithm first estimates from the received traces the mean vector $\mathbf{E}(\mathbf{x})=\left(E_{0}(\mathbf{x}),\cdots,E_{n-1}(\mathbf{x})\right)\in[0,1]^{n}$, where the $j$-th coordinate is defined as

$$
E_{j}(\mathbf{x})=\underset{\tilde{\mathbf{x}}\sim\mathcal{D}_{\mathbf{x}}}{\mathbb{E}}\left[\widetilde{x}_{j}\right].
$$

It then may perform further post-processing without further inspection of the traces.
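The estimation step of a mean-based algorithm can be sketched as follows. This is only an illustration under stated assumptions: the helper name `empirical_mean` and the three hand-written traces are hypothetical, and every trace is assumed to have length at most $n$.

```python
def empirical_mean(traces, n):
    """Estimate E(x) = (E_0(x), ..., E_{n-1}(x)) by averaging the traces
    coordinate-wise, treating each trace as right-padded with zeros."""
    totals = [0] * n
    for t in traces:
        for j, bit in enumerate(t):    # padded positions contribute 0
            totals[j] += bit
    return [s / len(traces) for s in totals]

traces = [[1, 0, 1], [1, 1], [0, 1, 1, 0]]   # hypothetical received traces
estimate = empirical_mean(traces, n=4)
```

After this step the traces themselves are no longer consulted; only `estimate` is passed on to post-processing.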

Solving the distinguishing problem then reduces by standard arguments to understanding the $\ell_{1}$-norm between the mean traces of $\mathbf{x}$ and $\mathbf{y}$, namely the number $T$ of traces satisfies

$$
\Omega\left(1/\left\lVert\mathbf{E}(\mathbf{x})-\mathbf{E}(\mathbf{y})\right\rVert_{\ell_{1}}\right)=T=O\left(1/\left\lVert\mathbf{E}(\mathbf{x})-\mathbf{E}(\mathbf{y})\right\rVert_{\ell_{1}}^{2}\right).
$$

[NP17, DOS19] related the $\ell_{1}$-norm above to the supremum of a certain real univariate polynomial over the complex plane. Using techniques from complex analysis, they proved that the mean-based algorithm that uses $\exp(O(n^{1/3}))$ traces and outputs the string $\mathbf{s}\in\{\mathbf{x},\mathbf{y}\}$ whose $\mathbf{E}(\mathbf{s})$ is closer in $\ell_{1}$-distance to the estimate is a successful reconstruction algorithm. Furthermore, any mean-based algorithm needs $\exp(\Omega(n^{1/3}))$ traces to succeed with high probability [NP17, DOS19].

A general class of algorithms may operate by using $k$-bit statistics [Cha21b], for $k\geq 1$. Specifically, for $w\in\{0,1\}^{k}$, the algorithm estimates from the given traces, for tuples $0\leq i_{0}<i_{1}<\dots<i_{k-1}\leq n-1$, the quantity

$$
\underset{\tilde{\mathbf{x}}\sim\mathcal{D}_{\mathbf{x}}}{\mathbb{E}}\left[\prod_{0\leq j<k}\mathbf{1}\left\{\widetilde{x}_{i_{j}}=w_{j}\right\}\right].
$$

After the estimation step, whose accuracy can be argued via standard Chernoff bounds, the algorithm does not need the traces anymore and may perform further post-processing in order to output the correct string. The result of Chase follows from showing that for $k=2n^{1/5}$ there is a string $w\in\{0,1\}^{k}$ for which the $\ell_{1}$-distance between the corresponding $k$-bit statistics of $\mathbf{x}$ and $\mathbf{y}$ is large.

Algorithms based on $k$-mer statistics

Another variant proposed by Mazooji and Shomorony [MS22] considers algorithms which operate using estimates of statistics regarding occurrences of contiguous $k$-bit strings (a.k.a. $k$-mers) in the initial string $\mathbf{x}$. We denote by $\mathbf{1}\{\mathbf{x}[j\colon j+k-1]=w\}$ the indicator bit of whether $w\in\{0,1\}^{k}$ occurs as a subword in $\mathbf{x}$ from position $j$.

In particular, [MS22] made the following definition which is central to our paper.

Definition 1 ([MS22]).

Given a string $\mathbf{x}\in\{0,1\}^{n}$ and a $k$-mer $w\in\{0,1\}^{k}$, for $i=0,1,\dots,n-1$ denote

$$
K_{w,\mathbf{x}}[i]\coloneqq\sum_{j=0}^{n-k}\binom{j}{i}p^{j-i}(1-p)^{i}\cdot\mathbf{1}\left\{\mathbf{x}[j\colon j+k-1]=w\right\}.
$$

The vector $K_{\mathbf{x}}\coloneqq\left(K_{w,\mathbf{x}}[i]\colon w\in\{0,1\}^{k},i\in[n]\right)$ is called the $k$-mer density map of $\mathbf{x}$.
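Definition 1 can be evaluated directly for small inputs. The sketch below follows the formula literally; the function name `kmer_density_map`, the dictionary layout, and the example input are illustrative assumptions rather than anything prescribed by [MS22].

```python
from itertools import product
from math import comb

def kmer_density_map(x, k, p):
    """Return {w: [K_{w,x}[0], ..., K_{w,x}[n-1]]} for every k-mer w,
    computed term by term from Definition 1."""
    n = len(x)
    K = {w: [0.0] * n for w in product((0, 1), repeat=k)}
    for j in range(n - k + 1):         # start positions j = 0, ..., n-k
        w = tuple(x[j:j + k])          # the k-mer x[j : j+k-1] (inclusive)
        for i in range(j + 1):         # binom(j, i) vanishes for i > j
            K[w][i] += comb(j, i) * p ** (j - i) * (1 - p) ** i
    return K

K = kmer_density_map((1, 0, 1, 1), k=2, p=0.5)
```

For this input the $2$-mer $11$ occurs only at position $j=2$, so its row of the map is the binomial weight vector $(1/4, 1/2, 1/4, 0)$.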

Note that the mean vector $\mathbf{E}(\mathbf{x})$ is, up to a factor of $1-p$, equivalent to the $1$-mer density map. Indeed, for $k=1$ and $w=1$ we have

$$
\begin{aligned}
E_{i}(\mathbf{x}) &= \underset{\tilde{\mathbf{x}}\sim\mathcal{D}_{\mathbf{x}}}{\mathbb{E}}\left[\widetilde{x}_{i}\right]=\sum_{j=0}^{n-1}\Pr\left[\widetilde{x}_{i}\text{ comes from }x_{j}\right]\cdot x_{j} \\
&= \sum_{j=0}^{n-1}\binom{j}{i}p^{j-i}(1-p)^{i+1}\cdot x_{j}=(1-p)\cdot\sum_{j=0}^{n-1}\binom{j}{i}p^{j-i}(1-p)^{i}\cdot\mathbf{1}\left\{\mathbf{x}[j\colon j]=1\right\}=(1-p)K_{1,\mathbf{x}}[i].
\end{aligned}
$$
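The identity $E_{i}(\mathbf{x})=(1-p)K_{1,\mathbf{x}}[i]$ can be checked numerically on a small example by enumerating all $2^{n}$ deletion patterns. The following sketch does this; the helper names and the test string are hypothetical, and the brute-force enumeration is only feasible for very short strings.

```python
from itertools import product
from math import comb

def mean_vector_bruteforce(x, p):
    """E_i(x), computed by summing over all 2^n keep/delete patterns."""
    n = len(x)
    E = [0.0] * n
    for keep in product((False, True), repeat=n):
        kept = [b for b, flag in zip(x, keep) if flag]
        prob = (1 - p) ** len(kept) * p ** (n - len(kept))
        for i, b in enumerate(kept):   # zero padding adds nothing
            E[i] += prob * b
    return E

def one_mer_density(x, p):
    """K_{1,x}[i] from Definition 1 with k = 1 and w = 1."""
    n = len(x)
    return [sum(comb(j, i) * p ** (j - i) * (1 - p) ** i * x[j]
                for j in range(i, n))  # terms with j < i vanish
            for i in range(n)]

x, p = [1, 0, 1, 1, 0], 0.3
E = mean_vector_bruteforce(x, p)
K1 = one_mer_density(x, p)
# Largest deviation from E_i = (1 - p) * K_1[i]; should be rounding error.
max_gap = max(abs(e - (1 - p) * k) for e, k in zip(E, K1))
```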

As noted in [MS22], the techniques of [CDL+21b] in the smoothed complexity model of trace reconstruction can also be viewed as based on $k$-mer density maps. Indeed, for a fixed $w\in\{0,1\}^{k}$, the number of its occurrences as a subword in $\mathbf{x}$ is $\sum_{j=0}^{n-1}\mathbf{1}\{\mathbf{x}[j\colon j+k-1]=w\}=\sum_{i=0}^{n-1}K_{w,\mathbf{x}}[i]$. They show that for $k=O(\log n)$, the subword vector (indexed by $w\in\{0,1\}^{k}$) uniquely determines the source string, with high probability [CDL+21b, Lemma 1.1].
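The occurrence-count identity above follows from the binomial theorem. A short derivation, written out here for convenience (it is not spelled out in the text, and it reads the indicator as $0$ whenever the window $[j\colon j+k-1]$ does not fit inside $\mathbf{x}$):

```latex
\begin{aligned}
\sum_{i=0}^{n-1} K_{w,\mathbf{x}}[i]
  &= \sum_{j=0}^{n-k} \mathbf{1}\{\mathbf{x}[j\colon j+k-1]=w\}
     \sum_{i=0}^{j} \binom{j}{i} p^{j-i}(1-p)^{i} \\
  &= \sum_{j=0}^{n-k} \mathbf{1}\{\mathbf{x}[j\colon j+k-1]=w\}\,\bigl(p+(1-p)\bigr)^{j}
   = \sum_{j=0}^{n-k} \mathbf{1}\{\mathbf{x}[j\colon j+k-1]=w\}.
\end{aligned}
```

The inner sum stops at $i=j$ because $\binom{j}{i}=0$ for $i>j$.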

The main result of [MS22] is that given access to $T=\varepsilon^{-2}\cdot 2^{O(k)}\mathrm{poly}(n)$ traces of $\mathbf{x}$, one can compute an estimate $\hat{K}_{\mathbf{x}}$ of the $k$-mer density map $K_{\mathbf{x}}$ which is entry-wise $\varepsilon$-accurate, i.e., $\lVert\hat{K}_{\mathbf{x}}-K_{\mathbf{x}}\rVert_{\ell_{\infty}}\leq\varepsilon$. We remark that by replacing $\varepsilon$ with $\varepsilon/(2^{k}n)$, one gets an estimate which is $\varepsilon$-accurate in $\ell_{1}$-norm, while using asymptotically the same number of traces.
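The remark can be unpacked in two lines. The density map has $2^{k}n$ entries, so an entry-wise error of $\varepsilon/(2^{k}n)$ sums to at most $\varepsilon$, and the extra factor in the trace count is absorbed by the $2^{O(k)}\mathrm{poly}(n)$ term:

```latex
\begin{aligned}
\bigl\lVert \hat{K}_{\mathbf{x}} - K_{\mathbf{x}} \bigr\rVert_{\ell_{1}}
  &\le 2^{k} n \cdot \bigl\lVert \hat{K}_{\mathbf{x}} - K_{\mathbf{x}} \bigr\rVert_{\ell_{\infty}}
   \le 2^{k} n \cdot \frac{\varepsilon}{2^{k} n} = \varepsilon, \\
\Bigl(\frac{\varepsilon}{2^{k} n}\Bigr)^{-2} \cdot 2^{O(k)}\,\mathrm{poly}(n)
  &= \varepsilon^{-2} \cdot 2^{2k} n^{2} \cdot 2^{O(k)}\,\mathrm{poly}(n)
   = \varepsilon^{-2} \cdot 2^{O(k)}\,\mathrm{poly}(n).
\end{aligned}
```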

We make the following definition generalizing mean-based algorithms ([DOS19, NP17]).

Definition 2.

(Algorithms based on $k$-mer statistics) A trace reconstruction algorithm based on $k$-mer statistics works in two steps as follows:

  1. Once the unknown source string $\mathbf{x}\in\{0,1\}^{n}$ is picked, it chooses an accuracy parameter $\varepsilon\in(0,1]$. It then receives an $\varepsilon$-accurate estimate (in $\ell_{1}$-norm) of the $k$-mer density map $K_{\mathbf{x}}$ based on the traces. From here on the algorithm has no more access to the traces themselves. We define the cost of this part to be $2^{k}/\varepsilon$.

  2. The algorithm may perform further post-processing and finish by outputting the source string.

Since there is an algorithm to $\varepsilon$-estimate the $k$-mer density map with $\varepsilon^{-2}\cdot 2^{O(k)}\mathrm{poly}(n)$ many traces [MS22], it follows that an algorithm defined as in Definition 2 with cost $T$ can be turned into a trace reconstruction algorithm with $\mathrm{poly}(T)$ samples.

We note that the $k$-mer density map estimators of [MS22] only use $k$-bit statistics of the traces, in fact statistics about contiguous $k$ bits in the traces, and hence $k$-mer-based algorithms are a subclass of algorithms based on $k$-bit statistics.

In this work, we first observe that the upper bounds of Chase [Cha21b] can in fact be obtained via $k$-mer-based algorithms (see the formal statement in Theorem 1), and hence by only using statistics of contiguous subwords of the traces. Our main result says that $k$-mer-based algorithms require $\exp(\Omega(n^{1/5}\sqrt{\log n}))$ many traces (see Theorem 2). In addition, the analysis of this result implies that the proof technique in Chase [Cha21b] cannot lead to a better analysis of the sample complexity (up to $\log^{4.5}n$ factors in the exponent), and hence new techniques are needed to significantly improve the current upper bound.

The Maximum Likelihood Estimator

In model estimation settings, a common tool for picking a “model” that best explains the observed data is the Maximum Likelihood Estimator (MLE). In the setting of trace reconstruction, it is natural to ask: What is the most likely trace distribution $\mathcal{D}_{\mathbf{x}}$ (and hence $\mathbf{x}$) to have produced the given sample/trace(s)? We formalize MLE next.

Definition 3 (Maximum Likelihood Estimation).

Let $\mathcal{D}=\{D_{1},D_{2},\dots,D_{m}\}$ be a finite set of probability distributions over a common domain $\Omega$. Given a sample $x\in\Omega$, the output of the Maximum Likelihood Estimation is (ties are broken arbitrarily)

$$
\mathrm{MLE}(x;\mathcal{D})\coloneqq\arg\max_{i\in[m]}D_{i}(x).
$$

For independently and identically distributed samples $x_{1},x_{2},\ldots,x_{k}\in\Omega$, the output of the Maximum Likelihood Estimation is (ties are broken arbitrarily)

$$
\mathrm{MLE}(x_{1},x_{2},\ldots,x_{k};\mathcal{D})\coloneqq\arg\max_{i\in[m]}\prod_{j\in[k]}D_{i}(x_{j}).
$$
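For the deletion channel, Definition 3 can be instantiated by brute force on tiny inputs. The sketch below assumes $0<p<1$ and relies on a standard fact not stated in the text: the probability that $\mathbf{x}$ produces a trace $t$ equals the number of ways $t$ embeds in $\mathbf{x}$ as a subsequence, times $p^{n-|t|}(1-p)^{|t|}$. All function names are hypothetical.

```python
from itertools import product
from math import log

def count_embeddings(t, x):
    """Number of index sets at which t occurs as a subsequence of x."""
    ways = [1] + [0] * len(t)          # ways[m]: embeddings of t[:m] so far
    for b in x:
        for m in range(len(t), 0, -1): # backwards, so b is used at most once
            if t[m - 1] == b:
                ways[m] += ways[m - 1]
    return ways[len(t)]

def trace_log_likelihood(traces, x, p):
    """log of prod_j D_x(t_j) for i.i.d. traces t_j of x, with 0 < p < 1."""
    n, total = len(x), 0.0
    for t in traces:
        c = count_embeddings(t, x)
        if c == 0:                     # x cannot produce this trace
            return float("-inf")
        total += log(c) + (n - len(t)) * log(p) + len(t) * log(1 - p)
    return total

def mle(traces, n, p):
    """Exhaustive search over all 2^n candidate source strings."""
    return max(product((0, 1), repeat=n),
               key=lambda x: trace_log_likelihood(traces, x, p))
```

Working with log-likelihoods avoids underflow when many traces are multiplied together. Ties are resolved by whichever maximizer `max` meets first, which is one arbitrary tie-breaking rule among many.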

We present a simple proof that this algorithm (which takes exponential time, as it searches through all $\mathbf{x}\in\{0,1\}^{n}$) is in fact optimal in the number of traces used, up to an $O(n)$ factor blowup.

We also observe that in the average-case setting, where the source string is a uniformly random string from $\{0,1\}^{n}$, $\mathrm{MLE}$ is indeed optimal, without the $O(n)$ factor blowup (see Remark 2).

1.1 Our Contributions

The power of $k$-mer-based algorithms

Our first result shows that algorithms based on $k$-mer statistics can reconstruct a source string using $\exp(\widetilde{O}(n^{1/5}))$ many traces. This follows from the following theorem.

Theorem 1 (Implied by [Cha21b]).

Let $\mathbf{x},\mathbf{y}\in\{0,1\}^{n}$ be two arbitrary distinct strings, and let $K_{\mathbf{x}},K_{\mathbf{y}}$ be their $k$-mer density maps, respectively. Assuming $k=2n^{1/5}$, it holds that

$$
\left\lVert K_{\mathbf{x}}-K_{\mathbf{y}}\right\rVert_{\ell_{1}}\geq\exp\left(-O(n^{1/5}\log^{5}n)\right).
$$

Based on Theorem 1, the algorithm estimates $\hat{K}$ within an accuracy of $\varepsilon=\exp(-O(n^{1/5}\log^{5}n))$ and outputs the $\mathbf{x}$ that minimizes $\lVert\hat{K}-K_{\mathbf{x}}\rVert_{\ell_{1}}$. The cost of this $k$-mer-based algorithm is $\exp(O(n^{1/5}\log^{5}n))$.
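The post-processing step described above, choosing the candidate whose density map is closest to the estimate in $\ell_{1}$-distance, can be sketched as follows. The density maps are assumed to be given as flat lists of equal length; the function names and the toy numbers are purely illustrative and do not come from the paper.

```python
def l1_distance(u, v):
    """Sum of absolute coordinate-wise differences of two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def nearest_density_map(K_hat, candidate_maps):
    """Return the key of the candidate whose (flattened) density map is
    closest to the estimate K_hat in l1 distance."""
    return min(candidate_maps,
               key=lambda s: l1_distance(K_hat, candidate_maps[s]))

# Toy example with made-up vectors standing in for K_x and K_y.
candidate_maps = {"x": [1.0, 0.5, 0.0], "y": [1.0, 0.0, 0.5]}
K_hat = [0.9, 0.45, 0.1]
guess = nearest_density_map(K_hat, candidate_maps)
```

By the triangle inequality, if the estimate is within $\varepsilon$ of the true map and every other candidate's map is more than $2\varepsilon$ away from the true one, the nearest candidate is the correct string.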

Our main result regarding $k$-mer-based algorithms is the following theorem, which shows the tightness of the bound in Theorem 1.

Theorem 2.

Fix any $k\leq n^{1/5}$. Suppose $K_{\mathbf{x}}$ stands for the $k$-mer density map of $\mathbf{x}$. There exist distinct strings $\mathbf{x},\mathbf{y}\in\{0,1\}^{n}$ such that

$$
\left\lVert K_{\mathbf{x}}-K_{\mathbf{y}}\right\rVert_{\ell_{1}}\leq\exp\left(-\Omega(n^{1/5}\sqrt{\log n})\right).
$$

Hence, Theorem 2 implies that the cost of any $k$-mer-based algorithm for worst-case trace reconstruction is $\exp(\Omega(n^{1/5}\sqrt{\log n}))$.

Remark 1.

As one might expect, for k<ksuperscript𝑘𝑘k^{\prime}<kitalic_k start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT < italic_k the ksuperscript𝑘k^{\prime}italic_k start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT-mers usually contain less information than k𝑘kitalic_k-mers. To see this, observe that for a (k1)𝑘1(k-1)( italic_k - 1 )-mer w𝑤witalic_w, we have the following relation

𝟏{𝐱[j:j+k2]=w}=𝟏{𝐱[j1:j+k2]=0w}+𝟏{𝐱[j1:j+k2]=1w},1𝐱delimited-[]:𝑗𝑗𝑘2𝑤1𝐱delimited-[]:𝑗1𝑗𝑘20𝑤1𝐱delimited-[]:𝑗1𝑗𝑘21𝑤\displaystyle\mathbf{1}\Set{\mathbf{x}[j\mathrel{\mathop{:}}j+k-2]=w}=\mathbf{% 1}\Set{\mathbf{x}[j-1\mathrel{\mathop{:}}j+k-2]=0w}+\mathbf{1}\Set{\mathbf{x}[% j-1\mathrel{\mathop{:}}j+k-2]=1w},bold_1 { start_ARG bold_x [ italic_j : italic_j + italic_k - 2 ] = italic_w end_ARG } = bold_1 { start_ARG bold_x [ italic_j - 1 : italic_j + italic_k - 2 ] = 0 italic_w end_ARG } + bold_1 { start_ARG bold_x [ italic_j - 1 : italic_j + italic_k - 2 ] = 1 italic_w end_ARG } ,

provided that $j > 0$. The same also holds for $\mathbf{y}$. In fact, the strings $\mathbf{x}$ and $\mathbf{y}$ obtained via Theorem 2 share a common prefix of length at least $k$ (or one could prepend a prefix anyway), so $\mathbf{x}[0 : k'-1] = \mathbf{y}[0 : k'-1]$ for any $k' < k$, and one does not need to worry about the case $j = 0$. Plugging into the definition of $k$-mer density maps, we have

$$K_{w,\mathbf{x}}[i] - K_{w,\mathbf{y}}[i] = \left(K_{0w,\mathbf{x}}[i] - K_{0w,\mathbf{y}}[i]\right) + \left(K_{1w,\mathbf{x}}[i] - K_{1w,\mathbf{y}}[i]\right).$$

By induction, for any $k' < k$ we have

$$\sum_{w \in \{0,1\}^{k'}} \left\lvert K_{w,\mathbf{x}}[i] - K_{w,\mathbf{y}}[i] \right\rvert \leq \sum_{w \in \{0,1\}^{k'}} \sum_{u \in \{0,1\}^{k-k'}} \left\lvert K_{uw,\mathbf{x}}[i] - K_{uw,\mathbf{y}}[i] \right\rvert = \left\lVert K_{\mathbf{x}} - K_{\mathbf{y}} \right\rVert_{\ell_1}.$$

Therefore, the bound in Theorem 2 indeed covers all $k'$-mers for $k' \leq k$.
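The indicator relation at the start of Remark 1 can be checked mechanically. The following is a minimal Python sketch, assuming binary strings are represented as Python `str` objects over the characters `0` and `1`; the helper names `occurs_at` and `relation_holds` are illustrative and do not come from the paper.

```python
from itertools import product


def occurs_at(x, w, j):
    """Indicator 1{x[j : j+len(w)-1] = w}, using the text's inclusive indexing."""
    return int(j >= 0 and x[j:j + len(w)] == w)


def relation_holds(x, k):
    """Check, for every (k-1)-mer w and every position j > 0, that
    1{x[j:j+k-2] = w} = 1{x[j-1:j+k-2] = 0w} + 1{x[j-1:j+k-2] = 1w}."""
    for bits in product("01", repeat=k - 1):
        w = "".join(bits)
        # j ranges over positions where the (k-1)-mer fits inside x.
        for j in range(1, len(x) - (k - 1) + 1):
            lhs = occurs_at(x, w, j)
            rhs = occurs_at(x, "0" + w, j - 1) + occurs_at(x, "1" + w, j - 1)
            if lhs != rhs:
                return False
    return True
```

The loop starts at $j = 1$ because at $j = 0$ there is no preceding bit to extend $w$ with, which is exactly the boundary case the remark handles through the common prefix.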

We remark that the proof of Theorem 2 further implies that the analysis technique of [Cha21b] is essentially tight, in the sense that no better upper bound (up to $\log^{4.5} n$ factors in the exponent) can be obtained via his analysis. We include further details about this implication in Remark 3.

Maximum Likelihood Estimator: an optimal algorithm

We next turn to analyzing the performance of the MLE algorithm in the setting of trace reconstruction. Our main result essentially shows that if there is an algorithm for trace reconstruction that uses $T$ traces and succeeds with probability $3/4$, then the MLE algorithm using $O(nT)$ traces succeeds with probability $3/4$. Hence, given that the current upper bounds for the worst-case reconstruction problem are exponential in $n$, we may view the MLE as an optimal algorithm for trace reconstruction.

Theorem 3.

Suppose $\mathcal{D} = \{D_0, D_1, \dots, D_m\}$ is such that $d_{\mathrm{TV}}(D_0, D_i) \geq 1 - \varepsilon$ for any $1 \leq i \leq m$. Then we have

$$\Pr_{x \sim D_0}\left[\mathrm{MLE}(x; \mathcal{D}) = 0\right] \geq 1 - m\varepsilon.$$

We remark that the loss of a factor of $m$ in Theorem 3 is generally inevitable. Here is a simple example: let $D_0$ be the uniform distribution over $[m]$, and for $i = 1, 2, \dots, m$, let $D_i$ be the point distribution supported on $\{i\}$. We have $d_{\mathrm{TV}}(D_0, D_i) = ((m-1)/m + (1 - 1/m))/2 = 1 - 1/m$. However, $\Pr_{x \sim D_0}\left[\mathrm{MLE}(x; \mathcal{D}) = 0\right] = 0$.
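The example above is small enough to check by direct computation. Below is a minimal Python sketch, assuming distributions are stored as dictionaries mapping outcomes to exact `Fraction` probabilities; the function names and the choice $m = 8$ are illustrative only.

```python
from fractions import Fraction


def tv_distance(p, q):
    """Total variation distance between two finite distributions (dicts)."""
    support = set(p) | set(q)
    return sum(abs(p.get(s, 0) - q.get(s, 0)) for s in support) / 2


def build_family(m):
    """D_0 is uniform over {1, ..., m}; D_i (i >= 1) is the point mass on i."""
    family = [{s: Fraction(1, m) for s in range(1, m + 1)}]
    family += [{i: Fraction(1)} for i in range(1, m + 1)]
    return family


def mle(x, family):
    """Index of a distribution in the family maximizing the likelihood of x."""
    return max(range(len(family)), key=lambda i: family[i].get(x, 0))


m = 8  # illustrative size
family = build_family(m)
distances = [tv_distance(family[0], family[i]) for i in range(1, m + 1)]
# Every sample x drawn from D_0 lies in {1, ..., m}; record the MLE output for each.
mle_outputs = [mle(x, family) for x in range(1, m + 1)]
```

Here every entry of `distances` equals $1 - 1/m$, so Theorem 3 applies with $\varepsilon = 1/m$ and guarantees only success probability $1 - m\varepsilon = 0$, which matches the fact that `mle_outputs` never contains the index $0$.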

For a string $\mathbf{x} \in \{0,1\}^n$, let $D_{\mathbf{x}}$ denote the trace distribution of $\mathbf{x}$. Theorem 3 implies the following corollary, which shows that in some sense Maximum Likelihood Estimation is a universal algorithm for trace reconstruction.

Corollary 1.1.

Suppose $T$ traces are sufficient for worst-case trace reconstruction with a success rate $3/4$. Then for any $\varepsilon > 0$, Maximum Likelihood Estimation with $8\ln(1/\varepsilon) \cdot nT$ traces solves worst-case trace reconstruction with success rate $1 - \varepsilon$.

Corollary 1.1 incurs an $O(n)$ factor blowup in the sample complexity. While we currently do not know whether this blowup is necessary for trace reconstruction, the next result shows that it is inevitable for the more general “model estimation” problem.

Theorem 4.

For any integer $n \geq 1$, there is a set of distributions $\mathcal{D} = \{D_0, D_1, D_2, \dots, D_m\}$ over a common domain $\Omega$ of size $\lvert \Omega \rvert = m + n$, where $m = \binom{n}{\lfloor n/4 \rfloor} = 2^{\Theta(n)}$, satisfying the following conditions.

  1. There is a distinguisher $A$ which, given one sample $x \sim D_j$ for an unknown $j \in \{0, 1, \dots, m\}$, recovers $j$ with probability at least $2/3$. In other words, for all $j = 0, 1, \dots, m$,

    $$\Pr_{x \sim D_j}[A(x) = j] \geq 2/3.$$

  2. $\mathrm{MLE}$ fails to distinguish $D_0$ from other distributions with probability 1, even with $T = n/4$ samples. In other words,

    $$\Pr_{x_1, \dots, x_T \sim D_0}[\mathrm{MLE}(x_1, \dots, x_T; \mathcal{D}) = 0] = 0.$$

Remark 2.

Finally, we remark that in the average-case setting $\mathrm{MLE}$ is indeed optimal (with no $O(n)$ factor blowup in the number of traces). This is because maximizing the likelihood is equivalent to maximizing the posterior probability under the uniform prior distribution (which is optimal), as can be seen via the Bayes rule

$$\begin{aligned}
\mathcal{D}_{\mathbf{x}}(\widetilde{x}_1, \dots, \widetilde{x}_T) &= p(\mathbf{x} \mid \widetilde{x}_1, \dots, \widetilde{x}_T) \cdot \frac{\sum_{\mathbf{x}' \in \{0,1\}^n} p(\mathbf{x}') \cdot \mathcal{D}_{\mathbf{x}'}(\widetilde{x}_1, \dots, \widetilde{x}_T)}{p(\mathbf{x})} \\
&= p(\mathbf{x} \mid \widetilde{x}_1, \dots, \widetilde{x}_T) \cdot \sum_{\mathbf{x}' \in \{0,1\}^n} \mathcal{D}_{\mathbf{x}'}(\widetilde{x}_1, \dots, \widetilde{x}_T) \\
&= p(\mathbf{x} \mid \widetilde{x}_1, \dots, \widetilde{x}_T) \cdot f(\widetilde{x}_1, \dots, \widetilde{x}_T).
\end{aligned}$$

Therefore maximizing both sides with respect to $\mathbf{x}$ yields the same result.
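The equivalence in Remark 2 can be illustrated by brute force on very short strings. The sketch below is an assumption-laden toy: it enumerates all of $\{0,1\}^n$, so it is only feasible for tiny $n$; it uses the standard fact that the probability of a trace under the deletion channel is the number of its subsequence embeddings in the source times $(1-p)^{|\text{trace}|} p^{n - |\text{trace}|}$; the function names and example traces are hypothetical.

```python
from fractions import Fraction
from itertools import product


def count_embeddings(trace, source):
    """Number of ways `trace` occurs as a subsequence of `source`."""
    ways = [1] + [0] * len(trace)
    for bit in source:
        # Go right to left so each source position is used at most once.
        for t in range(len(trace), 0, -1):
            if trace[t - 1] == bit:
                ways[t] += ways[t - 1]
    return ways[len(trace)]


def likelihood(traces, source, p):
    """Probability of the given independent traces from `source`, deletion prob. p."""
    n = len(source)
    prob = Fraction(1)
    for tr in traces:
        prob *= count_embeddings(tr, source) * (1 - p) ** len(tr) * p ** (n - len(tr))
    return prob


def mle_and_map(traces, n, p):
    """Return the likelihood maximizer, the posterior maximizer, and the posterior."""
    candidates = list(product((0, 1), repeat=n))
    prior = Fraction(1, len(candidates))  # uniform prior over {0,1}^n
    lik = {x: likelihood(traces, x, p) for x in candidates}
    evidence = sum(lik[x] * prior for x in candidates)
    posterior = {x: lik[x] * prior / evidence for x in candidates}
    mle = max(candidates, key=lambda x: lik[x])
    map_ = max(candidates, key=lambda x: posterior[x])
    return mle, map_, posterior


# Hypothetical traces of an unknown 4-bit string.
traces = [(1, 0, 1), (0, 1), (1, 1, 0)]
mle_guess, map_guess, posterior = mle_and_map(traces, n=4, p=Fraction(1, 2))
```

Because the posterior is the likelihood multiplied by a positive quantity that does not depend on $\mathbf{x}$, the two maximizers coincide; exact `Fraction` arithmetic is used so that rounding cannot introduce spurious ties.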

1.2 Overview of the techniques

Lower bounds for $k$-mer-based algorithms

In recent developments on the trace reconstruction problem, the connection to various real and complex polynomials has been a recurring and intriguing theme [HMPW08, NP17, PZ17, HPP18, DOS19, CDL+21b, CDRV21, Cha21b, SB21, GSZ22, Rub23]. The starting point of these techniques is to design a set of statistics that can be easily estimated from the traces (e.g., mean traces), with the property that for different source strings the corresponding statistics are somewhat “far apart”. To establish this property, one key idea is to associate each source string $\mathbf{x}$ with a generating polynomial $P_{\mathbf{x}}$ whose coefficients are exactly the statistics of $\mathbf{x}$. Due to the structure of the deletion channel, in many cases this generating polynomial (under a change of variables) is identical to another polynomial $Q_{\mathbf{x}}$ that is much easier to get a handle on. For example, the coefficients of $Q_{\mathbf{x}}$ are usually 0/1, and they are easily determined from $\mathbf{x}$. To show that the statistics corresponding to $\mathbf{x}$ and $\mathbf{y}$ are far apart (say, in $\ell_1$-distance), it is sufficient to show that $\lvert Q_{\mathbf{x}}(w) - Q_{\mathbf{y}}(w) \rvert$ is large for an appropriate choice of $w$. This is the point where all sorts of analytical tools are ready to shine. For instance, the main technical result in [Cha21b] is a complex analytical result that says that a certain family of polynomials cannot be uniformly small over a sub-arc of the complex unit circle, which has applications beyond the trace reconstruction problem.

This analytical view of trace reconstruction can lead to a tight analysis of certain algorithms/statistics. The best example would be mean-based algorithms, for which a tight bound of $\exp(\Theta(n^{1/3}))$ traces is known to be sufficient and necessary for worst-case trace reconstruction [NP17, DOS19]. The tightness of the sample complexity is exactly due to the tightness of a complex analytical result by Borwein and Erdélyi [BE97]. Our lower bound for $k$-mer-based algorithms is obtained in a similar fashion, via establishing a complex analytical result complementary to that of [Cha21b] (see Lemma 3.1).

On the other hand, our argument takes a different approach than that of [BE97]. At a high level, both results use a Pigeonhole argument to show the existence of two univariate polynomials which are uniformly close over a sub-arc $\Gamma$ of the complex unit circle. The difference lies in the objects playing the role of “pigeons”. [BE97]’s argument can be viewed as two steps: (1) apply the Pigeonhole Principle to obtain two polynomials that have close evaluations over a discrete set of points in $\Gamma$, and (2) use a continuity argument to extend the closeness to the entire sub-arc. Here the roles of pigeons and holes are played by evaluation vectors, and Cartesian products of small intervals. Our approach considers the coordinates of a related polynomial in the Chebyshev basis, which play the roles of pigeons in place of the evaluation vector. The properties of Chebyshev polynomials allow us to get rid of the continuity argument. Instead, we complete the proof by leveraging rather standard tools from complex analysis (e.g., Theorem 5 and Theorem 6). We believe this approach has the advantage of being generalizable to multivariate polynomials over the product of sub-arcs $\Gamma = \Gamma_1 \times \dots \times \Gamma_m$ via multivariate Chebyshev series (see, e.g., [Mas80, Tre17]), whereas the same generalization seems to be tricky for the continuity argument.

Finally, the counting argument considers a special set of strings for which effectively only one $k$-mer contains meaningful information about the initial string. Since previous arguments did not exploit structural properties of the strings, this is another technical novelty of our proof.

Maximum Likelihood Estimation

Most of our results regarding Maximum Likelihood Estimation hold under the more general “model estimation” setting, where one is given a sample $x$ drawn from an unknown distribution $D \in \mathcal{D}$ and tries to recover $D$. Our main observation is that if such a distinguisher works in the worst case, then the distributions in $\mathcal{D}$ have large pairwise statistical distances. The maximization characterization of statistical distance, in conjunction with a union bound, implies that for a sample $x \sim D$ its likelihood is maximized by $D$ except with a small probability. The $O(n)$ factor loss in the sample complexity is essentially due to the union bound, and we show that this loss is tight in general by constructing a set of distributions which attains equality in the union bound.
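One way to spell out the union-bound step for the statement of Theorem 3 is sketched below. This is a reading of the outline above rather than the paper's own proof text: it assumes distributions over a countable domain, the sets $S_i$ are notation introduced here, and ties in the likelihood are conservatively counted as failures.

```latex
% Sketch only: S_i and the tie-breaking convention are assumptions made here.
\begin{align*}
S_i &\coloneqq \{x : D_0(x) > D_i(x)\}, \qquad 1 \leq i \leq m, \\
1 - \varepsilon \leq d_{\mathrm{TV}}(D_0, D_i) &= D_0(S_i) - D_i(S_i) \leq D_0(S_i), \\
\Pr_{x \sim D_0}\left[\mathrm{MLE}(x; \mathcal{D}) \neq 0\right]
  &\leq \Pr_{x \sim D_0}\Big[x \notin \bigcap_{i=1}^{m} S_i\Big]
  \leq \sum_{i=1}^{m} \big(1 - D_0(S_i)\big) \leq m\varepsilon.
\end{align*}
```

The second line uses the fact that the total variation distance is the maximum of $D_0(S) - D_i(S)$ over events $S$, attained at $S = S_i$; the last line is the union bound over the $m$ alternatives.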

1.3 Related work

The trace reconstruction problem was first introduced and studied by Levenshtein [Lev01b, Lev01a]. The original question asks how to recover a message that is sent multiple times through the same channel with random insertion/deletion errors. [BKKM04] and [HMPW08] formalized the problem in its current form, in which the channel only has random deletions. Their central motivation comes from computational biology, i.e., how to reconstruct a whole DNA sequence from multiple related subsequences. [CGMR20] and [BLS20] further extended the study to the “coded” version, in which the string to reconstruct is not arbitrary but a codeword from a code. A variant setting where the channel has memoryless replication insertions was studied by [CDRV21].

The average-case version was studied in [HMPW08, PZ17, MPV14, HPP18]. For this case, the best known lower bound on the number of traces is $\widetilde{\Omega}(\log^{5/2} n)$ [HL20, Cha21a]. Building on Chase’s upper bound for the worst case, [Rub23] improved the sample complexity upper bound to $\exp(\widetilde{O}(\log^{1/5} n))$ in the average-case model.

[CDL+21b] studied another variant of the problem, called the smooth variant. It is an intermediate model between the worst-case and the average-case models, where the initial string is an arbitrary string perturbed by replacing each coordinate by a uniformly random bit with some constant probability in $[0,1]$. [CDL+21b] provided an efficient reconstruction algorithm for this case. Other variants studied include trace reconstruction from the multiset of substrings [GM17, GM19], population recovery variants [BCF+19], matrix reconstruction and parameterized algorithms [KMMP21], circular trace reconstruction [NR21], reconstruction from $k$-decks [KR97, Sco97, DS03, MPV14], and coded trace reconstruction [CGMR20, BLS20].

[DRSR21] studied approximate trace reconstruction and showed efficient algorithms. [CDL+21a], [CDK21], and [CP21] further proved that if the source is a random string, then an approximate solution can be found with high probability using very few traces. Notice that approximate reconstructions imply distinguishers for pairs of strings with large edit distances. [MPV14, SB21, GSZ22] study the complexity of the problem parameterized by the Hamming/edit distance between the strings. [GSZ22] also shows that the problem of exhibiting explicit strings that are hard to distinguish for mean-based algorithms is equivalent to the Prouhet-Tarry-Escott problem, a difficult problem in number theory.

1.4 Organization

In Section 2 we prove Theorem 1, in Section 3 we prove our main result Theorem 2, and in Section 4 we prove Theorem 3.

2 $k$-mer-based algorithms: the upper bound

We prove Theorem 1 in this section.

Let us start with a definition that is essential for the study of $k$-mer-based algorithms.

Definition 4 ($k$-mer generating polynomial).

Let $\mathbf{x} \in \{0,1\}^n$ and $w \in \{0,1\}^k$. The $k$-mer generating polynomial $P_{w,\mathbf{x}}$ for string $\mathbf{x}$ and $k$-mer $w$ is the following degree-$(n-1)$ polynomial in $z$:

$$P_{w,\mathbf{x}}(z) \coloneqq \sum_{\ell=0}^{n-1} K_{w,\mathbf{x}}[\ell] \cdot z^{\ell}.$$

We have the following identity

$$\begin{aligned}
P_{w,\mathbf{x}}(z) &= \sum_{\ell=0}^{n-1} K_{w,\mathbf{x}}[\ell] \cdot z^{\ell} \\
&= \sum_{\ell=0}^{n-1} \left( \sum_{j=0}^{n-k} \binom{j}{\ell} (1-p)^{\ell} p^{j-\ell} \cdot \mathbf{1}\{\mathbf{x}[j : j+k-1] = w\} \right) z^{\ell} \\
&= \sum_{j=0}^{n-k} \mathbf{1}\{\mathbf{x}[j : j+k-1] = w\} \cdot \left(p + (1-p)z\right)^{j}.
\end{aligned}$$

The expression on the last line, under a change of variable $z_0 = p + (1-p)z$, is exactly the polynomial studied in [Cha21b].
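The identity above is an application of the binomial theorem, and it can be sanity-checked numerically. The following Python sketch takes the coefficient $K_{w,\mathbf{x}}[\ell]$ to be the inner sum on the second line of the identity (an assumption made here, since the definition of the density map lies outside this passage); the string, the $k$-mer, the deletion probability and the evaluation point are arbitrary illustrative choices.

```python
from math import comb


def kmer_indicator(x, w, j):
    """Indicator 1{x[j : j+k-1] = w} with k = len(w), inclusive indexing as in the text."""
    return 1 if x[j:j + len(w)] == w else 0


def kmer_density(x, w, ell, p):
    """Coefficient K_{w,x}[ell], read off the second line of the identity."""
    n, k = len(x), len(w)
    # Terms with j < ell vanish because binom(j, ell) = 0, so start at j = ell.
    return sum(
        comb(j, ell) * (1 - p) ** ell * p ** (j - ell) * kmer_indicator(x, w, j)
        for j in range(ell, n - k + 1)
    )


def poly_from_coefficients(x, w, z, p):
    """P_{w,x}(z) evaluated as sum over ell of K_{w,x}[ell] * z^ell."""
    return sum(kmer_density(x, w, ell, p) * z ** ell for ell in range(len(x)))


def poly_closed_form(x, w, z, p):
    """P_{w,x}(z) evaluated as sum over j of 1{x[j:j+k-1] = w} * (p + (1-p)z)^j."""
    n, k = len(x), len(w)
    return sum(
        kmer_indicator(x, w, j) * (p + (1 - p) * z) ** j for j in range(n - k + 1)
    )


# Illustrative inputs (not from the paper).
x, w, p, z = "0110100110", "01", 0.3, 0.4 + 0.7j
lhs = poly_from_coefficients(x, w, z, p)
rhs = poly_closed_form(x, w, z, p)
```

Up to floating-point error, `lhs` and `rhs` agree, reflecting that a sparse 0/1 polynomial in $z_0 = p + (1-p)z$ encodes the same information as the $k$-mer density map.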

Lemma 2.1.

[Cha21b, Proposition 6.3] For distinct $\mathbf{x}, \mathbf{y} \in \{0,1\}^n$, if $x_i = y_i$ for all $0 \leq i < 2n^{1/5} - 1$, then there are $w \in \{0,1\}^{2n^{1/5}}$ and $z_0 \in \{e^{i\theta} : \lvert\theta\rvert \leq n^{-2/5}\}$ such that

$$\left\lvert \sum_{j \geq 0} \left( \mathbf{1}\{\mathbf{x}[j : j+2n^{1/5}-1] = w\} - \mathbf{1}\{\mathbf{y}[j : j+2n^{1/5}-1] = w\} \right) z_0^{j} \right\rvert \geq \exp\left(-Cn^{1/5}\log^{5} n\right).$$

Here $C > 0$ is a constant depending only on the deletion probability $p$.

We will use Lemma 2.1 to show that the $\exp(\widetilde{O}(n^{1/5}))$ upper bound of [Cha21b] can be achieved by $k$-mer-based algorithms, rather than general algorithms based on $k$-bit statistics. Our main lower bound on the number of traces implied by Theorem 2 will follow by showing an upper bound on the LHS in the lemma above (see Lemma 3.1).

Remark 3.

We remark that the result of Chase is obtained by first considering a corresponding multivariate channel polynomial that encodes in its coefficients the $k$-bit statistics of the traces. The upper bound on the number of traces reduces to understanding the supremum of this polynomial over a certain region of the complex plane. The crucial element of the proof is the reduction to the existence of $w \in \{0,1\}^k$ and $z_0$ satisfying Lemma 2.1, by appropriately making the remaining variables take value $0$. We noticed that the resulting univariate polynomial is essentially the $k$-mer generating polynomial defined in Definition 4, with an extra factor of $(1-p)^k$. Our result in Lemma 3.1 implies that no tighter lower bound (up to polylogarithmic factors in the exponent) is possible for this univariate polynomial, showing that the analysis technique used in [Cha21b] cannot give a better upper bound on worst-case trace complexity.

2.1 An upper bound for $k$-mer-based algorithms

The proof of Theorem 1 mainly uses Lemma 2.1. We will also make use of the following result.

Lemma 2.2.

[BEK99, Theorem 5.1] There are absolute constants $c_1 > 0$ and $c_2 > 0$ such that

$$\lvert f(0) \rvert^{c_1/a} \leq \exp\left(\frac{c_2}{a}\right) \sup_{t \in [1-a, 1]} \lvert f(t) \rvert$$

for every analytic function $f$ on the open unit disk that satisfies $\lvert f(z) \rvert < (1 - \lvert z \rvert)^{-1}$ for $\lvert z \rvert < 1$, and $a \in (0,1]$.

Proof of Theorem 1.


The proof deals with two cases.

Case 1: $x_i = y_i$ for all $0 \leq i < 2n^{1/5} - 1$.

In this case, $\mathbf{x}$ and $\mathbf{y}$ satisfy the premise of Lemma 2.1. It follows that there exist $w \in \{0,1\}^{2n^{1/5}}$ and $z_0 = e^{i\theta}$, where $\lvert\theta\rvert \leq n^{-2/5}$, satisfying the bound

$$\left\lvert \sum_{j \geq 0} \left( \mathbf{1}\{\mathbf{x}[j : j+2n^{1/5}-1] = w\} - \mathbf{1}\{\mathbf{y}[j : j+2n^{1/5}-1] = w\} \right) z_0^{j} \right\rvert \geq \exp\left(-Cn^{1/5}\log^{5} n\right).$$

Here C>0𝐶0C>0italic_C > 0 is a constant depending only on the deletion probability p𝑝pitalic_p. Rewriting in terms of the k𝑘kitalic_k-mer generating polynomials, we have

|Pw,𝐱(z0p1p)Pw,𝐲(z0p1p)|exp(Cn1/5log5n).subscript𝑃𝑤𝐱subscript𝑧0𝑝1𝑝subscript𝑃𝑤𝐲subscript𝑧0𝑝1𝑝𝐶superscript𝑛15superscript5𝑛\displaystyle\mathinner{\!\left\lvert P_{w,\mathbf{x}}\left(\frac{z_{0}-p}{1-p% }\right)-P_{w,\mathbf{y}}\left(\frac{z_{0}-p}{1-p}\right)\right\rvert}\geq\exp% \left(-Cn^{1/5}\log^{5}n\right).start_ATOM | italic_P start_POSTSUBSCRIPT italic_w , bold_x end_POSTSUBSCRIPT ( divide start_ARG italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - italic_p end_ARG start_ARG 1 - italic_p end_ARG ) - italic_P start_POSTSUBSCRIPT italic_w , bold_y end_POSTSUBSCRIPT ( divide start_ARG italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - italic_p end_ARG start_ARG 1 - italic_p end_ARG ) | end_ATOM ≥ roman_exp ( - italic_C italic_n start_POSTSUPERSCRIPT 1 / 5 end_POSTSUPERSCRIPT roman_log start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT italic_n ) . (1)

By the reverse triangle inequality and $|z_0| = 1$, we have $|z_0 - p| / |1-p| \ge \bigl| |z_0| - p \bigr| / |1-p| = 1$. We also have the following upper bounds:

\[
\begin{aligned}
\left|\frac{z_0-p}{1-p}\right|^{2} &= \frac{(\cos\theta-p)^2+\sin^2\theta}{(1-p)^2} = \frac{1-2p\cos\theta+p^2}{(1-p)^2} = 1+\frac{2p(1-\cos\theta)}{(1-p)^2} \\
&= 1+\frac{4p\sin^2\frac{\theta}{2}}{(1-p)^2} \le 1+\frac{p\theta^2}{(1-p)^2} \le 1+\frac{p}{(1-p)^2}\cdot n^{-4/5}, \\
\left|\frac{z_0-p}{1-p}\right|^{n} &\le \left(1+\frac{p}{(1-p)^2}\cdot n^{-4/5}\right)^{n/2} \le \exp\left(\frac{p}{(1-p)^2}\cdot n^{-4/5}\cdot\frac{n}{2}\right) \\
&= \exp\left(\frac{p}{2(1-p)^2}\cdot n^{1/5}\right).
\end{aligned}
\]
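The two-sided estimate on $|(z_0-p)/(1-p)|^n$ can be sanity-checked numerically. The sketch below is only an illustration: the function names and the sampled values of $p$ and $n$ are assumptions chosen for the example and do not come from the paper.

```python
import cmath
import math


def modulus_power(p: float, theta: float, n: int) -> float:
    """Return |(e^{i*theta} - p) / (1 - p)|^n."""
    z0 = cmath.exp(1j * theta)
    return abs((z0 - p) / (1 - p)) ** n


def upper_bound(p: float, n: int) -> float:
    """Return exp(p / (2 (1-p)^2) * n^{1/5}), the bound derived above."""
    return math.exp(p / (2 * (1 - p) ** 2) * n ** 0.2)


def check_bounds(p: float, n: int, samples: int = 201) -> bool:
    """Check 1 <= |(z0-p)/(1-p)|^n <= upper_bound on a grid of |theta| <= n^{-2/5}."""
    theta_max = n ** (-0.4)
    bound = upper_bound(p, n)
    for s in range(samples):
        theta = -theta_max + 2 * theta_max * s / (samples - 1)
        value = modulus_power(p, theta, n)
        # Small relative tolerance to absorb floating-point rounding.
        if not (1 - 1e-9 <= value <= bound * (1 + 1e-9)):
            return False
    return True


if __name__ == "__main__":
    for p, n in [(0.1, 1000), (0.5, 10**5), (0.9, 10**4)]:
        print(p, n, check_bounds(p, n))
```

A grid check of this kind does not replace the proof; it only confirms that the chain of inequalities is plausible for the sampled parameters.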

From here we can apply the triangle inequality and conclude that

\[
\begin{aligned}
\left\| K_{\mathbf{x}} - K_{\mathbf{y}} \right\|_{\ell_1} &\ge \sum_{\ell=0}^{n-1} \left| K_{w,\mathbf{x}}[\ell] - K_{w,\mathbf{y}}[\ell] \right| \\
&\ge \left|\frac{z_0-p}{1-p}\right|^{-n} \cdot \left| \sum_{\ell=0}^{n-1} \left( K_{w,\mathbf{x}}[\ell] - K_{w,\mathbf{y}}[\ell] \right) \cdot \left(\frac{z_0-p}{1-p}\right)^{\ell} \right| \\
&\ge \exp\left(-\frac{p}{2(1-p)^2}\cdot n^{1/5} - C n^{1/5}\log^{5} n\right) \\
&\ge \exp\left(-C' n^{1/5}\log^{5} n\right).
\end{aligned}
\]

Here $C' = p(1-p)^{-2}/2 + C$ is a constant depending only on the deletion probability $p$.

Case 2: $x_i \ne y_i$ for some $0 \le i < 2n^{1/5} - 1$, i.e., $\mathbf{x}[0 : 2n^{1/5}-1] \ne \mathbf{y}[0 : 2n^{1/5}-1]$.

In this case, we take $w = \mathbf{x}[0 : 2n^{1/5}-1]$ and show the much stronger bound

\[
\sup_{z \colon |z| \le 1} \left| P_{w,\mathbf{x}}(z) - P_{w,\mathbf{y}}(z) \right| > C'', \tag{2}
\]

where $C'' > 0$ is a constant depending only on $p$ (hence certainly greater than $\exp(-\widetilde{O}(n^{1/5}))$). As in Case 1, applying the triangle inequality to Eq. 2 gives the theorem.

To prove Eq. 2, we let

\[
Q(z_0) = \sum_{j \ge 0} \left( \mathbf{1}\{\mathbf{x}[j : j+2n^{1/5}-1] = w\} - \mathbf{1}\{\mathbf{y}[j : j+2n^{1/5}-1] = w\} \right) z_0^{j},
\]

so that $Q(p+(1-p)z) = P_{w,\mathbf{x}}(z) - P_{w,\mathbf{y}}(z)$. Under our choice of $w$, the length-$2n^{1/5}$ prefix of $\mathbf{x}$ equals $w$ while that of $\mathbf{y}$ does not, so the constant term of $Q$ equals 1, i.e., $Q(0) = 1$.
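To make the relation between $Q$ and the $k$-mer generating polynomials concrete, here is a minimal sketch that builds the coefficients of $Q$ from two short strings and evaluates $P_{w,\mathbf{x}}(z) - P_{w,\mathbf{y}}(z)$ as $Q(p+(1-p)z)$. The helper names and the toy strings are hypothetical and serve only as an illustration; note that Python slices are half-open, so `x[j:j + k]` corresponds to the paper's inclusive window $\mathbf{x}[j : j+k-1]$.

```python
def kmer_indicator(x: str, w: str) -> list[int]:
    """Indicator sequence: entry j is 1 iff the length-k window of x at j equals w."""
    k = len(w)
    return [1 if x[j:j + k] == w else 0 for j in range(len(x) - k + 1)]


def q_coefficients(x: str, y: str, w: str) -> list[int]:
    """Coefficients of Q: difference of the indicator sequences of x and y.

    Assumes x and y have the same length.
    """
    return [a - b for a, b in zip(kmer_indicator(x, w), kmer_indicator(y, w))]


def evaluate(coeffs, z0):
    """Evaluate sum_j coeffs[j] * z0**j with Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * z0 + c
    return acc


def p_difference(x: str, y: str, w: str, p: float, z: complex) -> complex:
    """P_{w,x}(z) - P_{w,y}(z), computed as Q(p + (1 - p) z)."""
    return evaluate(q_coefficients(x, y, w), p + (1 - p) * z)


if __name__ == "__main__":
    x, y = "0110100", "0100110"
    w = x[:3]  # prefix of x; the prefix of y differs, as in Case 2
    print(q_coefficients(x, y, w))
    print(p_difference(x, y, w, p=0.5, z=-1.0))
```

With $p = 1/2$ and $z = -1$ the argument $p + (1-p)z$ is exactly 0, so the last call evaluates $Q(0)$, the constant term used in the argument that follows.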

If $p \in (0, 1/2]$, the closed disk $B(p; 1-p) = \{p + (1-p)z \colon |z| \le 1\}$ contains the point 0. Therefore

\[
\sup_{z \colon |z| \le 1} \left| P_{w,\mathbf{x}}(z) - P_{w,\mathbf{y}}(z) \right| = \sup_{z_0 \in B(p;1-p)} |Q(z_0)| \ge |Q(0)| = 1.
\]

We are left with the case $p \in (1/2, 1)$. Since $Q$ is a polynomial with coefficients absolutely bounded by 1, we can apply Lemma 2.2 with $a = 2(1-p) \in (0,1)$ and obtain

\[
\sup_{t_0 \in [1-a,\,1]} |Q(t_0)| \ge \exp\left(-\frac{c_1}{a}\right) \cdot |Q(0)|^{c_2/a} = \exp\left(-\frac{c_1}{a}\right)
\]

for constants $c_1, c_2 > 0$. Denoting $t = (t_0 - p)/(1-p)$, we have $t \in [-1, 1]$ when $t_0 \in [1-a, 1]$. In particular, $t$ is inside the closed unit disk $B(0; 1)$. Therefore

\[
\sup_{z \colon |z| \le 1} \left| P_{w,\mathbf{x}}(z) - P_{w,\mathbf{y}}(z) \right| \ge \sup_{t \in [-1,1]} \left| P_{w,\mathbf{x}}(t) - P_{w,\mathbf{y}}(t) \right| = \sup_{t_0 \in [1-a,\,1]} |Q(t_0)| \ge \exp\left(-\frac{c_1}{a}\right).
\]

To conclude, we can take $C'' = \min\{1, \exp(-c_1 (1-p)^{-1}/2)\}$. ∎
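The two geometric facts used in Case 2 (the disk $B(p;1-p)$ contains 0 exactly when $p \le 1/2$, and the substitution $t = (t_0-p)/(1-p)$ maps $[1-a, 1]$ onto $[-1, 1]$ for $a = 2(1-p)$) are easy to confirm numerically. The following sketch uses hypothetical helper names and sample values of $p$ chosen for illustration.

```python
def disk_contains_origin(p: float) -> bool:
    """True iff 0 lies in the closed disk centered at p with radius 1 - p."""
    return abs(0 - p) <= 1 - p


def to_unit_interval(t0: float, p: float) -> float:
    """The substitution t = (t0 - p) / (1 - p)."""
    return (t0 - p) / (1 - p)


def mapped_endpoints(p: float) -> tuple[float, float]:
    """Images of the endpoints of [1 - a, 1] under the substitution, with a = 2(1 - p)."""
    a = 2 * (1 - p)
    return to_unit_interval(1 - a, p), to_unit_interval(1.0, p)


if __name__ == "__main__":
    for p in (0.3, 0.5, 0.7):
        print(p, disk_contains_origin(p))
    print(mapped_endpoints(0.75))
```

Because the substitution is affine and increasing in $t_0$, checking the two endpoints suffices to see that the whole interval lands in $[-1, 1]$.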

3 A lower bound for $k$-mer based algorithms: Proof of Theorem 2

We prove Theorem 2 in this section. The proof is based on the following lemma, which we will prove shortly.

Lemma 3.1.

There exist $\mathbf{x}, \mathbf{y} \in \{0,1\}^{n}$ such that for any $k$-mer $w$, it holds that

\[
\sup_{z \colon |z| = 1} \left| P_{w,\mathbf{x}}(z) - P_{w,\mathbf{y}}(z) \right| \le 2^{-c n^{1/5} \sqrt{\log n}}.
\]
Proof of Theorem 2 using Lemma 3.1.

We can extract $K_{w,\mathbf{x}}[\ell] - K_{w,\mathbf{y}}[\ell]$ by the contour integral (cf. [Lan13, §4, Theorem 2.1])

\[
K_{w,\mathbf{x}}[\ell] - K_{w,\mathbf{y}}[\ell] = \frac{1}{2\pi i} \int_{|z| = 1} \left( P_{w,\mathbf{x}}(z) - P_{w,\mathbf{y}}(z) \right) \cdot z^{-\ell-1} \, \mathrm{d}z.
\]

Therefore

\[
\left| K_{w,\mathbf{x}}[\ell] - K_{w,\mathbf{y}}[\ell] \right| \le \frac{1}{2\pi} \int_{|z| = 1} \left| P_{w,\mathbf{x}}(z) - P_{w,\mathbf{y}}(z) \right| \cdot |z|^{-\ell-1} \cdot |\mathrm{d}z| \le 2^{-c n^{1/5} \sqrt{\log n}}.
\]
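The coefficient-extraction step can be illustrated with a discretized version of the contour integral. The sketch below is an assumption-laden illustration, not the paper's procedure: it replaces the integral over $|z| = 1$ by an $N$-point average over roots of unity, which recovers the coefficient of a polynomial of degree less than $N$ up to rounding. The function names and the toy polynomial are hypothetical.

```python
import cmath


def evaluate(coeffs, z):
    """Evaluate sum_j coeffs[j] * z**j with Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * z + c
    return acc


def coefficient_via_contour(coeffs, ell: int, num_points: int) -> complex:
    """Approximate (1 / (2 pi i)) * integral over |z|=1 of P(z) z^{-ell-1} dz.

    With z = e^{i t}, dz = i z dt, so the integral is the average of
    P(z) * z^{-ell} over the circle; here it is averaged over num_points
    equally spaced points (assumes num_points > degree of P).
    """
    total = 0
    for m in range(num_points):
        z = cmath.exp(2j * cmath.pi * m / num_points)
        total += evaluate(coeffs, z) * z ** (-ell)
    return total / num_points


def sup_on_grid(coeffs, num_points: int) -> float:
    """Maximum of |P(z)| over the same grid of points on the unit circle."""
    return max(
        abs(evaluate(coeffs, cmath.exp(2j * cmath.pi * m / num_points)))
        for m in range(num_points)
    )


if __name__ == "__main__":
    coeffs = [1, 0, 0, -1, 0]  # toy difference polynomial
    for ell in range(len(coeffs)):
        print(ell, coefficient_via_contour(coeffs, ell, num_points=8))
```

Each recovered coefficient is an average of values of modulus at most $\sup_{|z|=1}|P(z)|$, which is the discrete analogue of the inequality displayed above.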

We stress that the bound holds for any $\ell \in [n]$ and $k$-mer $w$. Note that for any fixed $\ell$, there are at most $n-k+1$ different $k$-mers $w$ for which $K_{w,\mathbf{x}}[\ell] > 0$. Namely, if $w \notin \{\mathbf{x}[j : j+k-1] \colon 0 \le j \le n-k\}$ then $K_{w,\mathbf{x}}[\ell] = 0$. It follows that

\[
\left\| K_{\mathbf{x}} - K_{\mathbf{y}} \right\|_{\ell_1} = \sum_{\ell=0}^{n-1} \sum_{w} \left| K_{w,\mathbf{x}}[\ell] - K_{w,\mathbf{y}}[\ell] \right| \le n \cdot 2(n-k+1) \cdot 2^{-c n^{1/5} \sqrt{\log n}} \le 2^{-c' n^{1/5} \sqrt{\log n}}.
\]
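The last inequality absorbs the polynomial counting factor $n \cdot 2(n-k+1)$ into the exponent, which is possible once $n$ is large enough. The sketch below compares the two sides on a $\log_2$ scale. The values $c = 1$ and $c' = c/2$ and the use of base-2 logarithms are assumptions made purely for illustration; the paper does not specify these constants here.

```python
import math


def log2_counting_factor(n: int, k: int) -> float:
    """log2 of n * 2 * (n - k + 1), the number of (position, k-mer) pairs counted."""
    return math.log2(n * 2 * (n - k + 1))


def log2_per_entry_bound(n: int, c: float) -> float:
    """log2 of 2^{-c n^{1/5} sqrt(log n)}, with log taken base 2 (an assumption)."""
    return -c * n ** 0.2 * math.sqrt(math.log2(n))


def absorbed(n: int, k: int, c: float, c_prime: float) -> bool:
    """True iff n * 2(n-k+1) * 2^{-c ...} <= 2^{-c_prime ...} for these parameters."""
    lhs = log2_counting_factor(n, k) + log2_per_entry_bound(n, c)
    return lhs <= log2_per_entry_bound(n, c_prime)


if __name__ == "__main__":
    # Hypothetical constants: c = 1, c' = 0.5.
    print(absorbed(2**10, 4, 1.0, 0.5))
    print(absorbed(2**25, 32, 1.0, 0.5))
```

For these illustrative constants the inequality fails at $n = 2^{10}$ and holds at $n = 2^{25}$, consistent with the bound being an asymptotic statement.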

Next, we prove Lemma 3.1 assuming the following result, which is our main technical lemma.

Lemma 3.2.

Fix any $k \le L^{1/3}$. There exist distinct $\mathbf{x}, \mathbf{y} \in \{0,1\}^{L}$, both starting with a run of 0s of length $L^{1/3} - 1$, such that for any $k$-mer $w$, it holds that

\[
\sup_{\theta \colon |\theta| \le L^{-2/3} \log^{1/4} L} \left| P_{w,\mathbf{x}}(e^{i\theta}) - P_{w,\mathbf{y}}(e^{i\theta}) \right| \le 2^{-L^{1/3} \sqrt{\log L}/20}.
\]
Proof of Lemma 3.1 using Lemma 3.2.

Let $\beta \ge 3/5$ be a parameter to be decided later. Denote $L \coloneqq n^{\beta}$. We have $k \le n^{1/5} = L^{1/(5\beta)} \le L^{1/3}$, so that the premise of Lemma 3.2 is satisfied. Therefore, there exist distinct $\mathbf{x}', \mathbf{y}' \in \{0,1\}^{L}$, both starting with a run of 0s of length $L^{1/3} - 1$, such that for any $k$-mer $w$, it holds that

\[
\sup_{\theta \colon |\theta| \le L^{-2/3} \log^{1/4} L} \left| P_{w,\mathbf{x}'}(e^{i\theta}) - P_{w,\mathbf{y}'}(e^{i\theta}) \right| \le 2^{-L^{1/3} \sqrt{\log L}/20}. \tag{3}
\]

Let $\mathbf{x} = 0^{n-L}\mathbf{x}'$ and $\mathbf{y} = 0^{n-L}\mathbf{y}'$. Since $k \le L^{1/3}$, by construction we have $\mathbf{x}[j : j+k-1] = \mathbf{y}[j : j+k-1]$ for all $0 \le j \le n-L$. Therefore, for any $k$-mer $w$ we have

\[
\begin{aligned}
& P_{w,\mathbf{x}}(e^{i\theta}) - P_{w,\mathbf{y}}(e^{i\theta}) \\
={}& \sum_{j=0}^{n-k} \left( \mathbf{1}\{\mathbf{x}[j : j+k-1] = w\} - \mathbf{1}\{\mathbf{y}[j : j+k-1] = w\} \right) (p + q e^{i\theta})^{j} \\
={}& \left(p + q e^{i\theta}\right)^{n-L} \cdot \sum_{j=n-L}^{n-k} \left( \mathbf{1}\{\mathbf{x}[j : j+k-1] = w\} - \mathbf{1}\{\mathbf{y}[j : j+k-1] = w\} \right) (p + q e^{i\theta})^{j-(n-L)} \\
={}& \left(p + q e^{i\theta}\right)^{n-L} \cdot \sum_{j=0}^{L-k} \left( \mathbf{1}\{\mathbf{x}'[j : j+k-1] = w\} - \mathbf{1}\{\mathbf{y}'[j : j+k-1] = w\} \right) (p + q e^{i\theta})^{j} \\
={}& \left(p + q e^{i\theta}\right)^{n-L} \cdot \left( P_{w,\mathbf{x}'}(e^{i\theta}) - P_{w,\mathbf{y}'}(e^{i\theta}) \right).
\end{aligned}
\]

Here $q=1-p$. When $|\theta|$ is large, we can upper bound the supremum as

\begin{align*}
\sup_{\theta\colon|\theta|>L^{-2/3}\log^{1/4}L}\left|P_{w,\mathbf{x}}(e^{i\theta})-P_{w,\mathbf{y}}(e^{i\theta})\right|
&= \sup_{\theta\colon|\theta|>L^{-2/3}\log^{1/4}L}\left|p+qe^{i\theta}\right|^{n-L}\cdot\left|P_{w,\mathbf{x}'}(e^{i\theta})-P_{w,\mathbf{y}'}(e^{i\theta})\right| \\
&\leq \left(1-c_1L^{-4/3}\log^{1/2}L\right)^{n-L}\cdot\sup_{\theta\colon|\theta|>L^{-2/3}\log^{1/4}L}\left|P_{w,\mathbf{x}'}(e^{i\theta})-P_{w,\mathbf{y}'}(e^{i\theta})\right| \\
&\leq \exp\left(-c_1(n-L)L^{-4/3}\log^{1/2}L\right)\cdot(L-k+1) \\
&\leq \exp_2\left(-c_2n^{1-4\beta/3}\log^{1/2}n\right).
\end{align*}

Here the first inequality is due to $|p+qe^{i\theta}|\leq 1-c_1a^2$ for some constant $c_1$ (depending on $p$) when $|\theta|\geq a$. When $|\theta|$ is small, this is taken care of by Eq. 3:

\begin{align*}
\sup_{\theta\colon|\theta|\leq L^{-2/3}\log^{1/4}L}\left|P_{w,\mathbf{x}}(e^{i\theta})-P_{w,\mathbf{y}}(e^{i\theta})\right|
&\leq \sup_{\theta\colon|\theta|\leq L^{-2/3}\log^{1/4}L}\left|P_{w,\mathbf{x}'}(e^{i\theta})-P_{w,\mathbf{y}'}(e^{i\theta})\right| \\
&\leq \exp_2\left(-L^{1/3}\sqrt{\log L}/20\right) \\
&\leq \exp_2\left(-c_3n^{\beta/3}\log^{1/2}n\right).
\end{align*}

Finally, the value of $\beta$ is determined by balancing the two cases. Namely, we let $1-4\beta/3=\beta/3$, or $\beta=3/5$, which gives the bound $2^{-cn^{1/5}\sqrt{\log n}}$ for both cases. Here $c=\min\{c_2,c_3\}$. ∎
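The two ingredients of this proof can be sanity-checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the constant $c_1 = 2pq/\pi^2$ is an assumption of this sketch (it follows from $|p+qe^{i\theta}|^2 = 1-4pq\sin^2(\theta/2)$ together with $\sin(\theta/2)\geq\theta/\pi$ on $[0,\pi]$), not necessarily the constant the authors have in mind.

```python
import math
from fractions import Fraction


def contraction_modulus(p: float, theta: float) -> float:
    """Return |p + q e^{i theta}| with q = 1 - p."""
    q = 1.0 - p
    return abs(p + q * complex(math.cos(theta), math.sin(theta)))


def contraction_bound_holds(p: float, a: float, samples: int = 2000) -> bool:
    """Check |p + q e^{i theta}| <= 1 - c1 * a^2 on a grid of a <= theta <= pi.

    c1 = 2pq / pi^2 is an illustrative choice made for this sketch.
    """
    q = 1.0 - p
    c1 = 2.0 * p * q / math.pi ** 2
    for i in range(samples + 1):
        theta = a + (math.pi - a) * i / samples
        # Small tolerance absorbs floating-point rounding.
        if contraction_modulus(p, theta) > 1.0 - c1 * a * a + 1e-12:
            return False
    return True


def balanced_beta() -> Fraction:
    """Solve 1 - 4*beta/3 = beta/3 exactly, i.e. 1 = (4/3 + 1/3) * beta."""
    return Fraction(1) / (Fraction(4, 3) + Fraction(1, 3))


beta = balanced_beta()
print(beta, 1 - Fraction(4, 3) * beta, beta / 3)  # 3/5 1/5 1/5
```

Both exponents equal $1/5$ at $\beta=3/5$, matching the $n^{1/5}$ in the final bound.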

It remains to prove Lemma 3.2, which we do after some helpful preliminaries from complex analysis.

3.1 Some helpful results in complex analysis

In this section, we introduce some results in complex analysis, which will be useful for proving Lemma 3.2.

Let $T_d(x)$ denote the $d$-th Chebyshev polynomial, i.e., a degree-$d$ polynomial such that $T_d(\cos\theta)=\cos(d\theta)$. Clearly, $T_d(x)\in[-1,1]$ for $x\in[-1,1]$. If a function $f(z)$ is analytic on $[-1,1]$, it has a converging Chebyshev expansion

\[
f(z)=\sum_{d=0}^{\infty}a_d\cdot T_d(z),\quad z\in[-1,1].
\]

Here the $a_d$'s are the Chebyshev coefficients, and they can be extracted by the following integral

\[
a_d=\frac{1}{\pi}\int_0^{2\pi}f(\cos\theta)\cos(d\theta)\,\mathrm{d}\theta,\quad d\geq 1,
\]

where $\pi$ is replaced by $2\pi$ for $d=0$. This immediately implies a uniform upper bound on Chebyshev coefficients.

Proposition 1.

For all $d\geq 0$, $|a_d|\leq 2\sup_{x\in[-1,1]}|f(x)|$.
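To make the coefficient formula concrete, the following sketch approximates the integral by an equispaced sum in $\theta$ and checks Proposition 1 on an example. The helper names (`chebyshev_coefficients`, `chebyshev_sum`), the node count, and the test function $f=\exp$ are choices made here for illustration and do not come from the paper.

```python
import math
from typing import Callable, List


def chebyshev_coefficients(f: Callable[[float], float], count: int,
                           nodes: int = 256) -> List[float]:
    """Approximate a_0, ..., a_{count-1} via an equispaced sum over [0, 2*pi)."""
    thetas = [2.0 * math.pi * j / nodes for j in range(nodes)]
    samples = [f(math.cos(t)) for t in thetas]
    coeffs = []
    for d in range(count):
        total = sum(s * math.cos(d * t) for s, t in zip(samples, thetas))
        # (1/pi) * (2*pi/nodes) = 2/nodes; pi becomes 2*pi when d = 0.
        scale = 1.0 / nodes if d == 0 else 2.0 / nodes
        coeffs.append(scale * total)
    return coeffs


def chebyshev_sum(coeffs: List[float], x: float) -> float:
    """Evaluate sum_d a_d T_d(x) on [-1, 1] using T_d(cos t) = cos(d t)."""
    theta = math.acos(x)
    return sum(a * math.cos(d * theta) for d, a in enumerate(coeffs))


coeffs = chebyshev_coefficients(math.exp, 20)
# Proposition 1: every coefficient is at most 2 * sup |f| = 2e on [-1, 1].
print(all(abs(a) <= 2 * math.e for a in coeffs))
# The truncated expansion should reproduce f closely on [-1, 1].
print(abs(chebyshev_sum(coeffs, 0.5) - math.exp(0.5)) < 1e-12)
```

The equispaced sum is a natural fit here because the integrand is smooth and $2\pi$-periodic in $\theta$.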

In fact, if $f$ is analytically continuable to a larger region, much better bounds can be obtained. For that we need the notion of a Bernstein ellipse.

Definition 5 (Bernstein Ellipse).

Given $\rho\geq 1$, the boundary of the Bernstein Ellipse is defined as

\[
\partial E_{\rho}\coloneqq\left\{\frac{u+u^{-1}}{2}\colon u=\rho e^{i\theta},\ \theta\in[0,2\pi)\right\}.
\]

The Bernstein Ellipse $E_{\rho}$ has the foci at $\pm 1$ with the major and minor semi-axes given by $(\rho+\rho^{-1})/2$ and $(\rho-\rho^{-1})/2$, respectively. When $\rho=1$, $E_{\rho}$ coincides with the interval $[-1,1]$ on the real line. For our purpose, we will also be working with affine transformations of $E_{\rho}$. More precisely, for $a\in[0,1/8]$ we denote by $\widetilde{E}_{a,\rho}$ (the interior of) the following ellipse

\[
\left\{(1-4a)+4a\cdot\frac{u+u^{-1}}{2}\colon u=\rho e^{i\theta},\ \theta\in[0,2\pi)\right\}.
\]

Thus, $\widetilde{E}_{a,\rho}$ can be equivalently defined as

\[
\left\{z\colon|z-(1-8a)|+|z-1|\leq 8a+4a(\rho-1)^2/\rho\right\}.
\]

Below are some useful properties of $\widetilde{E}_{a,\rho}$.

Proposition 2.

The following statements hold.

  1. Let $z\in\partial\widetilde{E}_{a,\rho}$. Then $|z|\leq 1+2a(\rho-1)^2/\rho$.

  2. $\widetilde{E}_{a,\rho}$ contains a disk centered at $1$ with radius $2a(\rho-1)^2/\rho$.

Proof.

Item (1): Writing $z=(1-4a)+4a(u+u^{-1})/2$ where $u=\rho e^{i\theta}$, we have

\begin{align*}
|z|^2 &= \left(1-4a+2a\rho\cos\theta+2a\rho^{-1}\cos\theta\right)^2+\left(2a\rho\sin\theta-2a\rho^{-1}\sin\theta\right)^2 \\
&= (1-4a)^2+2(1-4a)\cdot 2a(\rho+\rho^{-1})\cos\theta+\left(2a(\rho+\rho^{-1})\right)^2\cos^2\theta+\left(2a(\rho-\rho^{-1})\right)^2\sin^2\theta \\
&\leq (1-4a)^2+2(1-4a)\cdot 2a(\rho+\rho^{-1})+\left(2a(\rho+\rho^{-1})\right)^2-4a^2\sin^2\theta\left((\rho+\rho^{-1})^2-(\rho-\rho^{-1})^2\right) \\
&= \left((1-4a)+2a(\rho+\rho^{-1})\right)^2-16a^2\sin^2\theta \\
&\leq \left(1+2a(\rho+\rho^{-1}-2)\right)^2.
\end{align*}

Therefore $|z|\leq 1+2a(\rho+\rho^{-1}-2)=1+2a(\rho-1)^2/\rho$.

Item (2): Let $z$ be such that $|z-1|\leq 2a(\rho-1)^2/\rho$. We have

\[
|z-(1-8a)|+|z-1|\leq 8a+2|z-1|\leq 8a+4a(\rho-1)^2/\rho.
\]

This implies $z\in\widetilde{E}_{a,\rho}$. ∎
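Both items of Proposition 2, as well as the focal-sum description of $\widetilde{E}_{a,\rho}$, can be spot-checked by sampling. This is a numerical sanity check under floating-point tolerance, not a proof; all function names and the sample parameters are hypothetical.

```python
import cmath
import math


def boundary_points(a: float, rho: float, samples: int = 720):
    """Sample (1-4a) + 4a(u + 1/u)/2 for u on the circle |u| = rho."""
    points = []
    for s in range(samples):
        u = cmath.rect(rho, 2.0 * math.pi * s / samples)
        points.append((1 - 4 * a) + 4 * a * (u + 1 / u) / 2)
    return points


def margin(a: float, rho: float) -> float:
    """The quantity 2a(rho-1)^2/rho that appears in both items."""
    return 2 * a * (rho - 1) ** 2 / rho


def focal_sum(z: complex, a: float) -> float:
    """Sum of the distances from z to the two foci 1-8a and 1."""
    return abs(z - (1 - 8 * a)) + abs(z - 1)


def item_1_holds(a: float, rho: float, tol: float = 1e-12) -> bool:
    """Every sampled boundary point has modulus at most 1 + margin."""
    return all(abs(z) <= 1 + margin(a, rho) + tol
               for z in boundary_points(a, rho))


def item_2_holds(a: float, rho: float, samples: int = 720,
                 tol: float = 1e-12) -> bool:
    """Sampled points on the circle |z - 1| = margin satisfy the focal-sum test."""
    r = margin(a, rho)
    limit = 8 * a + 2 * r  # equals 8a + 4a(rho-1)^2/rho
    for s in range(samples):
        z = 1 + cmath.rect(r, 2.0 * math.pi * s / samples)
        if focal_sum(z, a) > limit + tol:
            return False
    return True


print(item_1_holds(0.1, 2.0), item_2_holds(0.1, 2.0))
```

The value $\rho=2$ mirrors the ellipse $\widetilde{E}_{a,2}$ used later in Corollary 3.1.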

The following result shows an exponential convergence rate of the Chebyshev expansion.

Theorem 5 (Theorem 8.1, [Tre12]).

Let a function $f$ analytic on $[-1,1]$ be analytically continuable to the open Bernstein Ellipse $E_{\rho}$, where it satisfies $|f(z)|\leq M$ for some $M$. Then its Chebyshev coefficients satisfy

\[
|a_0|\leq M,\textup{ and }|a_k|\leq 2M\rho^{-k},\ k\geq 1.
\]
Proof.

The Chebyshev coefficients of $f$ are given by

\[
a_k=\frac{1}{\pi}\int_0^{2\pi}f(\cos\theta)\,T_k(\cos\theta)\,\mathrm{d}\theta=\frac{1}{\pi}\int_0^{2\pi}f(\cos\theta)\cos(k\theta)\,\mathrm{d}\theta,
\]

with $\pi$ replaced by $2\pi$ for $k=0$. Letting $z=e^{i\theta}$, we can write $\cos\theta=(z+z^{-1})/2$, $\mathrm{d}\theta=(iz)^{-1}\,\mathrm{d}z$, and hence

\[
a_k=\frac{1}{\pi i}\int_{|z|=1}f\left(\frac{z+z^{-1}}{2}\right)\frac{z^k+z^{-k}}{2}\cdot z^{-1}\,\mathrm{d}z.
\]

Denote $F(z)\coloneqq f((z+z^{-1})/2)=F(z^{-1})$. Note that we can substitute $z^{-1}$ for $z$ and obtain

\[
\frac{1}{\pi i}\int_{|z|=1}F(z)z^{k-1}\,\mathrm{d}z=-\frac{1}{\pi i}\int_{|z|=1}F(z^{-1})z^{-(k-1)}\,\mathrm{d}z^{-1}=\frac{1}{\pi i}\int_{|z|=1}F(z)z^{-k-1}\,\mathrm{d}z.
\]

Therefore we arrive at the expression

\[
a_k=\frac{1}{\pi i}\int_{|z|=1}F(z)z^{-k-1}\,\mathrm{d}z.
\]

Since $f(z)$ is analytic in the open Bernstein Ellipse $E_{\rho}$, we can conclude that $F(z)$ is analytic in the annulus $\rho^{-1}<|z|<\rho$. That means, for any $\rho_0\in(\rho^{-1},\rho)$ we have by Cauchy's integral theorem (cf. [Lan13, §3, Theorem 5.1]) that

\[
a_k=\frac{1}{\pi i}\int_{|z|=\rho_0}F(z)z^{-k-1}\,\mathrm{d}z.
\]

Now we have

\[
|a_k|\leq\frac{1}{\pi}\cdot\int_{|z|=\rho_0}|F(z)|\cdot|z|^{-k-1}\,|\mathrm{d}z|\leq\frac{1}{\pi}\cdot 2\pi\rho_0M\cdot\rho_0^{-k-1}=2M\rho_0^{-k}.
\]

For $k=0$ the same computation with $\pi$ replaced by $2\pi$ gives $|a_0|\leq M$. Finally, since the bound holds for any $\rho_0<\rho$, letting $\rho_0\to\rho$ shows that it also holds for $\rho_0=\rho$. ∎
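The geometric decay in Theorem 5 can be observed numerically. The sketch below uses $f=\exp$, which is entire and satisfies $|e^z|=e^{\operatorname{Re}z}\leq e^{(\rho+\rho^{-1})/2}$ on $E_{\rho}$, so $M=e^{(\rho+\rho^{-1})/2}$ is a valid bound for every $\rho>1$. The choice of test function, the quadrature, and the function names are assumptions of this illustration, not part of the paper.

```python
import math
from typing import Callable


def chebyshev_coefficient(f: Callable[[float], float], k: int,
                          nodes: int = 128) -> float:
    """Approximate a_k by an equispaced sum of f(cos t) cos(k t) over [0, 2*pi)."""
    total = 0.0
    for j in range(nodes):
        theta = 2.0 * math.pi * j / nodes
        total += f(math.cos(theta)) * math.cos(k * theta)
    # Normalisation 1/pi for k >= 1 and 1/(2*pi) for k = 0.
    return (1.0 if k == 0 else 2.0) * total / nodes


def bernstein_bound(M: float, rho: float, k: int) -> float:
    """Right-hand side of Theorem 5: M for k = 0, and 2 M rho^{-k} for k >= 1."""
    return M if k == 0 else 2.0 * M * rho ** (-k)


def exp_sup_on_ellipse(rho: float) -> float:
    """Upper bound on |exp(z)| over E_rho: exp of the major semi-axis."""
    return math.exp((rho + 1.0 / rho) / 2.0)


for rho in (1.5, 2.0, 5.0):
    M = exp_sup_on_ellipse(rho)
    ok = all(abs(chebyshev_coefficient(math.exp, k)) <= bernstein_bound(M, rho, k)
             for k in range(11))
    print(rho, ok)
```

Because $\exp$ is entire, the bound holds for every $\rho$; for a function with singularities, only values of $\rho$ whose ellipse avoids them would be admissible.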

We will also make use of the following theorem.

Theorem 6 (Hadamard Three Circles Theorem).

Suppose $f$ is analytic inside and on $\{z\in\mathbb{C}\colon r_1\leq|z|\leq r_2\}$. For $r\in[r_1,r_2]$, let $M(r)\coloneqq\sup_{|z|=r}|f(z)|$. Then

\[
M(r)^{\log(r_2/r_1)}\leq M(r_1)^{\log(r_2/r)}M(r_2)^{\log(r/r_1)}.
\]
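As a quick illustration of the three circles inequality, the sketch below compares both sides after taking logarithms, with $M(r)$ estimated by sampling the circle $|z|=r$. Sampling can underestimate a supremum in general; the example functions here attain their maximum modulus on the positive real axis, which is one of the sample points. The names `circle_max` and `three_circles_gap` are hypothetical.

```python
import cmath
import math
from typing import Callable


def circle_max(f: Callable[[complex], complex], r: float,
               samples: int = 2048) -> float:
    """Estimate M(r) = sup_{|z| = r} |f(z)| from equispaced samples."""
    return max(abs(f(cmath.rect(r, 2.0 * math.pi * s / samples)))
               for s in range(samples))


def three_circles_gap(f: Callable[[complex], complex],
                      r1: float, r: float, r2: float) -> float:
    """Log of the right-hand side minus log of the left-hand side.

    The theorem predicts a nonnegative value (up to rounding).
    """
    lhs = math.log(r2 / r1) * math.log(circle_max(f, r))
    rhs = (math.log(r2 / r) * math.log(circle_max(f, r1))
           + math.log(r / r1) * math.log(circle_max(f, r2)))
    return rhs - lhs


# Radii 1, 2, 4 give the square-root form M(2) <= M(1)^{1/2} M(4)^{1/2}.
print(three_circles_gap(lambda z: 1 + z + z * z, 1.0, 2.0, 4.0) >= 0)
# For a monomial the inequality is an equality, so the gap is about zero.
print(abs(three_circles_gap(lambda z: z ** 3, 1.0, 2.0, 4.0)) < 1e-9)
```

The radii $1,2,4$ are the same ones used in the proof of Corollary 3.1.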
Corollary 3.1.

Suppose $f(z)=\sum_{j=0}^{n-1}c_jz^j$ where $|c_j|\leq 1$. Then

\[
\sup_{z\in\partial\widetilde{E}_{a,2}}|f(z)|\leq\exp\left(5an/2\right)\cdot\left(\sup_{z\in[1-8a,1]}|f(z)|\right)^{1/2}.
\]
Proof.

Let $\rho_{1}=1$, $\rho=2$, $\rho_{2}=2^{2}$. Let $g(z)\coloneqq f(u)$ where $u=(1-4a)+4a(z+z^{-1})/2$. Since $f$ is analytic on and inside $\widetilde{E}_{a,\rho_{2}}$, $g$ is analytic inside the centered disk with radius $\rho_{2}$. Applying the Hadamard Three Circles Theorem to $g$ gives

\[ \sup_{z\in\partial\widetilde{E}_{a,\rho}}\lvert f(z)\rvert\leq\left(\sup_{z\in\partial\widetilde{E}_{a,\rho_{1}}}\lvert f(z)\rvert\right)^{1/2}\left(\sup_{z\in\partial\widetilde{E}_{a,\rho_{2}}}\lvert f(z)\rvert\right)^{1/2}. \]

We note that $\widetilde{E}_{a,\rho_{1}}$ coincides with the interval $[1-8a,1]$ on the real line. For $z\in\partial\widetilde{E}_{a,\rho_{2}}$, Proposition 2 implies $\lvert f(z)\rvert\leq n\cdot(1+2a(4-1)^{2}/4)^{n}\leq\exp(5an)$. Therefore

\[ \sup_{z\in\partial\widetilde{E}_{a,2}}\lvert f(z)\rvert\leq\exp\left(5an/2\right)\cdot\left(\sup_{z\in[1-8a,1]}\lvert f(z)\rvert\right)^{1/2}. \]
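The three-circles step can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it takes $\partial\widetilde{E}_{a,\rho}$ to be the image of the circle $\lvert z\rvert=\rho$ under the substitution $u=(1-4a)+4a(z+z^{-1})/2$ used in the proof, approximates each supremum by sampling, and checks only the intermediate inequality with $\rho_{1}=1$, $\rho=2$, $\rho_{2}=4$. The helper names `joukowski` and `sup_on_circle`, the test polynomial, and the value of `a` are hypothetical choices made for this example.

```python
import cmath
import math

def joukowski(z, a):
    # The substitution from the proof: u = (1 - 4a) + 4a (z + 1/z) / 2.
    return (1 - 4 * a) + 4 * a * (z + 1 / z) / 2

def sup_on_circle(coeffs, a, rho, samples=2000):
    # Approximate sup |f(u(z))| over |z| = rho by sampling the circle.
    best = 0.0
    for m in range(samples):
        z = rho * cmath.exp(2j * math.pi * m / samples)
        u = joukowski(z, a)
        value = sum(c * u ** j for j, c in enumerate(coeffs))
        best = max(best, abs(value))
    return best

a = 0.05                      # illustrative value, not taken from the paper
coeffs = [1.0] * 10           # f(z) = 1 + z + ... + z^9, so |c_j| <= 1

m1 = sup_on_circle(coeffs, a, 1.0)   # circle |z| = 1 maps onto [1 - 8a, 1]
m2 = sup_on_circle(coeffs, a, 2.0)
m4 = sup_on_circle(coeffs, a, 4.0)

# Three-circles inequality with rho = 2 the geometric mean of 1 and 4.
print(m2, "<=", math.sqrt(m1 * m4))
```

For a polynomial with nonnegative coefficients such as this one, each supremum is attained at the real point of the curve, so the sampled values are reliable here; for general coefficients a finer sampling may be needed.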

3.2 Proof of Lemma 3.2: A Counting Argument

We prove Lemma 3.2 in this section.

We first prove a technical lemma lower bounding the number of binary strings in which all 1s are far away from each other.

Lemma 3.3.

Let $S_{n,r}\subseteq\{0,1\}^{n}$ be the collection of all $n$-bit strings with the property that any two 1's are separated by at least $r$ many 0's. Then $\lvert S_{n,r}\rvert\geq(\sqrt{r+1})^{n/r-1}$.

Proof.

For ease of notation we fix $r$ and denote $f(n)\coloneqq\lvert S_{n,r}\rvert$. We observe that $f$ satisfies the following recurrence relation

\[ f(n)=\begin{cases}n+1, & \text{for } 0\leq n\leq r\\ f(n-1)+f(n-r-1), & \text{for } n\geq r+1\end{cases}. \]

We prove by induction that $f(n)\geq(\sqrt{r+1})^{n/r-1}$. The base case is trivial since $(\sqrt{r+1})^{n/r-1}\leq 1$ when $n\leq r$.

Now suppose $f(k)\geq(\sqrt{r+1})^{k/r-1}$ for $k\leq n-1$. This gives, for $k=n$, the following bound

\begin{align*}
f(n) &= f(n-1)+f(n-r-1)\\
&\geq(\sqrt{r+1})^{(n-1)/r-1}+(\sqrt{r+1})^{(n-r-1)/r-1}\\
&=(\sqrt{r+1})^{(n-1)/r-1}\left(1+\frac{1}{\sqrt{r+1}}\right).
\end{align*}

Since by the AM-GM inequality we have

\[ r+\frac{r}{\sqrt{r+1}}=r-\frac{1}{\sqrt{r+1}}+\sqrt{r+1}\geq\underbrace{1+1+\dots+1}_{r-1\text{ 1s}}+\sqrt{r+1}\geq r(\sqrt{r+1})^{1/r}, \]

or equivalently $1+1/\sqrt{r+1}\geq(\sqrt{r+1})^{1/r}$, we obtain

\[ f(n)\geq(\sqrt{r+1})^{(n-1)/r-1}\cdot(\sqrt{r+1})^{1/r}=(\sqrt{r+1})^{n/r-1}. \]

This completes the inductive step, and hence $\lvert S_{n,r}\rvert\geq(\sqrt{r+1})^{n/r-1}$ for all $n,r\in\mathbb{N}$. ∎
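The recurrence and the lower bound of Lemma 3.3 can be sanity-checked on small parameters. The following is a minimal sketch, not part of the original argument; the function names `count_recurrence`, `count_brute_force`, and `lower_bound` are hypothetical helpers introduced only for this illustration.

```python
from itertools import product
from math import sqrt

def count_recurrence(n, r):
    # f(m) = m + 1 for 0 <= m <= r, and f(m) = f(m-1) + f(m-r-1) for m >= r + 1.
    f = []
    for m in range(n + 1):
        if m <= r:
            f.append(m + 1)
        else:
            f.append(f[m - 1] + f[m - r - 1])
    return f[n]

def count_brute_force(n, r):
    # Enumerate all n-bit strings and keep those in which any two 1s
    # are separated by at least r zeros.
    total = 0
    for bits in product("01", repeat=n):
        ones = [i for i, b in enumerate(bits) if b == "1"]
        if all(later - earlier - 1 >= r for earlier, later in zip(ones, ones[1:])):
            total += 1
    return total

def lower_bound(n, r):
    # The bound (sqrt(r + 1))^(n/r - 1) from Lemma 3.3.
    return sqrt(r + 1) ** (n / r - 1)

for r in (1, 2, 3):
    for n in (4, 8, 12):
        print(r, n, count_recurrence(n, r), count_brute_force(n, r), lower_bound(n, r))
```

The brute-force count is exponential in $n$ and is meant only to confirm the recurrence on small inputs.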

In the following, we fix $k\coloneqq L^{1/3}$, and let $S\coloneqq 0^{k}\circ S_{L-k,k-1}$. The proof will focus on binary strings in the set $S$. We have $\lvert S\rvert\geq(\sqrt{(k-1)+1})^{(L-k)/(k-1)-1}\geq 2^{(L^{2/3}\log_{2}L)/6}$.

Below we characterize some properties of $k$-mer generating polynomials of strings in $S$.
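The second inequality in the bound on $\lvert S\rvert$ can be checked directly. A short derivation, under the simplifying assumption (made here only for illustration) that $k=L^{1/3}$ is an integer with $k\geq 2$, and using Lemma 3.3 with $n=L-k$ and $r=k-1$:

```latex
\begin{align*}
\frac{L-k}{k-1}-1 &= \frac{k^{3}-k}{k-1}-1 = k(k+1)-1 \geq k^{2} = L^{2/3},\\
\left(\sqrt{k}\right)^{(L-k)/(k-1)-1} &\geq k^{L^{2/3}/2} = 2^{(L^{2/3}\log_{2}k)/2} = 2^{(L^{2/3}\log_{2}L)/6}.
\end{align*}
```

The last equality uses $\log_{2}k=\tfrac{1}{3}\log_{2}L$. Prepending $0^{k}$ does not change the number of strings, so the same bound holds for $\lvert S\rvert$.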

Lemma 3.4.

Let $S$ be a set of strings defined as above. For $j=1,2,\dots,k$, denote by $\mathbf{e}_{j}$ the string with a single “1” located at index $j-1$ (indices begin with 0). The following properties hold.

  1. For any $k$-mer $w\notin\{0^{k},\mathbf{e}_{1},\dots,\mathbf{e}_{k}\}$, $P_{w,\mathbf{x}}(z)$ is the zero polynomial.

  2. For any $\mathbf{x}\in S$ and $1\leq j<k$, $P_{\mathbf{e}_{j},\mathbf{x}}(z)=(p+(1-p)z)\cdot P_{\mathbf{e}_{j+1},\mathbf{x}}(z)$.

  3. For any $\mathbf{x},\mathbf{y}\in S$ and $\lvert z\rvert\leq 1$, $\lvert P_{0^{k},\mathbf{x}}(z)-P_{0^{k},\mathbf{y}}(z)\rvert\leq k\cdot\lvert P_{\mathbf{e}_{k},\mathbf{x}}(z)-P_{\mathbf{e}_{k},\mathbf{y}}(z)\rvert$.

Proof.

Item 1: By definition of $S$, $\mathbf{x}[j:j+k-1]$ contains at most one “1” for any string $\mathbf{x}\in S$. Therefore, if $w$ contains at least two “1”s, then for any $0\leq\ell<L-k$,

\[ K_{w,\mathbf{x}}[\ell]=\sum_{j=0}^{L-k}\binom{j}{\ell}p^{\ell}(1-p)^{j-\ell}\,\mathbf{1}\{\mathbf{x}[j:j+k-1]=w\}=0. \]

This means all the coefficients of $P_{w,\mathbf{x}}(z)$ are zero, and hence $P_{w,\mathbf{x}}(z)$ is the zero polynomial.

Item 2: Since any two consecutive “1”s in $\mathbf{x}\in S$ are separated by at least $k-1$ “0”s, $\mathbf{x}[i:i+k-1]=\mathbf{e}_{j}$ if and only if $\mathbf{x}[i-1:i+k-2]=\mathbf{e}_{j+1}$. We thus have

\begin{align*}
P_{\mathbf{e}_{j},\mathbf{x}}(z) &= \sum_{i=0}^{L-k}\mathbf{1}\{\mathbf{x}[i:i+k-1]=\mathbf{e}_{j}\}\cdot(p+(1-p)z)^{i}\\
&=\sum_{i=1}^{L-k}\mathbf{1}\{\mathbf{x}[i-1:i+k-2]=\mathbf{e}_{j+1}\}\cdot(p+(1-p)z)^{i}\\
&=(p+(1-p)z)\cdot\sum_{i=0}^{L-k-1}\mathbf{1}\{\mathbf{x}[i:i+k-1]=\mathbf{e}_{j+1}\}\cdot(p+(1-p)z)^{i}\\
&=(p+(1-p)z)\cdot P_{\mathbf{e}_{j+1},\mathbf{x}}(z).
\end{align*}

We have used the fact that for $1\leq j<k$, $\mathbf{1}\{\mathbf{x}[0:k-1]=\mathbf{e}_{j}\}=\mathbf{1}\{\mathbf{x}[L-k:L-1]=\mathbf{e}_{j+1}\}=0$.

Item 3: We observe that $\mathbf{x}[i:i+k-1]\in\{0^{k},\mathbf{e}_{1},\dots,\mathbf{e}_{k}\}$. That implies

\begin{align*}
\sum_{w\in\{0^{k},\mathbf{e}_{1},\dots,\mathbf{e}_{k}\}}P_{w,\mathbf{x}}(z) &= \sum_{w\in\{0^{k},\mathbf{e}_{1},\dots,\mathbf{e}_{k}\}}\sum_{i=0}^{L-k}\mathbf{1}\{\mathbf{x}[i:i+k-1]=w\}\cdot(p+(1-p)z)^{i}\\
&=\sum_{i=0}^{L-k}\left(\sum_{w\in\{0^{k},\mathbf{e}_{1},\dots,\mathbf{e}_{k}\}}\mathbf{1}\{\mathbf{x}[i:i+k-1]=w\}\right)(p+(1-p)z)^{i}\\
&=\sum_{i=0}^{L-k}(p+(1-p)z)^{i}.
\end{align*}

Note that the right-hand side is independent of $\mathbf{x}$. Therefore

\begin{align*}
\lvert P_{0^{k},\mathbf{x}}(z)-P_{0^{k},\mathbf{y}}(z)\rvert &= \left\lvert\sum_{w\in\{\mathbf{e}_{1},\dots,\mathbf{e}_{k}\}}P_{w,\mathbf{x}}(z)-\sum_{w\in\{\mathbf{e}_{1},\dots,\mathbf{e}_{k}\}}P_{w,\mathbf{y}}(z)\right\rvert\\
&\leq\sum_{j=1}^{k}\lvert P_{\mathbf{e}_{j},\mathbf{x}}(z)-P_{\mathbf{e}_{j},\mathbf{y}}(z)\rvert\\
&=\sum_{j=1}^{k}\left\lvert\left(p+(1-p)z\right)^{k-j}\cdot\left(P_{\mathbf{e}_{k},\mathbf{x}}(z)-P_{\mathbf{e}_{k},\mathbf{y}}(z)\right)\right\rvert\\
&\leq k\cdot\lvert P_{\mathbf{e}_{k},\mathbf{x}}(z)-P_{\mathbf{e}_{k},\mathbf{y}}(z)\rvert.
\end{align*}

The second-to-last line is obtained by inductively applying Item 2, and the last line uses $\lvert p+(1-p)z\rvert\leq p+(1-p)\lvert z\rvert\leq 1$ for $\lvert z\rvert\leq 1$. ∎
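The three items of Lemma 3.4 can be checked numerically on a toy instance. This is a sketch under stated assumptions: it uses the form $P_{w,\mathbf{x}}(z)=\sum_{i}\mathbf{1}\{\mathbf{x}[i:i+k-1]=w\}(p+(1-p)z)^{i}$ that appears in the proof of Item 2, and the two example strings are chosen to end in at least $k-1$ zeros so that the boundary condition noted in that proof holds. The helpers `kmer_poly` and `unit`, and the values of `k`, `p`, `z`, `x1`, `x2`, are hypothetical choices for illustration.

```python
def kmer_poly(w, x, p, z):
    # Evaluate P_{w,x}(z) = sum over window positions i where x[i:i+k] == w
    # of (p + (1 - p) z)^i.
    k = len(w)
    y = p + (1 - p) * z
    return sum(y ** i for i in range(len(x) - k + 1) if x[i:i + k] == w)

def unit(j, k):
    # e_j: the length-k string with a single "1" at index j - 1.
    return "0" * (j - 1) + "1" + "0" * (k - j)

k, p, z = 3, 0.3, 0.4 + 0.5j       # |z| <= 1
x1 = "0001001000100"               # 0^k followed by 1s separated by >= k-1 zeros
x2 = "0000100100000"
L = len(x1)
y = p + (1 - p) * z

# Item 1: a k-mer with two 1s never occurs, so its polynomial vanishes.
print(kmer_poly("011", x1, p, z))

# Item 2: P_{e_j} = (p + (1-p) z) * P_{e_{j+1}} for 1 <= j < k.
item2_gaps = [abs(kmer_poly(unit(j, k), x1, p, z)
                  - y * kmer_poly(unit(j + 1, k), x1, p, z))
              for j in range(1, k)]
print(item2_gaps)

# Sum identity used in Item 3: the total over {0^k, e_1, ..., e_k}
# equals the geometric sum and does not depend on the string.
allowed = ["0" * k] + [unit(j, k) for j in range(1, k + 1)]
total = sum(kmer_poly(w, x1, p, z) for w in allowed)
geometric = sum(y ** i for i in range(L - k + 1))
print(abs(total - geometric))

# Item 3: |P_{0^k,x1} - P_{0^k,x2}| <= k * |P_{e_k,x1} - P_{e_k,x2}|.
lhs = abs(kmer_poly("0" * k, x1, p, z) - kmer_poly("0" * k, x2, p, z))
rhs = k * abs(kmer_poly(unit(k, k), x1, p, z) - kmer_poly(unit(k, k), x2, p, z))
print(lhs, "<=", rhs)
```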

Below we give the proof of Lemma 3.2. We use the notations $\exp(x)\coloneqq e^{x}$, and $\exp_{2}(x)\coloneqq 2^{x}$.

Proof of Lemma 3.2.

Let $\mathbf{x}\in S$ be a string of length $L$. In light of Lemma 3.4, we only need to consider a fixed $k$-mer $w=\mathbf{e}_{k}$, where $k\leq L^{1/3}$. Define

\[ g_{\mathbf{x}}(z)\coloneqq\sum_{j=0}^{L-k}\mathbf{1}\{\mathbf{x}[j:j+k-1]=\mathbf{e}_{k}\}\cdot z^{j}. \]

Recall that $g_{\mathbf{x}}(p+qz)=\sum_{j=0}^{L-1}K_{\mathbf{e}_{k},\mathbf{x}}[j]\cdot z^{j}=P_{\mathbf{e}_{k},\mathbf{x}}(z)$. Denote by $a_{0}(\mathbf{x}),\dots,a_{L-k}(\mathbf{x})$ the Chebyshev coefficients of

\[ f_{\mathbf{x}}(z)\coloneqq g_{\mathbf{x}}(1-4a+4a\cdot z), \]

where $a\coloneqq L^{-2/3}\log^{1/4}L$ (equivalently, the coordinates of $f_{\mathbf{x}}$ in the Chebyshev basis). In other words, we can write

\[ f_{\mathbf{x}}(z)=\sum_{j=0}^{L-k}a_{j}(\mathbf{x})\cdot T_{j}(z). \]
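To make the Chebyshev expansion concrete, the sketch below computes the coefficients $a_{j}(\mathbf{x})$ for a toy string. It is an illustration under stated assumptions: the coefficients are obtained from the standard discrete cosine sum at Chebyshev nodes (exact for polynomials of degree at most $L-k$), $\log$ is taken to be the natural logarithm, and the helper names `f_x`, `chebyshev_coefficients`, and `chebyshev_eval` as well as the example string are hypothetical.

```python
import math

def f_x(x, k, a, z):
    # f_x(z) = g_x(1 - 4a + 4a z), where g_x(u) sums u^j over the positions j
    # at which the window x[j:j+k] equals e_k = 0^{k-1} 1.
    e_k = "0" * (k - 1) + "1"
    u = 1 - 4 * a + 4 * a * z
    return sum(u ** j for j in range(len(x) - k + 1) if x[j:j + k] == e_k)

def chebyshev_coefficients(func, degree):
    # Coefficients a_0..a_degree with func(z) = sum_j a_j T_j(z), computed by
    # sampling at the Chebyshev nodes cos(pi (m + 1/2) / N), N = degree + 1.
    n_nodes = degree + 1
    thetas = [math.pi * (m + 0.5) / n_nodes for m in range(n_nodes)]
    values = [func(math.cos(t)) for t in thetas]
    coeffs = []
    for j in range(n_nodes):
        s = sum(v * math.cos(j * t) for v, t in zip(values, thetas))
        coeffs.append((1 if j == 0 else 2) * s / n_nodes)
    return coeffs

def chebyshev_eval(coeffs, z):
    # Evaluate sum_j a_j T_j(z) for z in [-1, 1] using T_j(z) = cos(j arccos z).
    return sum(c * math.cos(j * math.acos(z)) for j, c in enumerate(coeffs))

x = "0001001000100"                       # toy string in S with k = 3
k = 3
L = len(x)
a = L ** (-2 / 3) * math.log(L) ** 0.25   # a = L^{-2/3} log^{1/4} L

coeffs = chebyshev_coefficients(lambda z: f_x(x, k, a, z), L - k)
print([round(c, 6) for c in coeffs])
print(chebyshev_eval(coeffs, 0.3), f_x(x, k, a, 0.3))
```

For this toy length the parameter $a$ is not small, so the decay of the coefficients is less pronounced than in the asymptotic regime considered in the proof.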

We first argue that only the first few coefficients are significant. This can be done by applying Theorem 5 to $f_{\mathbf{x}}(z)$, say, with $\rho=2$. To that end, we first upper bound $\lvert f_{\mathbf{x}}(z)\rvert$ for $z\in E_{2}$. Denoting $z'=1-4a+4a\cdot z$, we have that $z'\in\widetilde{E}_{a,2}$ when $z\in E_{2}$. By item (1) of Proposition 2, we have $\lvert z'\rvert\leq 1+a$. It follows that

\[ \sup_{z\in E_{2}}\lvert f_{\mathbf{x}}(z)\rvert=\sup_{z'\in\widetilde{E}_{a,2}}\lvert g_{\mathbf{x}}(z')\rvert\leq L(1+a)^{L}\leq L\exp(aL)=L\exp(L^{1/3}\log^{1/4}L). \]

Therefore, we can apply Theorem 5 to $f_{\mathbf{x}}(z)$ with $\rho=2$, $M=L\exp(L^{1/3}\log^{1/4}L)$ and get (for large enough $L$)

\[ \forall j\geq L^{1/3}\sqrt{\log L},\quad\lvert a_{j}(\mathbf{x})\rvert\leq L\exp(L^{1/3}\log^{1/4}L)\cdot 2^{-L^{1/3}\sqrt{\log L}}\leq 2^{-L^{1/3}\sqrt{\log L}/8}. \]
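The role of the qualifier "for large enough $L$" in the last inequality can be seen by comparing both sides on a $\log_{2}$ scale. The sketch below assumes that $\log$ denotes the natural logarithm (the base is not restated in this section); the function name `log2_sides` and the sample values of $L$ are hypothetical.

```python
import math

def log2_sides(L):
    # t = L^{1/3} sqrt(log L), the threshold index and decay exponent.
    t = L ** (1 / 3) * math.sqrt(math.log(L))
    # log2 of M = L * exp(L^{1/3} log^{1/4} L).
    log2_m = math.log2(L) + L ** (1 / 3) * math.log(L) ** 0.25 * math.log2(math.e)
    lhs = log2_m - t     # log2 of  M * 2^{-t}
    rhs = -t / 8         # log2 of  2^{-t/8}
    return lhs, rhs

for L in (10 ** 4, 10 ** 6, 10 ** 8):
    lhs, rhs = log2_sides(L)
    print(L, round(lhs, 2), round(rhs, 2), lhs <= rhs)
```

Under this assumption the inequality fails for moderate lengths such as $L=10^{4}$ and holds once $L$ is sufficiently large, for example at $L=10^{6}$.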

To each string 𝐱{0,1}L𝐱superscript01𝐿\mathbf{x}\in\set{0,1}^{L}bold_x ∈ { start_ARG 0 , 1 end_ARG } start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT we associate a vector

ϕ(𝐱)(aj(𝐱):j=0,1,,L1/3logL1).\displaystyle\phi(\mathbf{x})\coloneqq\left(a_{j}(\mathbf{x})\colon j=0,1,% \dots,L^{1/3}\sqrt{\log L}-1\right).italic_ϕ ( bold_x ) ≔ ( italic_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_x ) : italic_j = 0 , 1 , … , italic_L start_POSTSUPERSCRIPT 1 / 3 end_POSTSUPERSCRIPT square-root start_ARG roman_log italic_L end_ARG - 1 ) .

Proposition 1 implies each entry of ϕ(𝐱)italic-ϕ𝐱\phi(\mathbf{x})italic_ϕ ( bold_x ) belongs to the interval [2(Lk+1),2(Lk+1)][2L,2L]2𝐿𝑘12𝐿𝑘12𝐿2𝐿[-2(L-k+1),2(L-k+1)]\subseteq[-2L,2L][ - 2 ( italic_L - italic_k + 1 ) , 2 ( italic_L - italic_k + 1 ) ] ⊆ [ - 2 italic_L , 2 italic_L ]. We now partition [2L,2L]2𝐿2𝐿[-2L,2L][ - 2 italic_L , 2 italic_L ] into m𝑚mitalic_m smaller intervals I1,,Imsubscript𝐼1subscript𝐼𝑚I_{1},\dots,I_{m}italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_I start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, each of length 2L1/3logL/8superscript2superscript𝐿13𝐿82^{-L^{1/3}\sqrt{\log L}/8}2 start_POSTSUPERSCRIPT - italic_L start_POSTSUPERSCRIPT 1 / 3 end_POSTSUPERSCRIPT square-root start_ARG roman_log italic_L end_ARG / 8 end_POSTSUPERSCRIPT, meaning that m=4L2L1/3logL/8𝑚4𝐿superscript2superscript𝐿13𝐿8m=4L\cdot 2^{L^{1/3}\sqrt{\log L}/8}italic_m = 4 italic_L ⋅ 2 start_POSTSUPERSCRIPT italic_L start_POSTSUPERSCRIPT 1 / 3 end_POSTSUPERSCRIPT square-root start_ARG roman_log italic_L end_ARG / 8 end_POSTSUPERSCRIPT. The vector ϕ(𝐱)italic-ϕ𝐱\phi(\mathbf{x})italic_ϕ ( bold_x ) must fall into one of the sub-cubes of the form

\[
\mathcal{I}(r) \coloneqq \prod_{0 \le j < L^{1/3}\sqrt{\log L}} I_{r(j)},
\]

where $r\colon [L^{1/3}\sqrt{\log L}] \rightarrow [m]$ is a mapping that uniquely identifies the sub-cube. It follows that the total number of such sub-cubes is

\[
\begin{aligned}
m^{L^{1/3}\sqrt{\log L}} &= \left(4L\cdot 2^{L^{1/3}\sqrt{\log L}/8}\right)^{L^{1/3}\sqrt{\log L}} \\
&= \exp_2\left(\left(\frac{L^{1/3}\sqrt{\log L}}{8} + \log_2 L + 2\right)\cdot L^{1/3}\sqrt{\log L}\right) \\
&\le \exp_2\left(\frac{L^{2/3}\log L}{8} + O\left(L^{1/3}\log^{3/2} L\right)\right) < 2^{(L^{2/3}\log L)/6} \le \left|S\right|
\end{aligned}
\]

for large enough $L$. By the Pigeonhole Principle, there must be two distinct strings $\mathbf{x},\mathbf{y}\in S$ such that $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$ fall into the same sub-cube. In other words, we have

\[
\forall\, 0 \le j < L^{1/3}\sqrt{\log L}, \quad \left|a_j(\mathbf{x}) - a_j(\mathbf{y})\right| \le 2^{-L^{1/3}\sqrt{\log L}/8}.
\]
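The counting step behind the pigeonhole argument can be sanity-checked numerically. The sketch below compares the base-2 logarithm of the number of sub-cubes with the exponent $(L^{2/3}\log L)/6$ that lower-bounds $\log_2|S|$. It assumes that $\log$ denotes $\log_2$ and treats $L^{1/3}\sqrt{\log L}$ as a real number without rounding; the function names are illustrative and not part of the paper.

```python
import math

def subcube_log2_count(L: float) -> float:
    """log2 of m^A, where A = L^(1/3) * sqrt(log L) and m = 4L * 2^(A/8)."""
    # Assumption: "log" is taken to be base 2, and A is not rounded to an integer.
    A = L ** (1 / 3) * math.sqrt(math.log2(L))
    log2_m = A / 8 + math.log2(L) + 2  # log2(4L * 2^(A/8))
    return log2_m * A

def pigeonhole_applies(L: float) -> bool:
    """True when the number of sub-cubes is below 2^{(L^{2/3} log L)/6}."""
    target = L ** (2 / 3) * math.log2(L) / 6
    return subcube_log2_count(L) < target

if __name__ == "__main__":
    for exponent in (20, 24, 30):
        print(exponent, pigeonhole_applies(2.0 ** exponent))
```

Under these assumptions the strict inequality already holds at $L = 2^{24}$ but not yet at $L = 2^{20}$, which illustrates why the statement is qualified by "for large enough $L$".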

It follows that

\[
\begin{aligned}
\sup_{z\in[1-8a,1]} \left|g_{\mathbf{x}}(z) - g_{\mathbf{y}}(z)\right| &= \sup_{z\in[-1,1]} \left|f_{\mathbf{x}}(z) - f_{\mathbf{y}}(z)\right| \\
&\le \sup_{z\in[-1,1]} \sum_{j=0}^{L-k} \left|a_j(\mathbf{x}) - a_j(\mathbf{y})\right| \cdot \left|T_j(z)\right| \\
&\le \sum_{j=0}^{L^{1/3}\sqrt{\log L}-1} 2^{-L^{1/3}\sqrt{\log L}/8} + \sum_{j=L^{1/3}\sqrt{\log L}}^{L-k} 2\cdot 2^{-L^{1/3}\sqrt{\log L}/8} \\
&\le 2^{-L^{1/3}\sqrt{\log L}/7}.
\end{aligned}
\]

Applying Corollary 3.1 to $g_{\mathbf{x}} - g_{\mathbf{y}}$ with $a = L^{-2/3}\log^{1/4} L$ gives (for large enough $L$)

\[
\begin{aligned}
\sup_{z\in\partial\widetilde{E}_{a,2}} \left|g_{\mathbf{x}}(z) - g_{\mathbf{y}}(z)\right| &\le \exp\left(5aL/2\right)\cdot \sup_{z\in[1-8a,1]} \left|g_{\mathbf{x}}(z) - g_{\mathbf{y}}(z)\right| \\
&\le \exp\left(5L^{1/3}\log^{1/4} L/2\right)\cdot 2^{-L^{1/3}\sqrt{\log L}/14} \le 2^{-L^{1/3}\sqrt{\log L}/15}.
\end{aligned}
\]

Let $\Gamma$ be the sub-arc of the circle $\{p + qz \colon |z| = 1\}$ which lies completely inside the ellipse $\widetilde{E}_{a,2}$. Item (2) of Proposition 2 implies that the length of $\Gamma$ is at least $a = L^{-2/3}\log^{1/4} L$. Therefore the Maximum Modulus Principle implies

\[
\sup_{\theta\colon |\theta|\le a} \left|P_{\mathbf{e}_k,\mathbf{x}}(e^{i\theta}) - P_{\mathbf{e}_k,\mathbf{y}}(e^{i\theta})\right| \le \sup_{z\in\Gamma} \left|g_{\mathbf{x}}(z) - g_{\mathbf{y}}(z)\right| \le \sup_{z\in\widetilde{E}_{a,2}} \left|g_{\mathbf{x}}(z) - g_{\mathbf{y}}(z)\right| \le 2^{-L^{1/3}\sqrt{\log L}/15}.
\]

Now we have established the lemma for a fixed $k$-mer $w = \mathbf{e}_k$. Since $\mathbf{x},\mathbf{y}\in S$, Lemma 3.4 says that for any other $k$-mer $w\in\{0,1\}^{k}$ either both $P_{w,\mathbf{x}}(z)$ and $P_{w,\mathbf{y}}(z)$ are zero polynomials, or $w\in\{0^k, \mathbf{e}_1, \dots, \mathbf{e}_k\}$ and

\[
\sup_{\theta\colon |\theta|\le a} \left|P_{w,\mathbf{x}}(e^{i\theta}) - P_{w,\mathbf{y}}(e^{i\theta})\right| \le k\cdot \sup_{\theta\colon |\theta|\le a} \left|P_{\mathbf{e}_k,\mathbf{x}}(e^{i\theta}) - P_{\mathbf{e}_k,\mathbf{y}}(e^{i\theta})\right| \le 2^{-L^{1/3}\sqrt{\log L}/20}.
\]

Finally, we note that both $\mathbf{x}$ and $\mathbf{y}$ start with a run of 0's of length $k = L^{1/3}$.

Remark 4.

A much simpler proof for the slightly weaker bound $2^{-\Omega(L^{1/3})}$ is possible based on the complex-analytic result of Borwein and Erdélyi [BE97, Theorem 3.3] (see also [DOS19, NP17]): there exist strings $\mathbf{x},\mathbf{y}\in\{0,1\}^{L^{2/3}}$ such that

\[
2^{-\Omega(L^{1/3})} \ge \sup_{z\in\Gamma_{L^{-1/3}}} \left|P_{\mathbf{x}}(z) - P_{\mathbf{y}}(z)\right| = \sup_{z\in\Gamma_{L^{-2/3}}} \left|P_{\mathbf{x}}(z^{L^{1/3}}) - P_{\mathbf{y}}(z^{L^{1/3}})\right|,
\]

where $P_{\mathbf{x}}(z) \coloneqq \sum_{j=0}^{|\mathbf{x}|-1} x_j z^j$ and $\Gamma_a$ stands for the sub-arc $\{e^{i\theta} \colon |\theta| < a\}$. Now we observe that $P_{\mathbf{x}}(z^{L^{1/3}}) = P_{\mathbf{x}'}(z)$, up to a unimodular factor $z^{L^{1/3}-1}$ that does not affect absolute values on the unit circle, where $\mathbf{x}'\in\{0,1\}^{L}$ is the string obtained by inserting $L^{1/3}-1$ many 0's before every bit of $\mathbf{x}$ ($\mathbf{y}'$ is defined similarly).
Clearly, $\mathbf{x}',\mathbf{y}'\in S$ since any two 1's are separated by at least $L^{1/3}-1$ many 0's. Therefore, they enjoy the properties in Lemma 3.4, and Lemma 3.1 follows with a weaker bound. (Footnote 1: We thank an anonymous reviewer for pointing out this observation to us.)
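The padding step in the remark can be illustrated on a toy string. The sketch below is a minimal illustration with hypothetical helper names (`poly`, `dilate`); it uses the 0-indexed definition $P_{\mathbf{x}}(z) = \sum_j x_j z^j$ from the remark and a small integer `k` in place of $L^{1/3}$. With zeros inserted before every bit, the padded polynomial equals $z^{k-1}P_{\mathbf{x}}(z^k)$, so the two sides have the same modulus at every point of the unit circle.

```python
import cmath

def poly(bits, z):
    """Evaluate P_x(z) = sum_j x_j * z^j for a 0-indexed list of bits."""
    return sum(b * z ** j for j, b in enumerate(bits))

def dilate(bits, k):
    """Insert k-1 zeros before every bit; the result has length k * len(bits)."""
    out = []
    for b in bits:
        out.extend([0] * (k - 1) + [b])
    return out

x = [1, 0, 1, 1]          # toy source string (illustrative)
k = 3                     # stands in for L^(1/3)
xp = dilate(x, k)         # the padded string x'
z = cmath.exp(0.2j)       # a point on the unit circle

lhs = poly(xp, z)                      # P_{x'}(z)
rhs = z ** (k - 1) * poly(x, z ** k)   # z^(k-1) * P_x(z^k)
print(xp, abs(lhs - rhs))
```

Since $|z^{k-1}| = 1$ on the unit circle, suprema of $|P_{\mathbf{x}'} - P_{\mathbf{y}'}|$ over an arc coincide with those of $|P_{\mathbf{x}}(z^k) - P_{\mathbf{y}}(z^k)|$ over the same arc.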

4 Optimality of the Maximum Likelihood Estimation

Proof of Theorem 3.

For $1\le i\le m$ define $S_i \coloneqq \{x\in\Omega \colon D_0(x) > D_i(x)\}$, and let $S \coloneqq S_1\cap S_2\cap\dots\cap S_m$. By definition of the total variation distance, we have

\[
1-\varepsilon \le d_{\mathrm{TV}}\left(D_0, D_i\right) = D_0(S_i) - D_i(S_i) \le D_0(S_i).
\]

The Union Bound thus implies $D_0(S) \ge 1 - m\varepsilon$. Moreover, by Definition 3, when $x\in S$ it must hold that $\mathrm{MLE}(x;\mathcal{D}) = 0$. Therefore

\[
\Pr_{x\sim D_0}\left[\mathrm{MLE}(x;\mathcal{D}) = 0\right] \ge \Pr_{x\sim D_0}\left[x\in S\right] = D_0(S) \ge 1 - m\varepsilon.
\]
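The argument above can be checked on a small finite example. The sketch below is only an illustration: the three distributions are made up for this purpose, and the tie-breaking rule in `mle` (smallest index wins) is an assumption, since Definition 3 is not restated here. It computes $\varepsilon$ from the pairwise total variation distances to $D_0$ and compares the success probability of the maximum likelihood estimator under $D_0$ with the bound $1 - m\varepsilon$.

```python
from fractions import Fraction as F

def tv(p, q):
    """Total variation distance between two finitely supported distributions."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys) / 2

def mle(x, dists):
    """Index i maximizing D_i(x); ties go to the smallest index (an assumption)."""
    return max(range(len(dists)), key=lambda i: (dists[i].get(x, 0), -i))

# Toy family (hypothetical values): D0 is far from both D1 and D2.
D0 = {0: F(9, 10), 1: F(1, 20), 2: F(1, 20)}
D1 = {0: F(1, 20), 1: F(19, 20)}
D2 = {0: F(1, 20), 2: F(19, 20)}
dists = [D0, D1, D2]

m = len(dists) - 1
eps = 1 - min(tv(D0, D) for D in dists[1:])           # here eps = 1/10
success = sum(p for x, p in D0.items() if mle(x, dists) == 0)
print(eps, success, 1 - m * eps)
```

In this instance the estimator succeeds with probability $9/10$, which is consistent with the guaranteed lower bound $1 - m\varepsilon = 4/5$.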

Proof of Corollary 1.1.

The Chernoff bound implies that if we repeat the purported reconstruction algorithm $8\ln(1/\varepsilon)n$ times and output the majority, it succeeds with probability at least $1-\varepsilon/2^{n+1}$.
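To see how majority voting drives the failure probability down, the following sketch computes the exact probability that at most half of $r$ independent runs succeed and compares it with Hoeffding's inequality, a Chernoff-type bound. The per-run success probability `p = 2/3` is an assumption made for illustration only; the constants of Corollary 1.1 are not reproduced here.

```python
import math

def majority_failure(r: int, p: float) -> float:
    """Exact probability that at most r/2 of r independent runs succeed."""
    return sum(
        math.comb(r, k) * p ** k * (1 - p) ** (r - k)
        for k in range(r // 2 + 1)
    )

def hoeffding_bound(r: int, p: float) -> float:
    """Hoeffding upper bound exp(-2 r (p - 1/2)^2) on the same event, for p > 1/2."""
    return math.exp(-2 * r * (p - 0.5) ** 2)

if __name__ == "__main__":
    p = 2 / 3  # assumed per-run success probability (hypothetical)
    for r in (11, 101, 201):
        print(r, majority_failure(r, p), hoeffding_bound(r, p))
```

The failure probability decays exponentially in the number of repetitions, so a number of repetitions linear in $n\ln(1/\varepsilon)$ is the right order of magnitude for reaching an error of the form $\varepsilon/2^{n+1}$.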

Let $A$ be such a (deterministic) reconstruction algorithm using $T' = 8\ln(1/\varepsilon)\cdot nT$ traces as described above, which outputs the source string $x$ with probability at least $1-\varepsilon/2^{n+1}$. Formally, for any source string $x\in\{0,1\}^{n}$, it holds that

\[
\Pr_{\widetilde{x}_1,\dots,\widetilde{x}_{T'}\sim D_x}\left[A(\widetilde{x}_1,\dots,\widetilde{x}_{T'}) = x\right] \ge 1 - \varepsilon/2^{n+1}.
\]

Let $R_x \subseteq \left(\{0,1\}^{\le n}\right)^{T'}$ be exactly the collection of $T'$-tuples of strings on which $A$ outputs $x$. We thus have

\[
\forall x\in\{0,1\}^{n}, \quad D_x^{\otimes T'}(R_x) \ge 1 - \varepsilon/2^{n+1},
\]

where $D_x^{\otimes T'}$ denotes the $T'$-fold product of $D_x$ with itself, capturing the distribution of $\left(\widetilde{x}_1,\dots,\widetilde{x}_{T'}\right)$. On the other hand, for distinct strings $x$ and $y$ we have $R_x\cap R_y = \varnothing$ (by definition, $A$ cannot output both $x$ and $y$ on the same input), and hence the bound

\[
D_y^{\otimes T'}(R_x) \le 1 - D_y^{\otimes T'}(R_y) \le \varepsilon/2^{n+1}.
\]

This implies

\[
d_{\mathrm{TV}}\left(D_x^{\otimes T'}, D_y^{\otimes T'}\right) = \sup_{S} \left|D_x^{\otimes T'}(S) - D_y^{\otimes T'}(S)\right| \ge D_x^{\otimes T'}(R_x) - D_y^{\otimes T'}(R_x) \ge 1 - 2\varepsilon/2^{n+1} = 1 - \varepsilon/2^{n}.
\]

We stress that the above bound holds for any pair of distinct strings $x,y\in\{0,1\}^{n}$. Applying Theorem 3 to $\mathcal{D} \coloneqq \{D_x^{\otimes T'} \colon x\in\{0,1\}^{n}\}$ gives

\[
\forall x\in\{0,1\}^{n}, \quad \Pr_{\widetilde{x}_1,\dots,\widetilde{x}_{T'}\sim D_x}\left[\mathrm{MLE}(\widetilde{x}_1,\dots,\widetilde{x}_{T'};\mathcal{D}) = x\right] \ge 1 - (2^{n}-1)\cdot\varepsilon/2^{n} \ge 1 - \varepsilon.
\]

Proof of Theorem 4.

The distributions are defined as follows. Let $t = \lfloor n/4 \rfloor$, and so $m = \binom{n}{t}$. The domain is $\Omega = \Omega_1\cup\Omega_2$, where $\Omega_1 = \binom{[n]}{t}$ is the collection of all subsets of $[n]$ of size exactly $t$, and $\Omega_2 = [n]$. We have

\[
\left|\Omega\right| = \binom{n}{t} + n = m + n.
\]

We first define $D_0$ to be the uniform distribution over $\Omega_2$, i.e., $D_0(\mathfrak{s}) = 1/n$ for any $\mathfrak{s}\in[n]$.

We identify each of the remaining $m$ distributions with a $t$-subset $S\in\binom{[n]}{t}$. The precise definition of $D_{S}$ is as follows.

$$\forall S\in\binom{[n]}{t},\quad D_{S}(\mathfrak{s})=\begin{cases}2/3 & \textup{if $\mathfrak{s}\in\Omega_{1}$ and $\mathfrak{s}=S$,}\\ 1/(3t) & \textup{if $\mathfrak{s}\in\Omega_{2}$ and $\mathfrak{s}\in S$,}\\ 0 & \textup{otherwise.}\end{cases}$$

In other words, $\mathfrak{s}\in\Omega_{1}$ occurs with probability $2/3$, conditioned on which $D_{S}$ is the point distribution supported on $\{S\}$; $\mathfrak{s}\in\Omega_{2}$ occurs with probability $1/3$, conditioned on which $D_{S}$ is the uniform distribution over $S$. Now we verify that $\mathcal{D}=\{D_{0},D_{1},\dots,D_{m}\}$ satisfies the two conditions.

For Condition 1, consider a distinguisher $A$ which, on sample $\mathfrak{s}\in\Omega$, outputs $S$ if $\mathfrak{s}=S\in\Omega_{1}$, and outputs $0$ if $\mathfrak{s}\in\Omega_{2}$. We have

$$\Pr_{\mathfrak{s}\sim D_{0}}\left[A(\mathfrak{s})=0\right]=1\geq 2/3,\quad \Pr_{\mathfrak{s}\sim D_{S}}\left[A(\mathfrak{s})=S\right]=2/3.$$
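The two success probabilities above can be checked exactly on a small instance. The following is a minimal sketch, not part of the original proof: the function names are hypothetical, elements of $\Omega_{1}$ are encoded as `frozenset` objects and elements of $\Omega_{2}$ as integers, and the choice $n=8$ (so $t=2$) is only for illustration.

```python
from fractions import Fraction
from itertools import combinations


def d_zero(n):
    """Uniform distribution D_0 over Omega_2 = [n] (elements encoded as ints)."""
    return {s: Fraction(1, n) for s in range(1, n + 1)}


def d_subset(S):
    """Distribution D_S: mass 2/3 on the set S itself, 1/(3t) on each element of S."""
    S = frozenset(S)
    t = len(S)
    pmf = {S: Fraction(2, 3)}
    pmf.update({s: Fraction(1, 3 * t) for s in S})
    return pmf


def distinguisher(sample):
    """The distinguisher A: echo a sample from Omega_1, output 0 on Omega_2."""
    return sample if isinstance(sample, frozenset) else 0


def success_probability(pmf, label):
    """Exact probability that A outputs `label` on a single sample from `pmf`."""
    return sum((p for s, p in pmf.items() if distinguisher(s) == label), Fraction(0))


if __name__ == "__main__":
    n = 8
    t = n // 4
    print(success_probability(d_zero(n), 0))
    for S in combinations(range(1, n + 1), t):
        assert success_probability(d_subset(S), frozenset(S)) == Fraction(2, 3)
```

Exact rational arithmetic via `Fraction` avoids any floating-point ambiguity when comparing against the threshold $2/3$.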

To see Condition 2, let $\mathfrak{s}_{1},\dots,\mathfrak{s}_{T}\sim D_{0}$ be $T\leq\lfloor n/4\rfloor$ samples. Since $D_{0}$ is supported on $\Omega_{2}=[n]$, the samples are all elements of $[n]$, and since at most $T\leq t$ of them are distinct, there is at least one $S\in\binom{[n]}{t}$ containing all samples. Calculating the likelihoods gives

$$\prod_{i=1}^{T}D_{S}(\mathfrak{s}_{i})=\left(\frac{1}{3t}\right)^{T}\geq\left(\frac{4}{3n}\right)^{T}>\left(\frac{1}{n}\right)^{T}=\prod_{i=1}^{T}D_{0}(\mathfrak{s}_{i}).$$

Therefore, the output of the Maximum Likelihood Estimation on $\mathfrak{s}_{1},\dots,\mathfrak{s}_{T}\sim D_{0}$ will never be $0$.
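The likelihood comparison can also be verified by brute force for small parameters. The sketch below is an illustration under stated assumptions rather than part of the proof: all function names are hypothetical, sets in $\Omega_{1}$ are encoded as `frozenset` objects, ties are broken in favour of $D_{0}$ (the most favourable rule for outputting $0$), and $n=8$ is an arbitrary small choice.

```python
from fractions import Fraction
from itertools import combinations, product


def build_family(n):
    """Return (t, D_0, {S: D_S}) for the construction with t = floor(n/4)."""
    t = n // 4
    ground = range(1, n + 1)
    d0 = {s: Fraction(1, n) for s in ground}
    family = {}
    for S in combinations(ground, t):
        pmf = {frozenset(S): Fraction(2, 3)}
        pmf.update({s: Fraction(1, 3 * t) for s in S})
        family[frozenset(S)] = pmf
    return t, d0, family


def likelihood(pmf, samples):
    """Product of pmf values over the samples (0 outside the support)."""
    value = Fraction(1)
    for s in samples:
        value *= pmf.get(s, Fraction(0))
    return value


def mle(samples, d0, family):
    """Return 0 if D_0 attains the maximum likelihood (ties go to D_0), else some S."""
    best_label, best = 0, likelihood(d0, samples)
    for S, pmf in family.items():
        value = likelihood(pmf, samples)
        if value > best:
            best_label, best = S, value
    return best_label


def mle_ever_outputs_zero(n, T):
    """Check every length-T sample sequence in the support of D_0."""
    _, d0, family = build_family(n)
    return any(mle(seq, d0, family) == 0
               for seq in product(range(1, n + 1), repeat=T))


if __name__ == "__main__":
    for T in (1, 2, 3):
        print(T, mle_ever_outputs_zero(8, T))
```

With $n=8$ we have $t=2$, and the check reports that the MLE never outputs $0$ for $T\in\{1,2\}$. For $T=3$ it can output $0$: three distinct samples fit in no $2$-subset, so every $D_{S}$ has likelihood zero, which shows that the restriction $T\leq\lfloor n/4\rfloor$ is what the argument relies on.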

5 Acknowledgements

We are thankful to several anonymous reviewers for their valuable suggestions and comments.

References

  • [BCF+{}^{+}start_FLOATSUPERSCRIPT + end_FLOATSUPERSCRIPT19] Frank Ban, Xi Chen, Adam Freilich, Rocco A Servedio, and Sandip Sinha. Beyond trace reconstruction: Population recovery from the deletion channel. In 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, pages 745–768. IEEE, 2019.
  • [BE97] Peter Borwein and Tamás Erdélyi. Littlewood-type problems on subarcs of the unit circle. Indiana University mathematics journal, pages 1323–1346, 1997.
  • [BEK99] Peter Borwein, Tamás Erdélyi, and Géza Kós. Littlewood-type problems on [0,1]01[0,1][ 0 , 1 ]. Proceedings of the London Mathematical Society, 79(1):22–46, 1999.
  • [BKKM04] Tugkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing strings from random traces. In J. Ian Munro, editor, Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11-14, 2004, pages 910–918. SIAM, 2004.
  • [BLS20] Joshua Brakensiek, Ray Li, and Bruce Spang. Coded trace reconstruction in a constant number of traces. In 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, pages 482–493. IEEE, 2020.
  • [CDK21] Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Approximate trace reconstruction via median string (in average-case). In 41st IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2021, volume 213 of LIPIcs, pages 11:1–11:23, 2021.
  • [CDL+{}^{+}start_FLOATSUPERSCRIPT + end_FLOATSUPERSCRIPT21a] Xi Chen, Anindya De, Chin Ho Lee, Rocco A Servedio, and Sandip Sinha. Near-optimal average-case approximate trace reconstruction from few traces. arXiv preprint arXiv:2107.11530, 2021. (To appear in SODA 2022).
  • [CDL+{}^{+}start_FLOATSUPERSCRIPT + end_FLOATSUPERSCRIPT21b] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the smoothed complexity model. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, pages 54–73. SIAM, 2021.
  • [CDRV21] Mahdi Cheraghchi, Joseph Downs, João L. Ribeiro, and Alexandra Veliche. Mean-based trace reconstruction over practically any replication-insertion channel. In IEEE International Symposium on Information Theory, ISIT 2021, pages 2459–2464. IEEE, 2021.
  • [CGMR20] Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic, and João Ribeiro. Coded trace reconstruction. IEEE Transactions on Information Theory, 66(10):6084–6103, 2020.
  • [Cha21a] Zachary Chase. New lower bounds for trace reconstruction. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 57, pages 627–643. Institut Henri Poincaré, 2021.
  • [Cha21b] Zachary Chase. Separating words and trace reconstruction. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, pages 21–31. ACM, 2021.
  • [CP21] Zachary Chase and Yuval Peres. Approximate trace reconstruction of random strings from a constant number of traces. arXiv preprint arXiv:2107.06454, 2021.
  • [DOS19] Anindya De, Ryan O’Donnell, and Rocco A Servedio. Optimal mean-based algorithms for trace reconstruction. The Annals of Applied Probability, 29(2):851–874, 2019.
  • [DRSR21] Sami Davies, Miklós Z Rácz, Benjamin G Schiffer, and Cyrus Rashtchian. Approximate trace reconstruction: Algorithms. In IEEE International Symposium on Information Theory, ISIT 2021, pages 2525–2530. IEEE, 2021.
  • [DS03] Miroslav Dudık and Leonard J Schulman. Reconstruction from subsequences. Journal of Combinatorial Theory, Series A, 103(2):337–348, 2003.
  • [GM17] Ryan Gabrys and Olgica Milenkovic. The hybrid k-deck problem: Reconstructing sequences from short and long traces. In IEEE International Symposium on Information Theory, ISIT 2017, pages 1306–1310. IEEE, 2017.
  • [GM19] Ryan Gabrys and Olgica Milenkovic. Unique reconstruction of coded strings from multiset substring spectra. IEEE Transactions on Information Theory, 65(12):7682–7696, 2019.
  • [GSZ22] Elena Grigorescu, Madhu Sudan, and Minshen Zhu. Limitations of mean-based algorithms for trace reconstruction at small edit distance. IEEE Trans. Inf. Theory, 68(10):6790–6801, 2022.
  • [HHP18] Lisa Hartung, Nina Holden, and Yuval Peres. Trace reconstruction with varying deletion probabilities. In Proceedings of the Fifteenth Workshop on Analytic Algorithmics and Combinatorics, ANALCO 2018, pages 54–61. SIAM, 2018.
  • [HL20] Nina Holden and Russell Lyons. Lower bounds for trace reconstruction. The Annals of Applied Probability, 30(2):503–525, 2020.
  • [HMPW08] Thomas Holenstein, Michael Mitzenmacher, Rina Panigrahy, and Udi Wieder. Trace reconstruction with constant deletion probability and related results. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, pages 389–398. SIAM, 2008.
  • [HPP18] Nina Holden, Robin Pemantle, and Yuval Peres. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, volume 75 of Proceedings of Machine Learning Research, pages 1799–1840. PMLR, 2018.
  • [KM05] Sampath Kannan and Andrew McGregor. More on reconstructing strings from random traces: insertions and deletions. In IEEE International Symposium on Information Theory, ISIT 2005, pages 297–301. IEEE, 2005.
  • [KMMP21] Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace reconstruction: Generalized and parameterized. IEEE Transactions on Information Theory, 67(6):3233–3250, 2021.
  • [KR97] Ilia Krasikov and Yehuda Roditty. On a reconstruction problem for sequences,. J. Comb. Theory, Ser. A, 77(2):344–348, 1997.
  • [Lan13] Serge Lang. Complex analysis, volume 103. Springer Science & Business Media, 2013.
  • [Lev01a] Vladimir I. Levenshtein. Efficient reconstruction of sequences. IEEE Transactions on Information Theory, 47(1):2–22, 2001.
  • [Lev01b] Vladimir I. Levenshtein. Efficient reconstruction of sequences from their subsequences or supersequences. J. Comb. Theory, Ser. A, 93(2):310–332, 2001.
  • [Mas80] John C Mason. Near-best multivariate approximation by fourier series, chebyshev series and chebyshev interpolation. Journal of Approximation Theory, 28(4):349–358, 1980.
  • [MPV14] Andrew McGregor, Eric Price, and Sofya Vorotnikova. Trace reconstruction revisited. In 22th Annual European Symposium on Algorithms, ESA 2014, volume 8737 of Lecture Notes in Computer Science, pages 689–700. Springer, 2014.
  • [MS22] Kayvon Mazooji and Ilan Shomorony. Substring density estimation from traces. CoRR, abs/2210.10917, 2022.
  • [NP17] Fedor Nazarov and Yuval Peres. Trace reconstruction with exp(O(n1/3))𝑂superscript𝑛13\exp(O(n^{1/3}))roman_exp ( italic_O ( italic_n start_POSTSUPERSCRIPT 1 / 3 end_POSTSUPERSCRIPT ) ) samples. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 1042–1046. ACM, 2017.
  • [NR21] Shyam Narayanan and Michael Ren. Circular trace reconstruction. In 12th Innovations in Theoretical Computer Science Conference (ITCS 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2021.
  • [PZ17] Yuval Peres and Alex Zhai. Average-case reconstruction for the deletion channel: Subpolynomially many traces suffice. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, pages 228–239. IEEE Computer Society, 2017.
  • [Rub23] Ittai Rubinstein. Average-case to (shifted) worst-case reduction for the trace reconstruction problem. In Kousha Etessami, Uriel Feige, and Gabriele Puppis, editors, 50th International Colloquium on Automata, Languages, and Programming, ICALP 2023, July 10-14, 2023, Paderborn, Germany, volume 261 of LIPIcs, pages 102:1–102:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023.
  • [SB21] ** Sima and Jehoshua Bruck. Trace reconstruction with bounded edit distance. In IEEE International Symposium on Information Theory, ISIT 2021, pages 2519–2524. IEEE, 2021.
  • [Sco97] Alex D Scott. Reconstructing sequences. Discrete Mathematics, 175(1-3):231–238, 1997.
  • [Tre12] Lloyd N. Trefethen. Approximation Theory and Approximation Practice. SIAM, 2012.
  • [Tre17] Lloyd Trefethen. Multivariate polynomial approximation in the hypercube. Proceedings of the American Mathematical Society, 145(11):4837–4844, 2017.
  • [VS08] Krishnamurthy Viswanathan and Ram Swaminathan. Improved string reconstruction over insertion-deletion channels. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, pages 399–408. SIAM, 2008.
  • [YGM17] S. M. Hossein Tabatabaei Yazdi, Ryan Gabrys, and Olgica Milenkovic. Portable and error-free dna-based data storage. Scientific Reports, 7:2045–2322, 2017.