On $k$ -Mer-Based and Maximum Likelihood Estimation Algorithms for Trace Reconstruction

Kuan Cheng Peking University, Haidian, Beijing, China. Email: [email protected]. Elena Grigorescu Purdue University, West Lafayette, IN, USA. Supported in part by NSF CCF-1910411, and NSF CCF-2228814. Email: [email protected]. Xin Li Johns Hopkins University, Baltimore, MD, USA. Supported in part by NSF CAREER Award CCF-1845349 and NSF Award CCF-2127575. Email: [email protected]. Madhu Sudan School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, USA. Supported in part by a Simons Investigator Award and NSF Award CCF 2152413. Email: [email protected]. Minshen Zhu Most of the work was done as a PhD student at Purdue University. Supported in part by NSF CCF-1910411, and NSF CCF-2228814. Email: [email protected].

Abstract

The goal of the trace reconstruction problem is to recover a string $\mathbf{x}\in\{0,1\}^{n}$ given many independent traces of $\mathbf{x}$, where a trace is a subsequence obtained by deleting bits of $\mathbf{x}$ independently with some given probability $p\in[0,1).$ A recent result of Chase (STOC 2021) shows how $\mathbf{x}$ can be determined (in exponential time) from $\exp(O(n^{1/5}\log^{5}n))$ traces. This is the state-of-the-art result on the sample complexity of trace reconstruction.

In this paper we consider two kinds of algorithms for the trace reconstruction problem.

We first observe that the bound of Chase, which is based on statistics of arbitrary length-$k$ subsequences, can also be obtained by considering the “$k$-mer statistics”, i.e., statistics regarding occurrences of contiguous $k$-bit strings (a.k.a. $k$-mers) in the initial string $\mathbf{x}$, for $k=2n^{1/5}$. Mazooji and Shomorony (arXiv.2210.10917) show that such statistics (called the $k$-mer density map) can be estimated within $\varepsilon$ accuracy from $\mathrm{poly}(n,2^{k},1/\varepsilon)$ traces. We call an algorithm $k$-mer-based if it reconstructs $\mathbf{x}$ given estimates of the $k$-mer density map. Such algorithms essentially capture all the analyses in the worst-case and smoothed-complexity models of the trace reconstruction problem we know of so far.

Our first, and technically more involved, result shows that any $k$ -mer-based algorithm for trace reconstruction must use $\exp(\Omega(n^{1/5}\sqrt{\log n}))$ traces, under the assumption that the estimator requires $\mathrm{poly}(2^{k},1/\varepsilon)$ traces, thus establishing the optimality of this number of traces. The analysis of this result also shows that the analysis technique used by Chase (STOC 2021) is essentially tight, and hence new techniques are needed in order to improve the worst-case upper bound.

This result is shown by considering an appropriate class of real polynomials, that have been previously studied in the context of trace estimation (De, O’Donnell, Servedio. Annals of Probability 2019; Nazarov, Peres. STOC 2017), and proving that two of these polynomials are very close to each other on an arc in the complex plane. Our proof of the proximity of such polynomials uses new technical ingredients that allow us to focus on just a few coefficients of these polynomials.

Our second, simple, result considers the performance of the Maximum Likelihood Estimator (MLE), which specifically picks the source string that has the maximum likelihood to generate the samples (traces). We show that the MLE algorithm uses a nearly optimal number of traces, i.e., up to a factor of $n$ in the number of samples needed for an optimal algorithm, and show that this factor of $n$ loss may be necessary under general “model estimation” settings.

1 Introduction

The trace reconstruction problem is an infamous question introduced by Batu, Kannan, Khanna and McGregor [BKKM04] in the context of computational biology. It asks to design algorithms that recover a string $\mathbf{x}\in\{0,1\}^{n}$ given access to traces $\tilde{\mathbf{x}}$ of $\mathbf{x}$ , obtained by deleting each bit independently with some given probability $p\in[0,1).$ The best current upper and lower bounds are exponentially apart, namely $\exp(\widetilde{O}(n^{1/5}))$ traces are sufficient for reconstruction [Cha21b] (improving upon the $\exp(O(n^{1/3}))$ of [NP17, DOS19]) and ${\widetilde{\Omega}}(n^{3/2})$ [HL20, Cha21a] are necessary.

The problem has been studied in several variants [BKKM04, KM05, VS08, HMPW08, MPV14, PZ17, NP17, DOS19, GM17, HPP18, HL20, HHP18, GM19, CGMR20, KMMP21, BLS20, CDL ${}^{+}$ 21b, Cha21b, CP21, NR21, SB21, GSZ22, Rub23] and it continues to elicit interest due to its deceptively simple formulation, as well as its motivating applications to DNA computing [YGM17].

In this paper, we focus on the worst-case formulation of the problem, which is equivalent from an information-theoretic point of view to the distinguishing variant. In this variant, the goal is to distinguish whether the received traces come from string $\mathbf{x}\in\{0,1\}^{n}$ or from $\mathbf{y}\in\{0,1\}^{n}$ , for some known $\mathbf{x}\neq\mathbf{y}.$

Algorithms based on $k$ -bit statistics

A very natural class of algorithms [HMPW08, NP17, DOS19] operates using the mean of the received traces at each location $i\in[n]$ (one may assume that traces of length smaller than $n$ are padded with $0$’s at the end). Indeed, let $\mathcal{D}_{\mathbf{x}}$ be the distribution of the traces induced by the deletion channel on input $\mathbf{x}$. A mean-based (or $1$-bit-statistics-based) algorithm first estimates from the received traces the mean vector $\mathbf{E}(\mathbf{x})=\left(E_{0}(\mathbf{x}),\cdots,E_{n-1}(\mathbf{x})\right)\in[0,1]^{n}$, where the $j$-th coordinate is defined as

\displaystyle E_{j}(\mathbf{x})=\underset{\tilde{\mathbf{x}}\sim\mathcal{D}_{\mathbf{x}}}{\mathbb{E}}\left[\widetilde{x}_{j}\right].

It then may perform further post-processing without further inspection of the traces.
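The estimation step of a mean-based algorithm can be illustrated with a small simulation. The following is only a sketch: the function names `sample_trace` and `estimate_mean_trace`, the fixed seed, and the representation of strings as Python lists of bits are assumptions made for illustration and are not part of the cited works.

```python
import random

def sample_trace(x, p, rng):
    """Pass the bit list x through a deletion channel: each bit is deleted
    independently with probability p, and the surviving bits keep their order."""
    return [bit for bit in x if rng.random() >= p]

def estimate_mean_trace(x, p, num_traces, seed=0):
    """Empirical estimate of the mean vector (E_0(x), ..., E_{n-1}(x)).

    Traces shorter than n are implicitly padded with 0's at the end, so
    positions beyond the end of a trace contribute nothing to the totals."""
    rng = random.Random(seed)
    n = len(x)
    totals = [0] * n
    for _ in range(num_traces):
        trace = sample_trace(x, p, rng)
        for j, bit in enumerate(trace):
            totals[j] += bit
    return [t / num_traces for t in totals]
```

After this step a mean-based algorithm works only with the returned vector; the traces themselves are discarded.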

Solving the distinguishing problem then reduces by standard arguments to understanding the $\ell_{1}$-distance between the mean traces of $\mathbf{x}$ and $\mathbf{y}$; namely, the number $T$ of traces satisfies

\Omega\left(1/\mathinner{\!\left\lVert\mathbf{E}(\mathbf{x})-\mathbf{E}(\mathbf{y})\right\rVert}_{\ell_{1}}\right)\leq T\leq O\left(1/\mathinner{\!\left\lVert\mathbf{E}(\mathbf{x})-\mathbf{E}(\mathbf{y})\right\rVert}_{\ell_{1}}^{2}\right).

[NP17, DOS19] related the $\ell_{1}$-norm above to the supremum of a certain real univariate polynomial over the complex plane. Using techniques from complex analysis, they proved that the mean-based algorithm that uses $\exp(O(n^{1/3}))$ traces and outputs the string $\mathbf{s}\in\{\mathbf{x},\mathbf{y}\}$ whose $\mathbf{E}(\mathbf{s})$ is closer in $\ell_{1}$-distance to the estimate is a successful reconstruction algorithm. Furthermore, any mean-based algorithm needs $\exp(\Omega(n^{1/3}))$ traces to succeed with high probability [NP17, DOS19].

A general class of algorithms may operate by using $k$ -bit statistics [Cha21b], for $k\geq 1$ . Specifically, for $w\in\{0,1\}^{k}$ , the algorithm estimates from the given traces, for tuples $0\leq i_{0}<i_{1}<\dots<i_{k-1}\leq n-1$ , the quantity

\displaystyle\underset{\tilde{\mathbf{x}}\sim\mathcal{D}_{\mathbf{x}}}{\mathbb{E}}\left[\prod_{{0\leq j<k}}\mathbf{1}\Set{\widetilde{x}_{i_{j}}=w_{j}}\right].

After the estimation step, whose accuracy can be argued via standard Chernoff bounds, the algorithm does not need the traces anymore and may perform further post-processing in order to output the correct string. The result of Chase follows from showing that for $k=2n^{1/5}$ there is a string $w\in\{0,1\}^{k}$ for which the $\ell_{1}$-distance between the corresponding $k$-bit statistics of $\mathbf{x}$ and $\mathbf{y}$ is large.

Algorithms based on $k$ -mer statistics

Another variant proposed by Mazooji and Shomorony [MS22] considers algorithms which operate using estimates of statistics regarding occurrences of contiguous $k$-bit strings (a.k.a. $k$-mers) in the initial string $\mathbf{x}$. We denote by $\mathbf{1}\Set{\mathbf{x}[j\colon j+k-1]=w}$ the indicator bit of whether $w\in\{0,1\}^{k}$ occurs as a subword of $\mathbf{x}$ starting at position $j$.

In particular, [MS22] made the following definition which is central to our paper.

Definition 1 ([MS22]).

Given a string $\mathbf{x}\in\set{0,1}^{n}$ and a $k$-mer $w\in\set{0,1}^{k}$, for $i=0,1,\dots,n-1$ denote

\displaystyle K_{w,\mathbf{x}}[i]\coloneqq\sum_{j=0}^{n-k}\binom{j}{i}p^{j-i}(1-p)^{i}\cdot\mathbf{1}\Set{\mathbf{x}[j\colon j+k-1]=w}.

The vector $K_{\mathbf{x}}\coloneqq\left(K_{w,\mathbf{x}}[i]\colon w\in\set{0,1}^{k},i\in[n]\right)$ is called the $k$-mer density map of $\mathbf{x}$.
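To make Definition 1 concrete, the following sketch computes the $k$-mer density map directly from the defining sum. The function name `kmer_density_map`, the use of Python strings over the alphabet 0/1, and the choice to store only the $k$-mers that actually occur in $\mathbf{x}$ (all other rows are identically zero) are assumptions made for illustration.

```python
from math import comb

def kmer_density_map(x, k, p):
    """Return a dict mapping each k-mer w that occurs in x to the list
    [K_{w,x}[0], ..., K_{w,x}[n-1]], following Definition 1.

    x is a string over {'0', '1'} and p is the deletion probability.
    k-mers that never occur in x have an all-zero row and are omitted."""
    n = len(x)
    K = {}
    for j in range(n - k + 1):
        w = x[j:j + k]                      # the k-mer starting at position j
        row = K.setdefault(w, [0.0] * n)
        for i in range(j + 1):              # binom(j, i) vanishes for i > j
            row[i] += comb(j, i) * p ** (j - i) * (1 - p) ** i
    return K
```

Since $\sum_{i}\binom{j}{i}p^{j-i}(1-p)^{i}=1$ for every $j$, each row of the map sums to the number of occurrences of the corresponding $k$-mer in $\mathbf{x}$, which gives a quick sanity check on any implementation.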

Note that the mean vector $\mathbf{E}(\mathbf{x})$ is, up to a factor of $1-p$ , equivalent to the $1$ -mer density map. Indeed, for $k=1$ and $w=1$ we have

\begin{aligned}
E_{i}(\mathbf{x})&=\underset{\tilde{\mathbf{x}}\sim\mathcal{D}_{\mathbf{x}}}{\mathbb{E}}\left[\widetilde{x}_{i}\right]=\sum_{j=0}^{n-1}\Pr\left[\widetilde{x}_{i}\textup{ comes from }x_{j}\right]\cdot x_{j}\\
&=\sum_{j=0}^{n-1}\binom{j}{i}p^{j-i}(1-p)^{i+1}\cdot x_{j}=(1-p)\cdot\sum_{j=0}^{n-1}\binom{j}{i}p^{j-i}(1-p)^{i}\cdot\mathbf{1}\Set{\mathbf{x}[j\mathrel{\mathop{:}}j]=1}=(1-p)K_{1,\mathbf{x}}[i].
\end{aligned}
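The identity $E_{i}(\mathbf{x})=(1-p)K_{1,\mathbf{x}}[i]$ can be checked numerically on short strings by enumerating all deletion patterns. This is a brute-force sketch for illustration only; the helper names `exact_mean_trace` and `one_mer_density` are hypothetical, and the enumeration takes time $2^{n}$, so it is only usable for small $n$.

```python
from itertools import product
from math import comb

def exact_mean_trace(x, p):
    """Exact E_i(x), obtained by summing over all 2^n keep/delete patterns.

    x is a list of bits; traces are padded with 0's, so a position past the
    end of a trace contributes 0."""
    n = len(x)
    mean = [0.0] * n
    for keep in product([False, True], repeat=n):
        prob = 1.0
        for kept in keep:
            prob *= (1 - p) if kept else p
        trace = [bit for bit, kept in zip(x, keep) if kept]
        for i, bit in enumerate(trace):
            mean[i] += prob * bit
    return mean

def one_mer_density(x, p):
    """K_{1,x}[i] = sum_j binom(j, i) p^(j-i) (1-p)^i * 1{x_j = 1}."""
    n = len(x)
    return [
        sum(comb(j, i) * p ** (j - i) * (1 - p) ** i * x[j] for j in range(i, n))
        for i in range(n)
    ]
```

Comparing `exact_mean_trace(x, p)` entry by entry against `(1 - p)` times `one_mer_density(x, p)` should agree up to floating-point error.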

As noted in [MS22], the techniques of [CDL ${}^{+}$ 21b] in the smoothed complexity model of trace reconstruction can also be viewed as based on $k$-mer density maps. Indeed, for a fixed $w\in\{0,1\}^{k}$, the number of its occurrences as a subword in $\mathbf{x}$ is $\sum_{j=0}^{n-1}\mathbf{1}\Set{\mathbf{x}[j\colon j+k-1]=w}=\sum_{i=0}^{n-1}K_{w,\mathbf{x}}[i]$. They show that for $k=O(\log n)$, the subword vector (indexed by $w\in\{0,1\}^{k}$) uniquely determines the source string, with high probability [CDL ${}^{+}$ 21b, Lemma 1.1].

The main result of [MS22] is that given access to $T=\varepsilon^{-2}\cdot 2^{O(k)}\mathrm{poly}(n)$ traces of $\mathbf{x}$, one can compute an estimate $\hat{K}_{\mathbf{x}}$ of the $k$-mer density map $K_{\mathbf{x}}$ which is entry-wise $\varepsilon$-accurate, i.e., $\mathinner{\!\left\lVert\hat{K}_{\mathbf{x}}-K_{\mathbf{x}}\right\rVert}_{\ell_{\infty}}\leq\varepsilon$. We remark that by replacing $\varepsilon$ with $\varepsilon/(2^{k}n)$, one gets an estimate which is $\varepsilon$-accurate in $\ell_{1}$-norm, while using asymptotically the same number of traces.
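The passage from entry-wise accuracy to $\ell_{1}$-accuracy can be spelled out as follows. This is a short worked derivation using only the fact that $K_{\mathbf{x}}$ has $2^{k}n$ entries; the notation $\hat{K}_{w,\mathbf{x}}[i]$ for the entries of the estimate is introduced here for illustration.

```latex
\begin{aligned}
\lVert\hat{K}_{\mathbf{x}}-K_{\mathbf{x}}\rVert_{\ell_{1}}
  &=\sum_{w\in\{0,1\}^{k}}\sum_{i=0}^{n-1}\lvert\hat{K}_{w,\mathbf{x}}[i]-K_{w,\mathbf{x}}[i]\rvert
   \leq 2^{k}n\cdot\lVert\hat{K}_{\mathbf{x}}-K_{\mathbf{x}}\rVert_{\ell_{\infty}}
   \leq 2^{k}n\cdot\frac{\varepsilon}{2^{k}n}=\varepsilon,\\
T&=\left(\frac{2^{k}n}{\varepsilon}\right)^{2}\cdot 2^{O(k)}\,\mathrm{poly}(n)
   =\varepsilon^{-2}\cdot 2^{O(k)}\,\mathrm{poly}(n).
\end{aligned}
```

The extra factor $(2^{k}n)^{2}$ is absorbed into $2^{O(k)}\mathrm{poly}(n)$, which is why the number of traces is asymptotically unchanged.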

We make the following definition generalizing mean-based algorithms ([DOS19, NP17]).

Definition 2.

(Algorithms based on $k$ -mer statistics) A trace reconstruction algorithm based on $k$ -mer statistics works in two steps as follows:

1. Once the unknown source string $\mathbf{x}\in\{0,1\}^{n}$ is picked, it chooses an accuracy parameter $\varepsilon\in(0,1]$. It then receives an $\varepsilon$-accurate estimate (in $\ell_{1}$-norm) of the $k$-mer density map $K_{\mathbf{x}}$ based on the traces. From here on the algorithm has no more access to the traces themselves. We define the cost of this part to be $2^{k}/\varepsilon$.

2. The algorithm may perform further post-processing and finish by outputting the source string.

Since there is an algorithm to $\varepsilon$ -estimate the $k$ -mer density map with $\varepsilon^{-2}\cdot 2^{O(k)}\mathrm{poly}(n)$ many traces [MS22], it follows that an algorithm defined as in Definition 2 with cost $T$ can be turned into a trace reconstruction algorithm with $\mathrm{poly}(T)$ samples.

We note that the $k$ -mer density map estimators of [MS22] only use $k$ -bit statistics of the traces, in fact statistics about contiguous $k$ bits in the traces, and hence $k$ -mer-based algorithms are a subclass of algorithms based on $k$ -bit statistics.

In this work, we first observe that the upper bounds of Chase [Cha21b] can in fact be obtained via $k$-mer-based algorithms (see the formal statement in Theorem 1), and hence by only using statistics of contiguous subwords of the traces. Our main result says that $k$-mer-based algorithms require $\exp(\Omega(n^{1/5}\sqrt{\log n}))$ many traces (see Theorem 2). In addition, the analysis of this result implies that the proof technique in Chase [Cha21b] cannot lead to a better analysis of the sample complexity (up to $\log^{4.5}n$ factors in the exponent), and hence new techniques are needed to significantly improve the current upper bound.

The Maximum Likelihood Estimator

In model estimation settings, a common tool for picking a “model” that best explains the observed data is the Maximum Likelihood Estimator (MLE). In the setting of trace reconstruction, it is natural to ask: What is the most likely trace distribution $\mathcal{D}_{\mathbf{x}}$ (and hence $\mathbf{x}$ ) to have produced the given sample/trace(s)? We formalize MLE next.

Definition 3 (Maximum Likelihood Estimation).

Let $\mathcal{D}=\set{D_{1},D_{2},\dots,D_{m}}$ be a finite set of probability distributions over a common domain $\Omega$ . Given a sample $x\in\Omega$ , the output of the Maximum Likelihood Estimation is (ties are broken arbitrarily)

\displaystyle\mathrm{MLE}(x;\mathcal{D})\coloneqq\arg\max_{i\in[m]}D_{i}(x).

For independent and identically distributed samples $x_{1},x_{2},\ldots,x_{k}\in\Omega$, the output of the Maximum Likelihood Estimation is (ties are broken arbitrarily)

\displaystyle\mathrm{MLE}(x_{1},x_{2},\ldots,x_{k};\mathcal{D})\coloneqq\arg\max_{i\in[m]}\prod_{j\in[k]}D_{i}(x_{j}).
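Definition 3 translates almost directly into code. The sketch below is illustrative only: the function name `mle`, the representation of each distribution as a dictionary from outcomes to probabilities, the 0-based indexing, and the tie-breaking rule (smallest index wins) are all assumptions. It maximizes the log-likelihood, which has the same maximizer as the product in the definition and avoids numerical underflow for many samples.

```python
import math

def mle(samples, distributions):
    """Return the (0-based) index i maximizing prod_j D_i(x_j).

    Each distribution is a dict mapping outcomes to probabilities; outcomes
    missing from the dict have probability 0. Ties go to the smallest index."""
    best_index, best_loglik = None, -math.inf
    for i, D in enumerate(distributions):
        loglik = 0.0
        for x in samples:
            prob = D.get(x, 0.0)
            if prob == 0.0:
                loglik = -math.inf   # one impossible sample rules out D_i
                break
            loglik += math.log(prob)
        if best_index is None or loglik > best_loglik:
            best_index, best_loglik = i, loglik
    return best_index
```

For trace reconstruction the candidate set would contain one trace distribution per string in $\{0,1\}^{n}$, so this search is exponential in $n$.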

We present a simple proof that this algorithm (which takes exponential time, as it searches through all $\mathbf{x}\in\{0,1\}^{n}$ ) is in fact optimal in the number of traces used, up to an $O(n)$ factor blowup.

We also observe that in the average-case setting, where the source string is a uniformly random string from $\{0,1\}^{n}$, $\mathrm{MLE}$ is indeed optimal, without the $O(n)$ factor blowup (see Remark 2).

1.1 Our Contributions

The power of $k$ -mer-based algorithms

Our first result shows that algorithms based on $k$ -mer statistics can reconstruct a source string using $\exp(\widetilde{O}(n^{1/5}))$ many traces. This follows from the following theorem.

Theorem 1 (Implied by [Cha21b]).

Let $\mathbf{x},\mathbf{y}\in\set{0,1}^{n}$ be two arbitrary distinct strings, and let $K_{\mathbf{x}},K_{\mathbf{y}}$ be their $k$ -mer density maps, respectively. Assuming $k=2n^{1/5}$ , it holds that

\displaystyle\mathinner{\!\left\lVert K_{\mathbf{x}}-K_{\mathbf{y}}\right\rVert}_{\ell_{1}}\geq\exp\left(-O(n^{1/5}\log^{5}n)\right).

Based on Theorem 1, the algorithm estimates $\hat{K}$ within an accuracy of $\varepsilon=\exp(-O(n^{1/5}\log^{5}n))$ and outputs the $\mathbf{x}$ that minimizes $\mathinner{\!\left\lVert{\hat{K}}-K_{\mathbf{x}}\right\rVert}_{\ell_{1}}.$ The cost of this $k$ -mer-based algorithm is $\exp(O(n^{1/5}\log^{5}n))$ .

Our main result regarding $k$ -mer-based algorithms is the following theorem which shows the tightness of the bound in Theorem 1.

Theorem 2.

Fix any $k\leq n^{1/5}$ . Suppose $K_{\mathbf{x}}$ stands for the $k$ -mer density map of $\mathbf{x}$ . There exist distinct strings $\mathbf{x},\mathbf{y}\in\set{0,1}^{n}$ such that

\displaystyle\mathinner{\!\left\lVert K_{\mathbf{x}}-K_{\mathbf{y}}\right\rVert}_{\ell_{1}}\leq\exp\left(-\Omega(n^{1/5}\sqrt{\log n})\right).

Hence, Theorem 2 implies that the cost of any $k$ -mer-based algorithm for worst-case trace reconstruction is $\exp(\Omega(n^{1/5}\sqrt{\log n}))$ .

Remark 1.

As one might expect, for $k^{\prime}<k$ the $k^{\prime}$ -mers usually contain less information than $k$ -mers. To see this, observe that for a $(k-1)$ -mer $w$ , we have the following relation

\displaystyle\mathbf{1}\Set{\mathbf{x}[j\mathrel{\mathop{:}}j+k-2]=w}=\mathbf{1}\Set{\mathbf{x}[j-1\mathrel{\mathop{:}}j+k-2]=0w}+\mathbf{1}\Set{\mathbf{x}[j-1\mathrel{\mathop{:}}j+k-2]=1w},

provided that $j>0$. The same also holds for $\mathbf{y}$. In fact, the strings $\mathbf{x}$ and $\mathbf{y}$ obtained via Theorem 2 share a common prefix of length at least $k$ (or one could prepend a prefix anyway), so $\mathbf{x}[0\mathrel{\mathop{:}}k^{\prime}-1]=\mathbf{y}[0\mathrel{\mathop{:}}k^{\prime}-1]$ for any $k^{\prime}<k$, and one does not need to worry about the case $j=0$. Plugging into the definition of $k$-mer density maps, we have

\displaystyle K_{w,\mathbf{x}}[i]-K_{w,\mathbf{y}}[i]=\left(K_{0w,\mathbf{x}}[i]-K_{0w,\mathbf{y}}[i]\right)+\left(K_{1w,\mathbf{x}}[i]-K_{1w,\mathbf{y}}[i]\right).

By induction, for any $k^{\prime}<k$ we have

\displaystyle\sum_{w\in\set{0,1}^{k^{\prime}}}\mathinner{\!\left\lvert K_{w,\mathbf{x}}[i]-K_{w,\mathbf{y}}[i]\right\rvert}\leq\sum_{w\in\set{0,1}^{k^{\prime}}}\sum_{u\in\set{0,1}^{k-k^{\prime}}}\mathinner{\!\left\lvert K_{uw,\mathbf{x}}[i]-K_{uw,\mathbf{y}}[i]\right\rvert}=\mathinner{\!\left\lVert K_{\mathbf{x}}-K_{\mathbf{y}}\right\rVert}_{\ell_{1}}.

Therefore, the bound in Theorem 2 indeed covers all $k^{\prime}$ -mers for $k^{\prime}\leq k$ .

We remark that the proof of Theorem 2 further implies that the analysis technique of [Cha21b] is essentially tight, in the sense that no better upper bound (up to $\log^{4.5}n$ factors in the exponent) can be obtained via his analysis. We include further details about this implication in Remark 3.

Maximum Likelihood Estimator: an optimal algorithm

We next turn to analyzing the performance of the MLE algorithm in the setting of trace reconstruction. Our main result essentially shows that if there is an algorithm for trace reconstruction that uses $T$ traces and succeeds with probability $3/4$ then the MLE algorithm using $O(nT)$ traces succeeds with probability $3/4.$ Hence, given that the current upper bounds for the worst-case reconstruction problem are exponential in $n$ , we may view the MLE as an optimal algorithm for trace reconstruction.

Theorem 3.

Suppose $\mathcal{D}=\set{D_{0},D_{1},\dots,D_{m}}$ is such that $d_{\mathrm{TV}}\left(D_{0},D_{i}\right)\geq 1-\varepsilon$ for any $1\leq i\leq m$ . Then we have

\displaystyle\Pr_{x\sim D_{0}}\left[\mathrm{MLE}(x;\mathcal{D})=0\right]\geq 1-m\varepsilon.

We remark that the loss of a factor of $m$ in Theorem 3 is generally inevitable. Here is a simple example: let $D_{0}$ be the uniform distribution over $[m]$ , and for $i=1,2,\dots,m$ , let $D_{i}$ be the point distribution supported on $\set{i}$ . We have $d_{\mathrm{TV}}\left(D_{0},D_{i}\right)=((m-1)/m+(1-1/m))/2=1-1/m$ . However, $\Pr_{x\sim D_{0}}\left[\mathrm{MLE}(x;\mathcal{D})=0\right]=0$ .
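The example above is small enough to verify by direct computation. The following sketch uses exact rational arithmetic; the helper names `tv_distance`, `single_sample_mle`, and `build_example` are hypothetical, and ties in the likelihood are broken toward the smallest index (no ties actually arise here when $m\geq 2$).

```python
from fractions import Fraction

def tv_distance(P, Q):
    """Total variation distance between two finitely supported distributions."""
    support = set(P) | set(Q)
    return sum(abs(P.get(s, 0) - Q.get(s, 0)) for s in support) / 2

def single_sample_mle(x, distributions):
    """Index of the distribution assigning the largest probability to x."""
    return max(range(len(distributions)), key=lambda i: distributions[i].get(x, 0))

def build_example(m):
    """D_0 is uniform over {1, ..., m}; D_i is the point mass on i."""
    D0 = {i: Fraction(1, m) for i in range(1, m + 1)}
    points = [{i: Fraction(1)} for i in range(1, m + 1)]
    return [D0] + points
```

For every outcome $i$ in the support of $D_{0}$, the point mass $D_{i}$ assigns probability $1>1/m$ to $i$, so `single_sample_mle` never returns index 0 on a sample drawn from $D_{0}$, even though every $d_{\mathrm{TV}}(D_{0},D_{i})$ equals $1-1/m$.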

For a string $\mathbf{x}\in\set{0,1}^{n}$, let $D_{\mathbf{x}}$ denote the trace distribution of $\mathbf{x}$. Theorem 3 implies the following corollary, which shows that in some sense the Maximum Likelihood Estimation is a universal algorithm for trace reconstruction.

Corollary 1.1.

Suppose $T$ traces are sufficient for worst-case trace reconstruction with a success rate $3/4$ . Then for any $\varepsilon>0$ , Maximum Likelihood Estimation with $8\ln(1/\varepsilon)\cdot nT$ traces solves worst-case trace reconstruction with success rate $1-\varepsilon$ .

Corollary 1.1 incurs an $O(n)$-factor blowup in the sample complexity. While we currently do not know whether this blowup is necessary for trace reconstruction, the next result shows that it is inevitable for the more general “model estimation” problem.

Theorem 4.

For any integer $n\geq 1$ , there is a set of distributions $\mathcal{D}=\set{D_{0},D_{1},D_{2},\dots,D_{m}}$ over a common domain $\Omega$ of size $\mathinner{\!\left\lvert\Omega\right\rvert}=m+n$ , where $m=\binom{n}{\left\lfloor{n/4}\right\rfloor}=2^{\Theta(n)}$ , satisfying the following conditions.

There is a distinguisher $A$ which, given one sample $x\sim D_{j}$ for an unknown $j\in\set{0,1,\dots,m}$, recovers $j$ with probability at least $2/3$. In other words, for all $j=0,1,\dots,m$,

\displaystyle\Pr_{x\sim D_{j}}[A(x)=j]\geq 2/3.

$\mathrm{MLE}$ fails to distinguish $D_{0}$ from other distributions with probability 1, even with $T=n/4$ samples. In other words,

\displaystyle\Pr_{x_{1},\dots,x_{T}\sim D_{0}}[\mathrm{MLE}(x_{1},\dots,x_{T};\mathcal{D})=0]=0.

Remark 2.

Finally, we remark that in the average-case setting $\mathrm{MLE}$ is indeed optimal (with no $O(n)$ factor blowup in the number of traces). This is because maximizing the likelihood is equivalent to maximizing the posterior probability under the uniform prior distribution (which is optimal), as can be seen via the Bayes rule

\begin{aligned}
\mathcal{D}_{\mathbf{x}}(\widetilde{x}_{1},\dots,\widetilde{x}_{T})&=p(\mathbf{x}\mid\widetilde{x}_{1},\dots,\widetilde{x}_{T})\cdot\frac{\sum_{\mathbf{x}^{\prime}\in\set{0,1}^{n}}p(\mathbf{x}^{\prime})\cdot\mathcal{D}_{\mathbf{x}^{\prime}}(\widetilde{x}_{1},\dots,\widetilde{x}_{T})}{p(\mathbf{x})}\\
&=p(\mathbf{x}\mid\widetilde{x}_{1},\dots,\widetilde{x}_{T})\cdot\sum_{\mathbf{x}^{\prime}\in\set{0,1}^{n}}\mathcal{D}_{\mathbf{x}^{\prime}}(\widetilde{x}_{1},\dots,\widetilde{x}_{T})\\
&=p(\mathbf{x}\mid\widetilde{x}_{1},\dots,\widetilde{x}_{T})\cdot f(\widetilde{x}_{1},\dots,\widetilde{x}_{T}).
\end{aligned}

Therefore maximizing both sides with respect to $\mathbf{x}$ yields the same result.

1.2 Overview of the techniques

Lower bounds for $k$ -mer-based algorithms

In recent developments on the trace reconstruction problem, the connection to various real and complex polynomials has been a recurring and intriguing theme [HMPW08, NP17, PZ17, HPP18, DOS19, CDL ${}^{+}$ 21b, CDRV21, Cha21b, SB21, GSZ22, Rub23]. The starting point of these techniques is to design a set of statistics that can be easily estimated from the traces (e.g., mean traces), with the property that for different source strings the corresponding statistics are somewhat “far apart”. To establish this property, one key idea is to associate each source string $\mathbf{x}$ with a generating polynomial $P_{\mathbf{x}}$ whose coefficients are exactly the statistics of $\mathbf{x}$. Due to the structure of the deletion channel, in many cases, this generating polynomial (under a change of variables) is identical to another polynomial $Q_{\mathbf{x}}$ that is much easier to get a handle on. For example, the coefficients of $Q_{\mathbf{x}}$ are usually 0/1, and they are easily determined from $\mathbf{x}$. To show that the statistics corresponding to $\mathbf{x}$ and $\mathbf{y}$ are far apart (say, in $\ell_{1}$-distance), it is sufficient to show that $\mathinner{\!\left\lvert Q_{\mathbf{x}}(w)-Q_{\mathbf{y}}(w)\right\rvert}$ is large for an appropriate choice of $w$. This is the point where all sorts of analytical tools are ready to shine. For instance, the main technical result in [Cha21b] is a complex analytical result that says that a certain family of polynomials cannot be uniformly small over a sub-arc of the complex unit circle, which has applications beyond the trace reconstruction problem.

This analytical view of trace reconstruction can lead to a tight analysis of certain algorithms/statistics. The best example would be mean-based algorithms, for which a tight bound of $\exp(\Theta(n^{1/3}))$ traces is known to be sufficient and necessary for worst-case trace reconstruction [NP17, DOS19]. The tightness of the sample complexity is exactly due to the tightness of a complex analytical result by Borwein and Erdélyi [BE97]. Our lower bound for $k$-mer-based algorithms is obtained in a similar fashion, via establishing a complex analytical result complementary to that of [Cha21b] (see Lemma 3.1).

On the other hand, our argument takes a different approach than that of [BE97]. At a high level, both results use a Pigeonhole argument to show the existence of two univariate polynomials which are uniformly close over a sub-arc $\Gamma$ of the complex unit circle. The difference lies in the objects playing the role of “pigeons”. [BE97]’s argument can be viewed as two steps: (1) apply the Pigeonhole Principle to obtain two polynomials that have close evaluations over a discrete set of points in $\Gamma$ , and (2) use a continuity argument to extend the closeness to the entire sub-arc. Here the roles of pigeons and holes are played by evaluation vectors, and Cartesian products of small intervals. Our approach considers the coordinates of a related polynomial in the Chebyshev basis, which play the roles of pigeons in place of the evaluation vector. The properties of Chebyshev polynomials allow us to get rid of the continuity argument. Instead, we complete the proof by leveraging rather standard tools from complex analysis (e.g., Theorem 5 and Theorem 6). We believe this approach has the advantage of being generalizable to multivariate polynomials over the product of sub-arcs $\Gamma=\Gamma_{1}\times\dots\times\Gamma_{m}$ via multivariate Chebyshev series (see, e.g., [Mas80, Tre17]), whereas the same generalization seems to be tricky for the continuity argument.

Finally, the counting argument considers a special set of strings for which effectively only one $k$ -mer contains meaningful information about the initial string. Since previous arguments did not exploit structural properties of the strings, this is another technical novelty of our proof.

Maximum Likelihood Estimation

Most of our results regarding Maximum Likelihood Estimation hold under the more general “model estimation” setting, where one is given a sample $x$ drawn from an unknown distribution $D\in\mathcal{D}$ and tries to recover $D$. Our main observation is that if a distinguisher for $\mathcal{D}$ works in the worst case, then the distributions in $\mathcal{D}$ have large pairwise statistical distances. The maximization characterization of statistical distance, in conjunction with a union bound, implies that for a sample $x\sim D$ its likelihood is maximized by $D$ except with a small probability. The $O(n)$ factor loss in the sample complexity is essentially due to the union bound, and we show that this loss is tight in general by constructing a set of distributions which attains equality in the union bound.

1.3 Related work

The trace reconstruction problem was first introduced and studied by Levenshtein [Lev01b][Lev01a]. The original question is the following: if a message is sent multiple times through the same channel with random insertion/deletion errors, how can one recover the message? [BKKM04] and [HMPW08] formalized the problem in its current version, in which the channel only has random deletions. Their central motivation comes from computational biology, i.e., how to reconstruct a whole DNA sequence from multiple related subsequences. [CGMR20] and [BLS20] further extended the study to the “coded” version, in which the string to reconstruct is not arbitrary but is a codeword from a code. A variant setting where the channel has memoryless replication insertions was studied by [CDRV21].

The average case version was studied in [HMPW08, PZ17, MPV14, HPP18]. For this case, the best known lower bound on the number of traces is $\widetilde{\Omega}(\log^{5/2}n)$ [HL20, Cha21a]. Building on Chase’s upper bound for the worst case, [Rub23] improved the sample complexity upper bound to $\exp(\widetilde{O}(\log^{1/5}n))$ in the average-case model.

[CDL ${}^{+}$ 21b] studied another variant of the problem, called the smoothed variant. It is an intermediate model between the worst-case and the average-case models, where the initial string is an arbitrary string perturbed by replacing each coordinate by a uniformly random bit with some constant probability in $[0,1]$. [CDL ${}^{+}$ 21b] provided an efficient reconstruction algorithm for this case. Other variants studied include trace reconstruction from the multiset of substrings [GM17, GM19], population recovery variants [BCF ${}^{+}$ 19], matrix reconstruction and parameterized algorithms [KMMP21], circular trace reconstruction [NR21], reconstruction from $k$-decks [KR97, Sco97, DS03, MPV14], and coded trace reconstruction [CGMR20, BLS20].

[DRSR21] studied approximate trace reconstruction and showed efficient algorithms. [CDL ${}^{+}$ 21a], [CDK21], and [CP21] further proved that if the source is a random string, then an approximate solution can be found with high probability using very few traces. Notice that approximate reconstructions imply distinguishers for pairs of strings with large edit distances. [MPV14, SB21, GSZ22] study the complexity of the problem parameterized by the Hamming/edit distance between the strings. [GSZ22] also shows that the problem of exhibiting explicit strings that are hard to distinguish for mean-based algorithms is equivalent to the Prouhet-Tarry-Escott problem, a difficult problem in number theory.

1.4 Organization

In Section 2 we prove Theorem 1, in Section 3 we prove our main result Theorem 2, and in Section 4 we prove Theorem 3.

2 $k$ -mer-based algorithms: the upper bound

We prove Theorem 1 in this section.

Let us start with a definition that is essential for the study of $k$ -mer-based algorithms.

Definition 4 ( $k$ -mer generating polynomial).

Let $\mathbf{x}\in\set{0,1}^{n}$ and $w\in\set{0,1}^{k}$ . The $k$ -mer generating polynomial $P_{w,\mathbf{x}}$ for string $\mathbf{x}$ and $k$ -mer $w$ is the following degree- $(n-1)$ polynomial in $z$ :

\displaystyle P_{w,\mathbf{x}}(z)\coloneqq\sum_{\ell=0}^{n-1}K_{w,\mathbf{x}}[\ell]\cdot z^{\ell}.

We have the following identity

\begin{aligned}
P_{w,\mathbf{x}}(z)&=\sum_{\ell=0}^{n-1}K_{w,\mathbf{x}}[\ell]\cdot z^{\ell}\\
&=\sum_{\ell=0}^{n-1}\left(\sum_{j=0}^{n-k}\binom{j}{\ell}(1-p)^{\ell}p^{j-\ell}\cdot\mathbf{1}\Set{\mathbf{x}[j\colon j+k-1]=w}\right)z^{\ell}\\
&=\sum_{j=0}^{n-k}\mathbf{1}\Set{\mathbf{x}[j\colon j+k-1]=w}\cdot\left(p+(1-p)z\right)^{j}.
\end{aligned}

The expression on the last line, under a change of variable $z_{0}=p+(1-p)z$ , is exactly the polynomial studied in [Cha21b].
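The identity above, which follows from the binomial theorem applied to each $\left(p+(1-p)z\right)^{j}$, can be sanity-checked numerically. The sketch below is for illustration only; the helper names `kmer_row`, `poly_from_coefficients`, and `poly_closed_form` are hypothetical, and strings are represented as Python strings over 0/1.

```python
from math import comb

def kmer_row(x, w, p):
    """[K_{w,x}[0], ..., K_{w,x}[n-1]] for a single k-mer w (Definition 1)."""
    n, k = len(x), len(w)
    row = [0.0] * n
    for j in range(n - k + 1):
        if x[j:j + k] == w:
            for i in range(j + 1):
                row[i] += comb(j, i) * p ** (j - i) * (1 - p) ** i
    return row

def poly_from_coefficients(x, w, p, z):
    """P_{w,x}(z) evaluated from its coefficients K_{w,x}[l]."""
    return sum(c * z ** l for l, c in enumerate(kmer_row(x, w, p)))

def poly_closed_form(x, w, p, z):
    """P_{w,x}(z) evaluated as the sum of (p + (1-p) z)^j over occurrences j of w."""
    k = len(w)
    return sum((p + (1 - p) * z) ** j
               for j in range(len(x) - k + 1) if x[j:j + k] == w)
```

Evaluating either form at $z=(z_{0}-p)/(1-p)$ recovers $\sum_{j}\mathbf{1}\Set{\mathbf{x}[j\colon j+k-1]=w}\cdot z_{0}^{j}$, which is the change of variable mentioned above.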

Lemma 2.1.

[Cha21b, Proposition 6.3] For distinct $\mathbf{x},\mathbf{y}\in\set{0,1}^{n}$, if $x_{i}=y_{i}$ for all $0\leq i<2n^{1/5}-1$, then there are $w\in\set{0,1}^{2n^{1/5}}$ and $z_{0}\in\set{e^{i\theta}\colon\mathinner{\!\left\lvert\theta\right\rvert}\leq n^{-2/5}}$ such that

\displaystyle\mathinner{\!\left\lvert\sum_{j\geq 0}\left(\mathbf{1}\Set{\mathbf{x}[j\colon j+2n^{1/5}-1]=w}-\mathbf{1}\Set{\mathbf{y}[j\colon j+2n^{1/5}-1]=w}\right)z_{0}^{j}\right\rvert}\geq\exp\left(-Cn^{1/5}\log^{5}n\right).

Here $C>0$ is a constant depending only on the deletion probability $p$ .

We will use Lemma 2.1 to show that the $\exp(\widetilde{O}(n^{1/5}))$ upper bound of [Cha21b] can be achieved by $k$ -mer-based algorithms, rather than general algorithms based on $k$ -bit statistics. Our main lower bound on the number of traces implied by Theorem 2 will follow by showing an upper bound on the LHS in the lemma above (see Lemma 3.1).

Remark 3.

We remark that the result of Chase is obtained by first considering a corresponding multivariate channel polynomial that encodes in its coefficients the $k$ -bit statistics of the traces. The upper bound on the number of traces reduces to understanding the supremum of this polynomial over a certain region of the complex plane. The crucial element of the proof is the reduction to the existence of $w\in\{0,1\}^{k}$ and $z_{0}$ satisfying Lemma 2.1, by appropriately making the remaining variables take value $0$ . We noticed that the resulting univariate polynomial is essentially the $k$ -mer generating polynomial defined in Definition 4, with an extra factor of $(1-p)^{k}$ . Our result in Lemma 3.1 implies that no tighter lower bound (up to polylogarithmic factors in the exponent) is possible for this univariate polynomial, showing that the analysis technique used in [Cha21b] cannot give a better upper bound on worst-case trace complexity.

2.1 An upper bound for $k$ -mer based algorithms

The proof of Theorem 1 mainly uses Lemma 2.1. We will also make use of the following result.

Lemma 2.2.

[BEK99, Theorem 5.1] There are absolute constants $c_{1}>0$ and $c_{2}>0$ such that

\displaystyle\mathinner{\!\left\lvert f(0)\right\rvert}^{c_{1}/a}\leq\exp\left% (\frac{c_{2}}{a}\right)\sup_{t\in[1-a,1]}\mathinner{\!\left\lvert f(t)\right\rvert}

for every analytic function $f$ on the open unit disk that satisfies $\mathinner{\!\left\lvert f(z)\right\rvert}<(1-\mathinner{\!\left\lvert z\right\rvert})^{-1}$ for $\mathinner{\!\left\lvert z\right\rvert}<1$ , and every $a\in(0,1]$ .

Proof of Theorem 1.

The proof deals with two cases.

Case 1: $x_{i}=y_{i}$ for all $0\leq i<2n^{1/5}-1$ .

In this case, $\mathbf{x}$ and $\mathbf{y}$ satisfy the premise of Lemma 2.1. It follows that there exist $w\in\set{0,1}^{2n^{1/5}}$ , and $z_{0}=e^{i\theta}$ where $\mathinner{\!\left\lvert\theta\right\rvert}\leq n^{-2/5}$ , satisfying the bound

\displaystyle\mathinner{\!\left\lvert\sum_{j\geq 0}\left(\mathbf{1}\Set{% \mathbf{x}[j\colon j+2n^{1/5}-1]=w}-\mathbf{1}\Set{\mathbf{y}[j\colon j+2n^{1/% 5}-1]=w}\right)z_{0}^{j}\right\rvert}\geq\exp\left(-Cn^{1/5}\log^{5}n\right).

Here $C>0$ is a constant depending only on the deletion probability $p$ . Rewriting in terms of the $k$ -mer generating polynomials, we have

\displaystyle\mathinner{\!\left\lvert P_{w,\mathbf{x}}\left(\frac{z_{0}-p}{1-p% }\right)-P_{w,\mathbf{y}}\left(\frac{z_{0}-p}{1-p}\right)\right\rvert}\geq\exp% \left(-Cn^{1/5}\log^{5}n\right).

(1)

By the reverse triangle inequality and $\mathinner{\!\left\lvert z_{0}\right\rvert}=1$ , we have $\mathinner{\!\left\lvert z_{0}-p\right\rvert}/\mathinner{\!\left\lvert 1-p\right\rvert}\geq\mathinner{\!\left\lvert\mathinner{\!\left\lvert z_{0}\right\rvert}-p\right\rvert}/\mathinner{\!\left\lvert 1-p\right\rvert}=1$ . We also have the following upper bounds

	$\displaystyle\mathinner{\!\left\lvert\frac{z_{0}-p}{1-p}\right\rvert}^{2}$	$\displaystyle=\frac{(\cos\theta-p)^{2}+\sin^{2}\theta}{(1-p)^{2}}=\frac{1-2p% \cos\theta+p^{2}}{(1-p)^{2}}=1+\frac{2p(1-\cos\theta)}{(1-p)^{2}}$
		$\displaystyle=1+\frac{4p\sin^{2}\frac{\theta}{2}}{(1-p)^{2}}\leq 1+\frac{p% \theta^{2}}{(1-p)^{2}}\leq 1+\frac{p}{(1-p)^{2}}\cdot n^{-4/5},$
	$\displaystyle\mathinner{\!\left\lvert\frac{z_{0}-p}{1-p}\right\rvert}^{n}$	$\displaystyle\leq\left(1+\frac{p}{(1-p)^{2}}\cdot n^{-4/5}\right)^{n/2}\leq% \exp\left(\frac{p}{(1-p)^{2}}\cdot n^{-4/5}\cdot\frac{n}{2}\right)$
		$\displaystyle=\exp\left(\frac{p}{2(1-p)^{2}}\cdot n^{1/5}\right).$
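The two-sided estimate above can be sanity-checked numerically. The following is a minimal sketch, not part of the proof: the function names and the sampling grid are our own illustrative choices, and the check only covers the finitely many angles $\theta$ that are sampled.

```python
import cmath
import math


def shifted_point_norm(p: float, theta: float) -> float:
    """Return |(z0 - p) / (1 - p)| for z0 = exp(i * theta)."""
    z0 = cmath.exp(1j * theta)
    return abs((z0 - p) / (1 - p))


def growth_bound(p: float, n: int) -> float:
    """Return exp(p / (2 (1 - p)^2) * n^(1/5)), the claimed upper bound."""
    return math.exp(p / (2 * (1 - p) ** 2) * n ** 0.2)


def bounds_hold(p: float, n: int, samples: int = 201) -> bool:
    """Check 1 <= |(z0 - p)/(1 - p)|^n <= growth_bound on a grid of angles.

    The grid covers |theta| <= n^(-2/5); small tolerances absorb rounding.
    """
    theta_max = n ** (-0.4)
    for i in range(samples):
        theta = -theta_max + 2 * theta_max * i / (samples - 1)
        r = shifted_point_norm(p, theta)
        if r < 1 - 1e-12:
            return False
        if r ** n > growth_bound(p, n) * (1 + 1e-9):
            return False
    return True
```

For instance, `bounds_hold(0.5, 1000)` evaluates the inequality for deletion probability $p=1/2$ and $n=1000$ .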

From here we can apply the triangle inequality and conclude that

	$\displaystyle\mathinner{\!\left\lVert K_{\mathbf{x}}-K_{\mathbf{y}}\right% \rVert}_{\ell_{1}}$	$\displaystyle\geq\sum_{\ell=0}^{n-1}\mathinner{\!\left\lvert K_{w,\mathbf{x}}[% \ell]-K_{w,\mathbf{y}}[\ell]\right\rvert}$
		$\displaystyle\geq\mathinner{\!\left\lvert\frac{z_{0}-p}{1-p}\right\rvert}^{-n}% \cdot\mathinner{\!\left\lvert\sum_{\ell=0}^{n-1}\left(K_{w,\mathbf{x}}[\ell]-K% _{w,\mathbf{y}}[\ell]\right)\cdot\left(\frac{z_{0}-p}{1-p}\right)^{\ell}\right\rvert}$
		$\displaystyle\geq\exp\left(-\frac{p}{2(1-p)^{2}}\cdot n^{1/5}-Cn^{1/5}\log^{5}% n\right)$
		$\displaystyle\geq\exp\left(-C^{\prime}n^{1/5}\log^{5}n\right).$

Here $C^{\prime}=p(1-p)^{-2}/2+C$ is a constant depending only on the deletion probability $p$ .

Case 2: $x_{i}\neq y_{i}$ for some $0\leq i<2n^{1/5}-1$ , i.e., $\mathbf{x}[0\colon 2n^{1/5}-1]\neq\mathbf{y}[0\colon 2n^{1/5}-1]$ .

In this case, we are going to take $w=\mathbf{x}[0\colon 2n^{1/5}-1]$ and show a much better bound

\displaystyle\sup_{z\colon\mathinner{\!\left\lvert z\right\rvert}\leq 1}% \mathinner{\!\left\lvert P_{w,\mathbf{x}}(z)-P_{w,\mathbf{y}}(z)\right\rvert}>% C^{\prime\prime},

(2)

where $C^{\prime\prime}>0$ is a constant depending only on $p$ (hence certainly greater than $\exp(-\widetilde{O}(n^{1/5}))$ ). As in Case 1, applying the triangle inequality to Eq. 2 then gives the theorem.

To prove Eq. 2, we let

\displaystyle Q(z_{0})=\sum_{j\geq 0}\left(\mathbf{1}\Set{\mathbf{x}[j\colon j% +2n^{1/5}-1]=w}-\mathbf{1}\Set{\mathbf{y}[j\colon j+2n^{1/5}-1]=w}\right)z_{0}% ^{j},

so that $Q(p+(1-p)z)=P_{w,\mathbf{x}}(z)-P_{w,\mathbf{y}}(z)$ . Under our choice of $w$ , the constant term of $Q$ equals 1, i.e., $Q(0)=1$ .

If $p\in(0,1/2]$ , the closed disk $B(p;1-p)=\set{p+(1-p)z\colon\mathinner{\!\left\lvert z\right\rvert}\leq 1}$ contains the point 0. Therefore

\displaystyle\sup_{z\colon\mathinner{\!\left\lvert z\right\rvert}\leq 1}% \mathinner{\!\left\lvert P_{w,\mathbf{x}}(z)-P_{w,\mathbf{y}}(z)\right\rvert}=% \sup_{z_{0}\in B(p;1-p)}\mathinner{\!\left\lvert Q(z_{0})\right\rvert}\geq% \mathinner{\!\left\lvert Q(0)\right\rvert}=1.

We are left with the case $p\in(1/2,1)$ . Since $Q$ is a polynomial with coefficients absolutely bounded by 1, we can apply Lemma 2.2 with $a=2(1-p)\in(0,1)$ and obtain

\displaystyle\sup_{t_{0}\in[1-a,1]}\mathinner{\!\left\lvert Q(t_{0})\right\rvert}\geq\exp\left(-\frac{c_{2}}{a}\right)\cdot\mathinner{\!\left\lvert Q(0)\right\rvert}^{c_{1}/a}=\exp\left(-\frac{c_{2}}{a}\right)

for the constants $c_{1},c_{2}>0$ of Lemma 2.2. Denoting $t=(t_{0}-p)/(1-p)$ , we have $t\in[-1,1]$ when $t_{0}\in[1-a,1]$ . In particular, $t$ is inside the closed unit disk $B(0;1)$ . Therefore

\displaystyle\sup_{z\colon\mathinner{\!\left\lvert z\right\rvert}\leq 1}\mathinner{\!\left\lvert P_{w,\mathbf{x}}(z)-P_{w,\mathbf{y}}(z)\right\rvert}\geq\sup_{t\in[-1,1]}\mathinner{\!\left\lvert P_{w,\mathbf{x}}(t)-P_{w,\mathbf{y}}(t)\right\rvert}=\sup_{t_{0}\in[1-a,1]}\mathinner{\!\left\lvert Q(t_{0})\right\rvert}\geq\exp\left(-\frac{c_{2}}{a}\right).

To conclude, we can take $C^{\prime\prime}=\min\set{1,\exp(-c_{2}(1-p)^{-1}/2)}$ . ∎
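The key observation of Case 2, that $Q(0)=1$ once $w$ is taken to be the first $k$ -mer of $\mathbf{x}$ and the first $k$ -mers of $\mathbf{x}$ and $\mathbf{y}$ differ, can be illustrated on a toy instance. The sketch below is only an illustration under our own conventions: strings are Python `str` objects over `"0"` and `"1"`, and all function names are hypothetical.

```python
from typing import List


def kmer_indicator(x: str, w: str) -> List[int]:
    """Return the 0/1 vector whose j-th entry is 1 iff x[j : j+k-1] = w."""
    k = len(w)
    return [1 if x[j:j + k] == w else 0 for j in range(len(x) - k + 1)]


def q_coefficients(x: str, y: str, w: str) -> List[int]:
    """Coefficients of Q(z0) = sum_j (1{x[j:j+k-1]=w} - 1{y[j:j+k-1]=w}) z0^j."""
    return [a - b for a, b in zip(kmer_indicator(x, w), kmer_indicator(y, w))]


def evaluate(coeffs: List[int], z0: complex) -> complex:
    """Evaluate a polynomial given by its coefficient list (Horner's rule)."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * z0 + c
    return acc


def kmer_poly_difference(x: str, y: str, w: str, p: float, z: complex) -> complex:
    """Return P_{w,x}(z) - P_{w,y}(z) = Q(p + (1 - p) z)."""
    return evaluate(q_coefficients(x, y, w), p + (1 - p) * z)
```

With `x = "0110100"`, `y = "0100110"` and `w = x[:3]`, the constant coefficient of $Q$ is 1. For $p\leq 1/2$ the point $z=-p/(1-p)$ lies in the closed unit disk and is mapped to $z_{0}=0$ , so the difference of the two $k$ -mer generating polynomials has modulus 1 there.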

3 A lower bound for $k$ -mer based algorithms: Proof of Theorem 2

We prove Theorem 2 in this section. The proof is based on the following lemma, which we will prove shortly.

Lemma 3.1.

There exist distinct $\mathbf{x},\mathbf{y}\in\set{0,1}^{n}$ such that for any $k$ -mer $w$ , it holds that

\displaystyle\sup_{z\colon\mathinner{\!\left\lvert z\right\rvert}=1}\mathinner% {\!\left\lvert P_{w,\mathbf{x}}(z)-P_{w,\mathbf{y}}(z)\right\rvert}\leq 2^{-cn% ^{1/5}\sqrt{\log n}}.

Proof of Theorem 2 using Lemma 3.1.

We can extract $K_{w,\mathbf{x}}[\ell]-K_{w,\mathbf{y}}[\ell]$ by the contour integral (cf. [Lan13, §4, Theorem 2.1])

\displaystyle K_{w,\mathbf{x}}[\ell]-K_{w,\mathbf{y}}[\ell]=\frac{1}{2\pi i}% \int_{\mathinner{\!\left\lvert z\right\rvert}=1}\left(P_{w,\mathbf{x}}(z)-P_{w% ,\mathbf{y}}(z)\right)\cdot z^{-\ell-1}\operatorname{d\!}z.

Therefore

\displaystyle\mathinner{\!\left\lvert K_{w,\mathbf{x}}[\ell]-K_{w,\mathbf{y}}[% \ell]\right\rvert}\leq\frac{1}{2\pi}\int_{\mathinner{\!\left\lvert z\right% \rvert}=1}\mathinner{\!\left\lvert P_{w,\mathbf{x}}(z)-P_{w,\mathbf{y}}(z)% \right\rvert}\cdot\mathinner{\!\left\lvert z\right\rvert}^{-\ell-1}\cdot% \mathinner{\!\left\lvert\operatorname{d\!}z\right\rvert}\leq 2^{-cn^{1/5}\sqrt% {\log n}}.

We stress that the bound holds for any $\ell\in[n]$ and $k$ -mer $w$ . Note that for any fixed $\ell$ , there are at most $n-k+1$ different $k$ -mers $w$ for which $K_{w,\mathbf{x}}[\ell]>0$ . Namely, if $w\notin\set{\mathbf{x}[j\colon j+k-1]\colon 0\leq j\leq n-k}$ then $K_{w,\mathbf{x}}[\ell]=0$ . The same holds for $\mathbf{y}$ , so for each $\ell$ at most $2(n-k+1)$ of the terms indexed by $w$ are nonzero. It follows that

\displaystyle\mathinner{\!\left\lVert K_{\mathbf{x}}-K_{\mathbf{y}}\right\rVert}_{\ell_{1}}=\sum_{\ell=0}^{n-1}\sum_{w}\mathinner{\!\left\lvert K_{w,\mathbf{x}}[\ell]-K_{w,\mathbf{y}}[\ell]\right\rvert}\leq n\cdot 2(n-k+1)\cdot 2^{-cn^{1/5}\sqrt{\log n}}\leq 2^{-c^{\prime}n^{1/5}\sqrt{\log n}}.

∎
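The coefficient-extraction step can be mimicked numerically by discretizing the contour integral on equispaced points of the unit circle, which is exact for polynomials once the number of nodes exceeds the degree. The sketch below is illustrative only: it takes $P_{w,\mathbf{x}}(z)=\sum_{j}\mathbf{1}\Set{\mathbf{x}[j\colon j+k-1]=w}(p+(1-p)z)^{j}$ and treats the density-map entries as the coefficients of this polynomial. The function names and the choice of $2n$ nodes are our own assumptions.

```python
import cmath
from math import comb
from typing import List


def kmer_indicator(x: str, w: str) -> List[int]:
    """Return the 0/1 vector whose j-th entry is 1 iff x[j : j+k-1] = w."""
    k = len(w)
    return [1 if x[j:j + k] == w else 0 for j in range(len(x) - k + 1)]


def kmer_poly(x: str, w: str, p: float, z: complex) -> complex:
    """Evaluate sum_j 1{x[j:j+k-1]=w} (p + (1-p) z)^j."""
    return sum(b * (p + (1 - p) * z) ** j
               for j, b in enumerate(kmer_indicator(x, w)))


def coefficients_by_expansion(x: str, w: str, p: float) -> List[float]:
    """Coefficients of kmer_poly in z, obtained by binomial expansion."""
    out = [0.0] * len(x)
    for j, b in enumerate(kmer_indicator(x, w)):
        if b:
            for ell in range(j + 1):
                out[ell] += comb(j, ell) * (1 - p) ** ell * p ** (j - ell)
    return out


def coefficients_by_contour(x: str, w: str, p: float) -> List[float]:
    """Same coefficients via (1/N) sum_m P(z_m) z_m^(-ell) over N-th roots of unity."""
    n = len(x)
    num_nodes = 2 * n  # more nodes than the degree, so there is no aliasing
    nodes = [cmath.exp(2j * cmath.pi * m / num_nodes) for m in range(num_nodes)]
    values = [kmer_poly(x, w, p, z) for z in nodes]
    return [
        (sum(v * z ** (-ell) for v, z in zip(values, nodes)) / num_nodes).real
        for ell in range(n)
    ]
```

Both routines should agree up to rounding, and each coefficient is bounded in absolute value by the largest sampled value of $\mathinner{\!\left\lvert P_{w,\mathbf{x}}\right\rvert}$ on the circle, which is the discrete analogue of the bound used in the proof.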

Next, we prove Lemma 3.1 assuming the following result, which is our main technical lemma.

Lemma 3.2.

Fix any $k\leq L^{1/3}$ . There exist distinct $\mathbf{x},\mathbf{y}\in\set{0,1}^{L}$ both starting with a run of 0s of length $L^{1/3}-1$ , such that for any $k$ -mer $w$ , it holds that

\displaystyle\sup_{\theta\colon\mathinner{\!\left\lvert\theta\right\rvert}\leq L% ^{-2/3}\log^{1/4}L}\mathinner{\!\left\lvert P_{w,\mathbf{x}}(e^{i\theta})-P_{w% ,\mathbf{y}}(e^{i\theta})\right\rvert}\leq 2^{-L^{1/3}\sqrt{\log L}/20}.

Proof of Lemma 3.1 using Lemma 3.2.

Let $\beta\geq 3/5$ be a parameter to be decided later. Denote $L\coloneqq n^{\beta}$ . We have $k\leq n^{1/5}=L^{1/(5\beta)}\leq L^{1/3}$ , so that the premise of Lemma 3.2 is satisfied. Therefore, there exist distinct $\mathbf{x}^{\prime},\mathbf{y}^{\prime}\in\set{0,1}^{L}$ both starting with a run of 0s of length $L^{1/3}-1$ , such that for any $k$ -mer $w$ , it holds that

\displaystyle\sup_{\theta\colon\mathinner{\!\left\lvert\theta\right\rvert}\leq L% ^{-2/3}\log^{1/4}L}\mathinner{\!\left\lvert P_{w,\mathbf{x}^{\prime}}(e^{i% \theta})-P_{w,\mathbf{y}^{\prime}}(e^{i\theta})\right\rvert}\leq 2^{-L^{1/3}% \sqrt{\log L}/20}.

(3)

Let $\mathbf{x}=0^{n-L}\mathbf{x}^{\prime}$ and $\mathbf{y}=0^{n-L}\mathbf{y}^{\prime}$ . Since $k\leq L^{1/3}$ , by construction we have $\mathbf{x}[j\mathrel{\mathop{:}}j+k-1]=\mathbf{y}[j\mathrel{\mathop{:}}j+k-1]$ for all $0\leq j\leq n-L$ . Therefore, for any $k$ -mer $w$ we have

		$\displaystyle P_{w,\mathbf{x}}(e^{i\theta})-P_{w,\mathbf{y}}(e^{i\theta})$
	$\displaystyle=$	$\displaystyle\sum_{j=0}^{n-k}\left(\mathbf{1}\Set{\mathbf{x}[j\mathrel{\mathop{:}}j+k-1]=w}-\mathbf{1}\Set{\mathbf{y}[j\mathrel{\mathop{:}}j+k-1]=w}\right)(p+qe^{i\theta})^{j}$
	$\displaystyle=$	$\displaystyle\left(p+qe^{i\theta}\right)^{n-L}\cdot\sum_{j=n-L}^{n-k}\left(\mathbf{1}\Set{\mathbf{x}[j\mathrel{\mathop{:}}j+k-1]=w}-\mathbf{1}\Set{\mathbf{y}[j\mathrel{\mathop{:}}j+k-1]=w}\right)(p+qe^{i\theta})^{j-(n-L)}$
	$\displaystyle=$	$\displaystyle\left(p+qe^{i\theta}\right)^{n-L}\cdot\sum_{j=0}^{L-k}\left(\mathbf{1}\Set{\mathbf{x}^{\prime}[j\mathrel{\mathop{:}}j+k-1]=w}-\mathbf{1}\Set{\mathbf{y}^{\prime}[j\mathrel{\mathop{:}}j+k-1]=w}\right)(p+qe^{i\theta})^{j}$
	$\displaystyle=$	$\displaystyle\left(p+qe^{i\theta}\right)^{n-L}\cdot\left(P_{w,\mathbf{x}^{\prime}}(e^{i\theta})-P_{w,\mathbf{y}^{\prime}}(e^{i\theta})\right).$

Here $q=1-p$ . When $\mathinner{\!\left\lvert\theta\right\rvert}$ is large, we can upper bound the supremum as

	$\displaystyle\sup_{\theta\colon\mathinner{\!\left\lvert\theta\right\rvert}>L^{% -2/3}\log^{1/4}L}\mathinner{\!\left\lvert P_{w,\mathbf{x}}(e^{i\theta})-P_{w,% \mathbf{y}}(e^{i\theta})\right\rvert}$	$\displaystyle=\sup_{\theta\colon\mathinner{\!\left\lvert\theta\right\rvert}>L^% {-2/3}\log^{1/4}L}\mathinner{\!\left\lvert p+qe^{i\theta}\right\rvert}^{n-L}% \cdot\mathinner{\!\left\lvert P_{w,\mathbf{x}^{\prime}}(e^{i\theta})-P_{w,% \mathbf{y}^{\prime}}(e^{i\theta})\right\rvert}$
		$\displaystyle\leq\left(1-c_{1}L^{-4/3}\log^{1/2}L\right)^{n-L}\cdot\sup_{% \theta\colon\mathinner{\!\left\lvert\theta\right\rvert}>L^{-2/3}\log^{1/4}L}% \mathinner{\!\left\lvert P_{w,\mathbf{x}^{\prime}}(e^{i\theta})-P_{w,\mathbf{y% }^{\prime}}(e^{i\theta})\right\rvert}$
		$\displaystyle\leq\exp\left(-c_{1}(n-L)L^{-4/3}\log^{1/2}L\right)\cdot(L-k+1)$
		$\displaystyle\leq\exp_{2}\left(-c_{2}n^{1-4\beta/3}\log^{1/2}n\right).$

Here the first inequality is due to $\mathinner{\!\left\lvert p+qe^{i\theta}\right\rvert}\leq 1-c_{1}a^{2}$ for some constant $c_{1}$ (depending on $p$ ) when $\mathinner{\!\left\lvert\theta\right\rvert}\geq a$ . When $\mathinner{\!\left\lvert\theta\right\rvert}$ is small, this is taken care of by Eq. 3:

	$\displaystyle\sup_{\theta\colon\mathinner{\!\left\lvert\theta\right\rvert}\leq L% ^{-2/3}\log^{1/4}L}\mathinner{\!\left\lvert P_{w,\mathbf{x}}(e^{i\theta})-P_{w% ,\mathbf{y}}(e^{i\theta})\right\rvert}$	$\displaystyle\leq\sup_{\theta\colon\mathinner{\!\left\lvert\theta\right\rvert}% \leq L^{-2/3}\log^{1/4}L}\mathinner{\!\left\lvert P_{w,\mathbf{x}^{\prime}}(e^% {i\theta})-P_{w,\mathbf{y}^{\prime}}(e^{i\theta})\right\rvert}$
		$\displaystyle\leq\exp_{2}\left(-L^{1/3}\sqrt{\log L}/20\right)$
		$\displaystyle\leq\exp_{2}\left(-c_{3}n^{\beta/3}\log^{1/2}n\right).$

Finally, the value of $\beta$ is determined by balancing the two cases. Namely, we let $1-4\beta/3=\beta/3$ , or $\beta=3/5$ , which gives the bound $2^{-cn^{1/5}\sqrt{\log n}}$ for both cases. Here $c=\min\set{c_{2},c_{3}}$ . ∎
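The balancing step is a one-line computation, and it can be reproduced with exact rational arithmetic. The sketch below uses hypothetical helper names and merely confirms that $\beta=3/5$ equalizes the two exponents of $n$ , and that a larger $\beta$ makes the large- $\mathinner{\!\left\lvert\theta\right\rvert}$ exponent smaller.

```python
from fractions import Fraction
from typing import Tuple


def balance_beta() -> Fraction:
    """Solve 1 - (4/3) * beta = beta / 3, i.e. 1 = (5/3) * beta."""
    return Fraction(1) / (Fraction(4, 3) + Fraction(1, 3))


def exponents(beta: Fraction) -> Tuple[Fraction, Fraction]:
    """Return the exponents of n for the large-theta and small-theta cases."""
    large_theta = 1 - Fraction(4, 3) * beta
    small_theta = beta / 3
    return large_theta, small_theta
```

Calling `exponents(balance_beta())` returns the pair `(Fraction(1, 5), Fraction(1, 5))`, matching the exponent $n^{1/5}$ in the lemma.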

It remains to prove Lemma 3.2, which we do after some helpful preliminaries from complex analysis.

3.1 Some helpful results in complex analysis

In this section, we introduce some results in complex analysis, which will be useful for proving Lemma 3.2.

Let $T_{d}(x)$ denote the $d$ -th Chebyshev polynomial, i.e., the degree- $d$ polynomial such that $T_{d}(\cos\theta)=\cos(d\theta)$ . Clearly, $T_{d}(x)\in[-1,1]$ for $x\in[-1,1]$ . If a function $f(z)$ is analytic on $[-1,1]$ , it has a convergent Chebyshev expansion

\displaystyle f(z)=\sum_{d=0}^{\infty}a_{d}\cdot T_{d}(z),\quad z\in[-1,1].

Here the $a_{d}$ ’s are the Chebyshev coefficients, and they can be extracted by the following integral

\displaystyle a_{d}=\frac{1}{\pi}\int_{0}^{2\pi}f(\cos\theta)\cos(d\theta)% \operatorname{d\!}\theta,\quad d\geq 1,

where $\pi$ is replaced by $2\pi$ for $d=0$ . This immediately implies a uniform upper bound on Chebyshev coefficients.

Proposition 1.

For all $d\geq 0$ , $\mathinner{\!\left\lvert a_{d}\right\rvert}\leq 2\sup_{x\in[-1,1]}\mathinner{% \!\left\lvert f(x)\right\rvert}$ .
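To make the integral formula concrete, the following sketch computes Chebyshev coefficients by an equispaced quadrature rule in $\theta$ , which is exact for polynomials when enough nodes are used. The function names and the default node count are our own choices, made for illustration only.

```python
import math
from typing import Callable, List


def chebyshev_coefficients(f: Callable[[float], float], degree: int) -> List[float]:
    """Approximate a_0, ..., a_degree via (1/pi) * int_0^{2pi} f(cos t) cos(d t) dt.

    The integral is replaced by an equispaced sum with 4 * (degree + 1) nodes,
    which is exact when f is a polynomial of degree at most `degree`.
    """
    num_nodes = 4 * (degree + 1)
    thetas = [2 * math.pi * m / num_nodes for m in range(num_nodes)]
    values = [f(math.cos(t)) for t in thetas]
    coeffs = []
    for d in range(degree + 1):
        mean = sum(v * math.cos(d * t) for v, t in zip(values, thetas)) / num_nodes
        # (1/pi) * integral = 2 * mean; for d = 0, pi is replaced by 2 * pi.
        coeffs.append(mean if d == 0 else 2 * mean)
    return coeffs


def chebyshev_eval(coeffs: List[float], x: float) -> float:
    """Evaluate sum_d a_d T_d(x) for x in [-1, 1] using T_d(cos t) = cos(d t)."""
    theta = math.acos(x)
    return sum(a * math.cos(d * theta) for d, a in enumerate(coeffs))
```

For example, the polynomial $f(x)=\tfrac{1}{2}+T_{2}(x)+T_{3}(x)=\tfrac{1}{2}+(2x^{2}-1)+(4x^{3}-3x)$ should yield coefficients close to $(0.5,0,1,1)$ , all of which are at most $2\sup_{x\in[-1,1]}\mathinner{\!\left\lvert f(x)\right\rvert}$ in absolute value.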

In fact, if $f$ is analytically continuable to a larger region, much better bounds can be obtained. For that we need the notion of Bernstein ellipse.

Definition 5 (Bernstein Ellipse).

Given $\rho\geq 1$ , the boundary of the Bernstein Ellipse is defined as

\displaystyle\partial E_{\rho}\coloneqq\Set{\frac{u+u^{-1}}{2}\colon u=\rho e^% {i\theta},\theta\in[0,2\pi)}.

The Bernstein Ellipse $E_{\rho}$ has the foci at $\pm 1$ with the major and minor semi-axes given by $(\rho+\rho^{-1})/2$ and $(\rho-\rho^{-1})/2$ , respectively. When $\rho=1$ , $E_{\rho}$ coincides with the interval $[-1,1]$ on the real line. For our purpose, we will also be working with affine transformations of $E_{\rho}$ . More precisely, for $a\in[0,1/8]$ we denote by $\widetilde{E}_{a,\rho}$ (the interior of) the following ellipse

\displaystyle\Set{(1-4a)+4a\cdot\frac{u+u^{-1}}{2}\colon u=\rho e^{i\theta},% \theta\in[0,2\pi)}.

Thus, $\widetilde{E}_{a,\rho}$ can be equivalently defined as

\displaystyle\Set{z\colon\mathinner{\!\left\lvert z-(1-8a)\right\rvert}+% \mathinner{\!\left\lvert z-1\right\rvert}\leq 8a+4a(\rho-1)^{2}/\rho}.

Below are some useful properties of $\widetilde{E}_{a,\rho}$ .

Proposition 2.

The following statements hold.

1.

Let $z\in\partial\widetilde{E}_{a,\rho}$ . Then $\mathinner{\!\left\lvert z\right\rvert}\leq 1+2a(\rho-1)^{2}/\rho$ .
2.

$\widetilde{E}_{a,\rho}$ contains a disk centered at 1 with radius $2a(\rho-1)^{2}/\rho$ .

Proof.

Item (1): Writing $z=(1-4a)+4a(u+u^{-1})/2$ where $u=\rho e^{i\theta}$ , we have

	$\displaystyle\mathinner{\!\left\lvert z\right\rvert}^{2}$	$\displaystyle=\left(1-4a+2a\rho\cos\theta+2a\rho^{-1}\cos\theta\right)^{2}+% \left(2a\rho\sin\theta-2a\rho^{-1}\sin\theta\right)^{2}$
		$\displaystyle=(1-4a)^{2}+2(1-4a)\cdot 2a(\rho+\rho^{-1})\cos\theta+\left(2a(% \rho+\rho^{-1})\right)^{2}\cos^{2}\theta+\left(2a(\rho-\rho^{-1})\right)^{2}% \sin^{2}\theta$
		$\displaystyle\leq(1-4a)^{2}+2(1-4a)\cdot 2a(\rho+\rho^{-1})+\left(2a(\rho+\rho% ^{-1})\right)^{2}-4a^{2}\sin^{2}\theta\left((\rho+\rho^{-1})^{2}-(\rho-\rho^{-% 1})^{2}\right)$
		$\displaystyle=\left((1-4a)+2a(\rho+\rho^{-1})\right)^{2}-16a^{2}\sin^{2}\theta$
		$\displaystyle\leq\left(1+2a(\rho+\rho^{-1}-2)\right)^{2}.$

Therefore $\mathinner{\!\left\lvert z\right\rvert}\leq 1+2a(\rho+\rho^{-1}-2)=1+2a(\rho-1% )^{2}/\rho$ . Item (2): Let $z$ be such that $\mathinner{\!\left\lvert z-1\right\rvert}\leq 2a(\rho-1)^{2}/\rho$ . We have

\displaystyle\mathinner{\!\left\lvert z-(1-8a)\right\rvert}+\mathinner{\!\left% \lvert z-1\right\rvert}\leq 8a+2\mathinner{\!\left\lvert z-1\right\rvert}\leq 8% a+4a(\rho-1)^{2}/\rho.

This implies $z\in\widetilde{E}_{a,\rho}$ . ∎
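Both items can be checked numerically on sampled boundary points. The sketch below is a sanity check rather than a proof. The helper names are hypothetical, and the focal-sum characterization used in the code is the equivalent definition of $\widetilde{E}_{a,\rho}$ given before Proposition 2.

```python
import cmath
import math


def ellipse_boundary_point(a: float, rho: float, theta: float) -> complex:
    """Return (1 - 4a) + 4a * (u + 1/u) / 2 with u = rho * exp(i * theta)."""
    u = rho * cmath.exp(1j * theta)
    return (1 - 4 * a) + 4 * a * (u + 1 / u) / 2


def focal_sum(z: complex, a: float) -> float:
    """Sum of the distances from z to the two foci 1 - 8a and 1."""
    return abs(z - (1 - 8 * a)) + abs(z - 1)


def proposition_holds(a: float, rho: float, samples: int = 360) -> bool:
    """Check items (1) and (2) of Proposition 2 on `samples` angles."""
    margin = 2 * a * (rho - 1) ** 2 / rho
    for m in range(samples):
        theta = 2 * math.pi * m / samples
        z = ellipse_boundary_point(a, rho, theta)
        # Item (1): boundary points have modulus at most 1 + margin.
        if abs(z) > 1 + margin + 1e-12:
            return False
        # Boundary points attain the focal sum 8a + 2 * margin.
        if abs(focal_sum(z, a) - (8 * a + 2 * margin)) > 1e-9:
            return False
        # Item (2): points at distance margin from 1 stay inside the ellipse.
        w = 1 + margin * cmath.exp(1j * theta)
        if focal_sum(w, a) > 8 * a + 2 * margin + 1e-12:
            return False
    return True
```

For instance, `proposition_holds(0.1, 2.0)` tests the case $\rho=2$ used later, where the margin $2a(\rho-1)^{2}/\rho$ equals $a$ .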

The following result shows an exponential convergence rate of the Chebyshev expansion.

Theorem 5 (Theorem 8.1, [Tre12]).

Let a function $f$ analytic on $[-1,1]$ be analytically continuable to the open Bernstein Ellipse $E_{\rho}$ , where it satisfies $\mathinner{\!\left\lvert f(z)\right\rvert}\leq M$ for some $M$ . Then its Chebyshev coefficients satisfy

\displaystyle\mathinner{\!\left\lvert a_{0}\right\rvert}\leq M,\textup{ and }% \mathinner{\!\left\lvert a_{k}\right\rvert}\leq 2M\rho^{-k},k\geq 1.

Proof.

The Chebyshev coefficients of $f$ are given by

\displaystyle a_{k}=\frac{1}{\pi}\int_{0}^{2\pi}f\left(\cos\theta\right)T_{k}% \left(\cos\theta\right)\operatorname{d\!}\theta=\frac{1}{\pi}\int_{0}^{2\pi}f% \left(\cos\theta\right)\cos\left(k\theta\right)\operatorname{d\!}\theta,

with $\pi$ replaced by $2\pi$ for $k=0$ . Letting $z=e^{i\theta}$ , one could write $\cos\theta=(z+z^{-1})/2$ , $\operatorname{d\!}\theta=(iz)^{-1}\operatorname{d\!}z$ , and hence

\displaystyle a_{k}=\frac{1}{\pi i}\int_{\mathinner{\!\left\lvert z\right% \rvert}=1}f\left(\frac{z+z^{-1}}{2}\right)\frac{z^{k}+z^{-k}}{2}\cdot z^{-1}% \operatorname{d\!}z.

Denote $F(z)\coloneqq f((z+z^{-1})/2)=F(z^{-1})$ . Note that we can substitute $z^{-1}$ for $z$ and obtain

\displaystyle\frac{1}{\pi i}\int_{\mathinner{\!\left\lvert z\right\rvert}=1}F(% z)z^{k-1}\operatorname{d\!}z=-\frac{1}{\pi i}\int_{\mathinner{\!\left\lvert z% \right\rvert}=1}F(z^{-1})z^{-(k-1)}\operatorname{d\!}z^{-1}=\frac{1}{\pi i}% \int_{\mathinner{\!\left\lvert z\right\rvert}=1}F(z)z^{-k-1}\operatorname{d\!}z.

Therefore we arrive at the expression

\displaystyle a_{k}=\frac{1}{\pi i}\int_{|z|=1}F(z)z^{-k-1}\operatorname{d\!}z.

Since $f(z)$ is analytic in the open Bernstein Ellipse $E_{\rho}$ , we can conclude that $F(z)$ is analytic in the annulus $\rho^{-1}<\mathinner{\!\left\lvert z\right\rvert}<\rho$ . That means, for any $\rho_{0}\in(\rho^{-1},\rho)$ we have by Cauchy’s integral theorem (cf. [Lan13, §3, Theorem 5.1]) that

\displaystyle a_{k}=\frac{1}{\pi i}\int_{\mathinner{\!\left\lvert z\right% \rvert}=\rho_{0}}F(z)z^{-k-1}\operatorname{d\!}z.

Now we have

\displaystyle\mathinner{\!\left\lvert a_{k}\right\rvert}\leq\frac{1}{\pi}\cdot% \int_{\mathinner{\!\left\lvert z\right\rvert}=\rho_{0}}\mathinner{\!\left% \lvert F(z)\right\rvert}\cdot\mathinner{\!\left\lvert z\right\rvert}^{-k-1}% \mathinner{\!\left\lvert\operatorname{d\!}z\right\rvert}\leq\frac{1}{\pi}\cdot 2% \pi\rho_{0}M\cdot\rho_{0}^{-k-1}=2M\rho_{0}^{-k}.

Finally, since the bound holds for any $\rho_{0}<\rho$ , it also holds for $\rho_{0}=\rho$ . ∎
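As an illustration of the geometric decay, consider the test function $f(x)=1/(3-x)$ , which is our own example and does not appear in the argument. It is analytic in the Bernstein ellipse $E_{2}$ , whose right vertex is $5/4$ , so $M=1/(3-5/4)=4/7$ is an admissible bound there. Its Chebyshev coefficients are known in closed form, $a_{k}=\tfrac{2}{\sqrt{8}}(3-\sqrt{8})^{k}$ for $k\geq 1$ . The sketch below, with hypothetical function names, compares numerically computed coefficients against the bound $2M\rho^{-k}$ .

```python
import math
from typing import Callable, List


def chebyshev_coefficients(
    f: Callable[[float], float], degree: int, num_nodes: int = 512
) -> List[float]:
    """Approximate Chebyshev coefficients by an equispaced rule in theta."""
    thetas = [2 * math.pi * m / num_nodes for m in range(num_nodes)]
    values = [f(math.cos(t)) for t in thetas]
    coeffs = []
    for d in range(degree + 1):
        mean = sum(v * math.cos(d * t) for v, t in zip(values, thetas)) / num_nodes
        coeffs.append(mean if d == 0 else 2 * mean)
    return coeffs


def decay_bound_holds(coeffs: List[float], bound_m: float, rho: float) -> bool:
    """Check |a_0| <= M and |a_k| <= 2 M rho^(-k) for k >= 1."""
    if abs(coeffs[0]) > bound_m:
        return False
    return all(
        abs(a) <= 2 * bound_m * rho ** (-k)
        for k, a in enumerate(coeffs)
        if k >= 1
    )
```

Running `decay_bound_holds(chebyshev_coefficients(lambda x: 1 / (3 - x), 20), 4 / 7, 2.0)` checks the first twenty coefficients against the theorem with $\rho=2$ .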

We will also make use of the following theorem.

Theorem 6 (Hadamard Three Circles Theorem).

Suppose $f$ is analytic inside and on $\set{z\in\mathbb{C}\colon r_{1}\leq\mathinner{\!\left\lvert z\right\rvert}\leq r% _{2}}$ . For $r\in[r_{1},r_{2}]$ , let $M(r)\coloneqq\sup_{\mathinner{\!\left\lvert z\right\rvert}=r}\mathinner{\!% \left\lvert f(z)\right\rvert}$ . Then

\displaystyle M(r)^{\log(r_{2}/r_{1})}\leq M(r_{1})^{\log(r_{2}/r)}M(r_{2})^{% \log(r/r_{1})}.

Corollary 3.1.

Suppose $f(z)=\sum_{j=0}^{n-1}c_{j}z^{j}$ where $\mathinner{\!\left\lvert c_{j}\right\rvert}\leq 1$ . Then

\displaystyle\sup_{z\in\partial\widetilde{E}_{a,2}}\mathinner{\!\left\lvert f(% z)\right\rvert}\leq\exp\left(5an/2\right)\cdot\left(\sup_{z\in[1-8a,1]}% \mathinner{\!\left\lvert f(z)\right\rvert}\right)^{1/2}.

Proof.

Let $\rho_{1}=1,\rho=2,\rho_{2}=2^{2}$ . Let $g(z)\coloneqq f(u)$ where $u=(1-4a)+4a(z+z^{-1})/2$ . Since $f$ is analytic on and inside $\widetilde{E}_{a,\rho_{2}}$ , $g$ is analytic inside the centered disk with radius $\rho_{2}$ . Applying the Hadamard Three Circles Theorem to $g$ gives

\displaystyle\sup_{z\in\partial\widetilde{E}_{a,\rho}}\mathinner{\!\left\lvert f% (z)\right\rvert}\leq\left(\sup_{z\in\partial\widetilde{E}_{a,\rho_{1}}}% \mathinner{\!\left\lvert f(z)\right\rvert}\right)^{1/2}\left(\sup_{z\in% \partial\widetilde{E}_{a,\rho_{2}}}\mathinner{\!\left\lvert f(z)\right\rvert}% \right)^{1/2}.

We note that $\widetilde{E}_{a,\rho_{1}}$ coincides with the interval $[1-8a,1]$ on the real line. For $z\in\partial\widetilde{E}_{a,\rho_{2}}$ , Proposition 2 implies $\mathinner{\!\left\lvert f(z)\right\rvert}\leq n\cdot(1+2a(4-1)^{2}/4)^{n}\leq% \exp(5an)$ . Therefore

\displaystyle\sup_{z\in\partial\widetilde{E}_{a,2}}\mathinner{\!\left\lvert f(% z)\right\rvert}\leq\exp\left(5an/2\right)\cdot\left(\sup_{z\in[1-8a,1]}% \mathinner{\!\left\lvert f(z)\right\rvert}\right)^{1/2}.

∎

3.2 Proof of Lemma 3.2: A Counting Argument

We prove Lemma 3.2 in this section.

We first prove a technical lemma lower bounding the number of binary strings in which all 1s are far away from each other.

Lemma 3.3.

Let $S_{n,r}\subseteq\set{0,1}^{n}$ be the collection of all $n$ -bit strings with the property that any two 1’s are separated by at least $r$ many 0’s. Then $\mathinner{\!\left\lvert S_{n,r}\right\rvert}\geq(\sqrt{r+1})^{n/r-1}$ .

Proof.

For ease of notation we fix $r$ and denote $f(n)\coloneqq\mathinner{\!\left\lvert S_{n,r}\right\rvert}$ . We observe that $f$ satisfies the following recurrence relation

\displaystyle f(n)=\begin{cases}n+1,&\text{ for }0\leq n\leq r\\ f(n-1)+f(n-r-1),&\text{ for }n\geq r+1\end{cases}.

We prove by induction that $f(n)\geq(\sqrt{r+1})^{n/r-1}$ . The base case is trivial since $(\sqrt{r+1})^{n/r-1}\leq 1$ when $n\leq r$ .

Now suppose $f(k)\geq(\sqrt{r+1})^{k/r-1}$ for $k\leq n-1$ . This gives, for $k=n$ , the following bound

	$\displaystyle f(n)$	$\displaystyle=f(n-1)+f(n-r-1)$
		$\displaystyle\geq(\sqrt{r+1})^{(n-1)/r-1}+(\sqrt{r+1})^{(n-r-1)/r-1}$
		$\displaystyle=(\sqrt{r+1})^{(n-1)/r-1}\left(1+\frac{1}{\sqrt{r+1}}\right).$

Since by the AM-GM inequality we have

\displaystyle r+\frac{r}{\sqrt{r+1}}=r-\frac{1}{\sqrt{r+1}}+\sqrt{r+1}\geq% \underbrace{1+1+\dots+1}_{r-1\textrm{ 1s}}+\sqrt{r+1}\geq r(\sqrt{r+1})^{1/r},

or equivalently $1+1/\sqrt{r+1}\geq(\sqrt{r+1})^{1/r}$ , we obtain

\displaystyle f(n)\geq(\sqrt{r+1})^{(n-1)/r-1}\cdot(\sqrt{r+1})^{1/r}=(\sqrt{r% +1})^{n/r-1}.

This completes the inductive step, and hence $\mathinner{\!\left\lvert S_{n,r}\right\rvert}\geq(\sqrt{r+1})^{n/r-1}$ for all $n,r\in\mathbb{N}$ . ∎
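The recurrence and the lower bound can be cross-checked by brute force for small parameters. The following sketch uses hypothetical function names and enumerates all $n$ -bit strings, so it is only practical for small $n$ .

```python
import math
from itertools import product


def count_by_recurrence(n: int, r: int) -> int:
    """f(n) = n + 1 for n <= r, and f(n) = f(n-1) + f(n-r-1) otherwise."""
    f = []
    for m in range(n + 1):
        f.append(m + 1 if m <= r else f[m - 1] + f[m - r - 1])
    return f[n]


def count_by_brute_force(n: int, r: int) -> int:
    """Count n-bit strings whose 1s are pairwise separated by at least r 0s."""
    count = 0
    for bits in product((0, 1), repeat=n):
        ones = [i for i, b in enumerate(bits) if b]
        # At least r zeros in between means the positions differ by more than r.
        if all(hi - lo > r for lo, hi in zip(ones, ones[1:])):
            count += 1
    return count


def lower_bound(n: int, r: int) -> float:
    """The bound (sqrt(r + 1))^(n / r - 1) from Lemma 3.3 (requires r >= 1)."""
    return math.sqrt(r + 1) ** (n / r - 1)
```

For example, `count_by_recurrence(7, 2)` and `count_by_brute_force(7, 2)` both return 19, which exceeds `lower_bound(7, 2)`.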

In the following, we fix $k\coloneqq L^{1/3}$ , and let $S\coloneqq 0^{k}\circ S_{L-k,k-1}$ . The proof will focus on binary strings in the set $S$ . We have $\mathinner{\!\left\lvert S\right\rvert}\geq(\sqrt{(k-1)+1})^{(L-k)/(k-1)-1}% \geq 2^{(L^{2/3}\log_{2}L)/6}$ .

Below we characterize some properties of $k$ -mer generating polynomials of strings in $S$ .

Lemma 3.4.

Let $S$ be a set of strings defined as above. For $j=1,2,\dots,k$ , denote by $\mathbf{e}_{j}$ the string with a single “1” located at index $j-1$ (indices begin with 0). The following properties hold.

1.

For any $\mathbf{x}\in S$ and any $k$ -mer $w\notin\set{0^{k},\mathbf{e}_{1},\dots,\mathbf{e}_{k}}$ , $P_{w,\mathbf{x}}(z)$ is the zero polynomial.
2.

For any $\mathbf{x}\in S$ and $1\leq j<k$ , $P_{\mathbf{e}_{j},\mathbf{x}}(z)=(p+(1-p)z)\cdot P_{\mathbf{e}_{j+1},\mathbf{x% }}(z)$ .
3.

For any $\mathbf{x},\mathbf{y}\in S$ and $\mathinner{\!\left\lvert z\right\rvert}\leq 1$ , $\mathinner{\!\left\lvert P_{0^{k},\mathbf{x}}(z)-P_{0^{k},\mathbf{y}}(z)\right% \rvert}\leq k\cdot\mathinner{\!\left\lvert P_{\mathbf{e}_{k},\mathbf{x}}(z)-P_% {\mathbf{e}_{k},\mathbf{y}}(z)\right\rvert}$ .

Proof.

Item 1: By definition of $S$ , $\mathbf{x}[j\mathrel{\mathop{:}}j+k-1]$ contains at most one “1” for any string $\mathbf{x}\in S$ . Therefore, if $w$ contains at least two “1”s, then for any $0\leq\ell<L-k$ ,

\displaystyle K_{w,\mathbf{x}}[\ell]=\sum_{j=0}^{L-k}\binom{j}{\ell}(1-p)^{\ell}p^{j-\ell}\mathbf{1}\Set{\mathbf{x}[j\mathrel{\mathop{:}}j+k-1]=w}=0.

This means all the coefficients of $P_{w,\mathbf{x}}(z)$ are zero, and hence $P_{w,\mathbf{x}}(z)$ is the zero polynomial.

Item 2: Since any two consecutive “1”s in $\mathbf{x}\in S$ are separated by at least $k-1$ “0”s, $\mathbf{x}[i\mathrel{\mathop{:}}i+k-1]=\mathbf{e}_{j}$ if and only if $\mathbf{x}[i-1\mathrel{\mathop{:}}i+k-2]=\mathbf{e}_{j+1}$ . We thus have

	$\displaystyle P_{\mathbf{e}_{j},\mathbf{x}}(z)$	$\displaystyle=\sum_{i=0}^{L-k}\mathbf{1}\Set{\mathbf{x}[i\mathrel{\mathop{:}}i% +k-1]=\mathbf{e}_{j}}\cdot(p+(1-p)z)^{i}$
		$\displaystyle=\sum_{i=1}^{L-k}\mathbf{1}\Set{\mathbf{x}[i-1\mathrel{\mathop{:}% }i+k-2]=\mathbf{e}_{j+1}}\cdot(p+(1-p)z)^{i}$
		$\displaystyle=(p+(1-p)z)\cdot\sum_{i=0}^{L-k-1}\mathbf{1}\Set{\mathbf{x}[i% \mathrel{\mathop{:}}i+k-1]=\mathbf{e}_{j+1}}\cdot(p+(1-p)z)^{i}$
		$\displaystyle=(p+(1-p)z)\cdot P_{\mathbf{e}_{j+1},\mathbf{x}}(z).$

We have used the fact that for $1\leq j<k$ , $\mathbf{1}\Set{\mathbf{x}[0\mathrel{\mathop{:}}k-1]=\mathbf{e}_{j}}=\mathbf{1}% \Set{\mathbf{x}[L-k\mathrel{\mathop{:}}L-1]=\mathbf{e}_{j+1}}=0$ .

Item 3: We observe that $\mathbf{x}[i\mathrel{\mathop{:}}i+k-1]\in\set{0^{k},\mathbf{e}_{1},\dots,% \mathbf{e}_{k}}$ . That implies

	$\displaystyle\sum_{w\in\set{0^{k},\mathbf{e}_{1},\dots,\mathbf{e}_{k}}}P_{w,% \mathbf{x}}(z)$	$\displaystyle=\sum_{w\in\set{0^{k},\mathbf{e}_{1},\dots,\mathbf{e}_{k}}}\sum_{% i=0}^{L-k}\mathbf{1}\Set{\mathbf{x}[i\mathrel{\mathop{:}}i+k-1]=w}\cdot(p+(1-p% )z)^{i}$
		$\displaystyle=\sum_{i=0}^{L-k}\left(\sum_{w\in\set{0^{k},\mathbf{e}_{1},\dots,% \mathbf{e}_{k}}}\mathbf{1}\Set{\mathbf{x}[i\mathrel{\mathop{:}}i+k-1]=w}\right% )(p+(1-p)z)^{i}$
		$\displaystyle=\sum_{i=0}^{L-k}(p+(1-p)z)^{i}.$

Note that the right-hand side is independent of $\mathbf{x}$ . Therefore

	$\displaystyle\mathinner{\!\left\lvert P_{0^{k},\mathbf{x}}(z)-P_{0^{k},\mathbf% {y}}(z)\right\rvert}$	$\displaystyle=\mathinner{\!\left\lvert\sum_{w\in\set{\mathbf{e}_{1},\dots,% \mathbf{e}_{k}}}P_{w,\mathbf{x}}(z)-\sum_{w\in\set{\mathbf{e}_{1},\dots,% \mathbf{e}_{k}}}P_{w,\mathbf{y}}(z)\right\rvert}$
		$\displaystyle\leq\sum_{j=1}^{k}\mathinner{\!\left\lvert P_{\mathbf{e}_{j},% \mathbf{x}}(z)-P_{\mathbf{e}_{j},\mathbf{y}}(z)\right\rvert}$
		$\displaystyle=\sum_{j=1}^{k}\mathinner{\!\left\lvert\left(p+(1-p)z\right)^{k-j% }\cdot\left(P_{\mathbf{e}_{k},\mathbf{x}}(z)-P_{\mathbf{e}_{k},\mathbf{y}}(z)% \right)\right\rvert}$
		$\displaystyle\leq k\cdot\mathinner{\!\left\lvert P_{\mathbf{e}_{k},\mathbf{x}}% (z)-P_{\mathbf{e}_{k},\mathbf{y}}(z)\right\rvert}.$

The second-to-last line is obtained by inductively applying Item 2. ∎
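The structural fact behind Item 1 and the summation identity in Item 3, namely that every length- $k$ window of a string in $S$ is either $0^{k}$ or one of $\mathbf{e}_{1},\dots,\mathbf{e}_{k}$ , can be verified exhaustively for small parameters. The sketch below is illustrative: strings are Python `str` objects, the function names are hypothetical, and $S$ is enumerated as $0^{k}\circ S_{L-k,k-1}$ .

```python
from itertools import product
from typing import List


def strings_in_S(length: int, k: int) -> List[str]:
    """Enumerate 0^k followed by a string whose 1s are separated by >= k-1 zeros."""
    out = []
    for bits in product("01", repeat=length - k):
        ones = [i for i, b in enumerate(bits) if b == "1"]
        # At least k - 1 zeros in between means the positions differ by >= k.
        if all(hi - lo >= k for lo, hi in zip(ones, ones[1:])):
            out.append("0" * k + "".join(bits))
    return out


def allowed_kmers(k: int) -> List[str]:
    """Return [0^k, e_1, ..., e_k], where e_j has its single 1 at index j - 1."""
    return ["0" * k] + ["0" * (j - 1) + "1" + "0" * (k - j) for j in range(1, k + 1)]


def windows(x: str, k: int) -> List[str]:
    """Return all contiguous length-k windows x[i : i+k-1]."""
    return [x[i:i + k] for i in range(len(x) - k + 1)]
```

For `length = 10` and `k = 3` the enumeration produces 19 strings, and each of their windows belongs to `allowed_kmers(3)`, so the indicators over the allowed $k$ -mers sum to 1 at every position.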

Below we give the proof of Lemma 3.2. We use the notations $\exp(x)\coloneqq e^{x}$ , and $\exp_{2}(x)\coloneqq 2^{x}$ .

Proof of Lemma 3.2.

Let $\mathbf{x}\in S$ be a string of length $L$ . In light of Lemma 3.4, we only need to consider a fixed $k$ -mer $w=\mathbf{e}_{k}$ , where $k\leq L^{1/3}$ . Define

\displaystyle g_{\mathbf{x}}(z)\coloneqq\sum_{j=0}^{L-k}\mathbf{1}\Set{\mathbf% {x}[j\colon j+k-1]=\mathbf{e}_{k}}\cdot z^{j}.

Recall that $g_{\mathbf{x}}(p+qz)=\sum_{j=0}^{L-1}K_{\mathbf{e}_{k},\mathbf{x}}[j]\cdot z^{% j}=P_{\mathbf{e}_{k},\mathbf{x}}(z)$ . Denote by $a_{0}(\mathbf{x}),\dots,a_{L-k}(\mathbf{x})$ the Chebyshev coefficients of

\displaystyle f_{\mathbf{x}}(z)\coloneqq g_{\mathbf{x}}(1-4a+4a\cdot z),

where $a\coloneqq L^{-2/3}\log^{1/4}L$ . Equivalently, the $a_{j}(\mathbf{x})$ are the coordinates of $f_{\mathbf{x}}$ in the Chebyshev basis. In other words, we can write

\displaystyle f_{\mathbf{x}}(z)=\sum_{j=0}^{L-k}a_{j}(\mathbf{x})\cdot T_{j}(z).

We first argue that only the first few coefficients are significant. This can be done by applying Theorem 5 to $f_{\mathbf{x}}(z)$ , say, with $\rho=2$ . To that end, we first upper bound $\mathinner{\!\left\lvert f_{\mathbf{x}}(z)\right\rvert}$ for $z\in E_{2}$ . Denoting $z^{\prime}=1-4a+4a\cdot z$ , we have that $z^{\prime}\in\widetilde{E}_{a,2}$ when $z\in E_{2}$ . By item (1) of Proposition 2, we have $\mathinner{\!\left\lvert z^{\prime}\right\rvert}\leq 1+a$ . It follows that

\displaystyle\sup_{z\in E_{2}}\mathinner{\!\left\lvert f_{\mathbf{x}}(z)\right% \rvert}=\sup_{z^{\prime}\in\widetilde{E}_{a,2}}\mathinner{\!\left\lvert g_{% \mathbf{x}}(z^{\prime})\right\rvert}\leq L(1+a)^{L}\leq L\exp(aL)=L\exp(L^{1/3% }\log^{1/4}L).

Therefore, we can apply Theorem 5 to $f_{\mathbf{x}}(z)$ with $\rho=2$ , $M=L\exp(L^{1/3}\log^{1/4}L)$ and get (for large enough $L$ )

\displaystyle\forall j\geq L^{1/3}\sqrt{\log L},\quad\mathinner{\!\left\lvert a_{j}(\mathbf{x})\right\rvert}\leq 2L\exp(L^{1/3}\log^{1/4}L)\cdot 2^{-L^{1/3}\sqrt{\log L}}\leq 2^{-L^{1/3}\sqrt{\log L}/8}.

To each string $\mathbf{x}\in\set{0,1}^{L}$ we associate a vector

\displaystyle\phi(\mathbf{x})\coloneqq\left(a_{j}(\mathbf{x})\colon j=0,1,% \dots,L^{1/3}\sqrt{\log L}-1\right).

Proposition 1 implies each entry of $\phi(\mathbf{x})$ belongs to the interval $[-2(L-k+1),2(L-k+1)]\subseteq[-2L,2L]$ . We now partition $[-2L,2L]$ into $m$ smaller intervals $I_{1},\dots,I_{m}$ , each of length $2^{-L^{1/3}\sqrt{\log L}/8}$ , meaning that $m=4L\cdot 2^{L^{1/3}\sqrt{\log L}/8}$ . The vector $\phi(\mathbf{x})$ must fall into one of the sub-cubes of the form

\displaystyle\mathcal{I}(r)\coloneqq\prod_{0\leq j<L^{1/3}\sqrt{\log L}}I_{r(j% )},

where $r\colon[L^{1/3}\sqrt{\log L}]\rightarrow[m]$ is a mapping that uniquely identifies the sub-cube. It follows that the total number of such sub-cubes is

	$\displaystyle m^{L^{1/3}\sqrt{\log L}}$	$\displaystyle=\left(4L\cdot 2^{L^{1/3}\sqrt{\log L}/8}\right)^{L^{1/3}\sqrt{% \log L}}$
		$\displaystyle=\exp_{2}\left(\left(\frac{L^{1/3}\sqrt{\log L}}{8}+\log_{2}{L}+2% \right)\cdot L^{1/3}\sqrt{\log L}\right)$
		$\displaystyle\leq\exp_{2}\left(\frac{L^{2/3}\log L}{8}+O(L^{1/3}\log^{3/2}{L})% \right)<2^{(L^{2/3}\log L)/6}\leq\mathinner{\!\left\lvert S\right\rvert}$

for large enough $L$ . By the Pigeonhole Principle, there must be two distinct strings $\mathbf{x},\mathbf{y}\in S$ such that $\phi(\mathbf{x}),\phi(\mathbf{y})$ fall into the same sub-cube. In other words, we have

\displaystyle\forall 0\leq j<L^{1/3}\sqrt{\log L},\quad\mathinner{\!\left% \lvert a_{j}(\mathbf{x})-a_{j}(\mathbf{y})\right\rvert}\leq 2^{-L^{1/3}\sqrt{% \log L}/8}.
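The phrase "for large enough $L$ " in the counting step can be made tangible by comparing the two exponents numerically. The sketch below makes two assumptions of our own: $\log$ is read as $\log_{2}$ , and the number of coordinates $L^{1/3}\sqrt{\log L}$ is treated as a real number rather than rounded to an integer. The function names are hypothetical.

```python
import math


def log2_num_subcubes(length: int) -> float:
    """log_2 of m^D, with D = L^(1/3) sqrt(log L) and m = 4L * 2^(D / 8)."""
    dim = length ** (1 / 3) * math.sqrt(math.log2(length))
    log2_m = math.log2(4 * length) + dim / 8
    return dim * log2_m


def log2_size_lower_bound(length: int) -> float:
    """log_2 of the lower bound 2^((L^(2/3) log L) / 6) on |S|."""
    return length ** (2 / 3) * math.log2(length) / 6


def pigeonhole_applies(length: int) -> bool:
    """True when there are fewer sub-cubes than the lower bound on |S|."""
    return log2_num_subcubes(length) < log2_size_lower_bound(length)
```

Under these assumptions the comparison fails at $L=2^{20}$ but succeeds at $L=2^{30}$ , which gives a rough sense of how large $L$ must be for the pigeonhole argument to apply.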

It follows that

	$\displaystyle\sup_{z\in[1-8a,1]}\mathinner{\!\left\lvert g_{\mathbf{x}}(z)-g_{% \mathbf{y}}(z)\right\rvert}$	$\displaystyle=\sup_{z\in[-1,1]}\mathinner{\!\left\lvert f_{\mathbf{x}}(z)-f_{% \mathbf{y}}(z)\right\rvert}$
		$\displaystyle\leq\sup_{z\in[-1,1]}\sum_{j=0}^{L-k}\mathinner{\!\left\lvert a_{% j}(\mathbf{x})-a_{j}(\mathbf{y})\right\rvert}\cdot\mathinner{\!\left\lvert T_{% j}(z)\right\rvert}$
		$\displaystyle\leq\sum_{j=0}^{L^{1/3}\sqrt{\log L}-1}2^{-L^{1/3}\sqrt{\log L}/8% }+\sum_{j=L^{1/3}\sqrt{\log L}}^{L-k}2\cdot 2^{-L^{1/3}\sqrt{\log L}/8}$
		$\displaystyle\leq 2^{-L^{1/3}\sqrt{\log L}/7}.$

Applying Corollary 3.1 to $g_{\mathbf{x}}-g_{\mathbf{y}}$ with $a=L^{-2/3}\log^{1/4}L$ gives (for large enough $L$ )

	$\displaystyle\sup_{z\in\partial\widetilde{E}_{a,2}}\mathinner{\!\left\lvert g_{\mathbf{x}}(z)-g_{\mathbf{y}}(z)\right\rvert}$	$\displaystyle\leq\exp\left(5aL/2\right)\cdot\left(\sup_{z\in[1-8a,1]}\mathinner{\!\left\lvert g_{\mathbf{x}}(z)-g_{\mathbf{y}}(z)\right\rvert}\right)^{1/2}$
		$\displaystyle\leq\exp\left(5L^{1/3}\log^{1/4}L/2\right)\cdot 2^{-L^{1/3}\sqrt{\log L}/14}\leq 2^{-L^{1/3}\sqrt{\log L}/15}.$

Let $\Gamma$ be the sub-arc of the circle $\set{p+qz\colon\mathinner{\!\left\lvert z\right\rvert}=1}$ which lies completely inside the ellipse $\widetilde{E}_{a,2}$ . Item (2) of Proposition 2 implies that the length of $\Gamma$ is at least $a=L^{-2/3}\log^{1/4}L$ . Therefore the Maximum Modulus Principle implies

\displaystyle\sup_{\theta\colon\mathinner{\!\left\lvert\theta\right\rvert}\leq a% }\mathinner{\!\left\lvert P_{\mathbf{e}_{k},\mathbf{x}}(e^{i\theta})-P_{% \mathbf{e}_{k},\mathbf{y}}(e^{i\theta})\right\rvert}\leq\sup_{z\in\Gamma}% \mathinner{\!\left\lvert g_{\mathbf{x}}(z)-g_{\mathbf{y}}(z)\right\rvert}\leq% \sup_{z\in\widetilde{E}_{a,2}}\mathinner{\!\left\lvert g_{\mathbf{x}}(z)-g_{% \mathbf{y}}(z)\right\rvert}\leq 2^{-L^{1/3}\sqrt{\log L}/15}.

Now we have established the lemma for a fixed $k$ -mer $w=\mathbf{e}_{k}$ . Since $\mathbf{x},\mathbf{y}\in S$ , Lemma 3.4 says that for any other $k$ -mer $w\in\set{0,1}^{k}$ either both $P_{w,\mathbf{x}}(z)$ and $P_{w,\mathbf{y}}(z)$ are zero polynomials or $w\in\set{0^{k},\mathbf{e}_{1},\dots,\mathbf{e}_{k}}$ and

\displaystyle\sup_{\theta\colon\mathinner{\!\left\lvert\theta\right\rvert}\leq a}\mathinner{\!\left\lvert P_{w,\mathbf{x}}(e^{i\theta})-P_{w,\mathbf{y}}(e^{i\theta})\right\rvert}\leq k\cdot\sup_{\theta\colon\mathinner{\!\left\lvert\theta\right\rvert}\leq a}\mathinner{\!\left\lvert P_{\mathbf{e}_{k},\mathbf{x}}(e^{i\theta})-P_{\mathbf{e}_{k},\mathbf{y}}(e^{i\theta})\right\rvert}\leq 2^{-L^{1/3}\sqrt{\log L}/20}.

Finally, we note that both $\mathbf{x}$ and $\mathbf{y}$ start with a run of 0’s of length $k=L^{1/3}$ .

∎

Remark 4.

A much simpler proof of the slightly weaker bound $2^{-\Omega(L^{1/3})}$ is possible based on the complex-analytic result of Borwein and Erdélyi [BE97, Theorem 3.3] (see also [DOS19, NP17]): there exist strings $\mathbf{x},\mathbf{y}\in\set{0,1}^{L^{2/3}}$ such that

\displaystyle 2^{-\Omega(L^{1/3})}\geq\sup_{z\in\Gamma_{L^{-1/3}}}\mathinner{\!\left\lvert P_{\mathbf{x}}(z)-P_{\mathbf{y}}(z)\right\rvert}=\sup_{z\in\Gamma_{L^{-2/3}}}\mathinner{\!\left\lvert P_{\mathbf{x}}(z^{L^{1/3}})-P_{\mathbf{y}}(z^{L^{1/3}})\right\rvert},

where $P_{\mathbf{x}}(z)\coloneqq\sum_{j=0}^{\mathinner{\!\left\lvert\mathbf{x}\right\rvert}-1}x_{j}z^{j}$ and $\Gamma_{a}$ stands for the sub-arc $\set{e^{i\theta}\colon\mathinner{\!\left\lvert\theta\right\rvert}<a}$ . Now we observe that $z^{L^{1/3}-1}P_{\mathbf{x}}(z^{L^{1/3}})=P_{\mathbf{x}^{\prime}}(z)$ where $\mathbf{x}^{\prime}\in\set{0,1}^{L}$ is the string obtained by inserting $L^{1/3}-1$ many 0’s before every bit of $\mathbf{x}$ ( $\mathbf{y}^{\prime}$ is defined similarly). Since the factor $z^{L^{1/3}-1}$ has modulus 1 on the unit circle, the supremum above is unchanged when $P_{\mathbf{x}}(z^{L^{1/3}})$ and $P_{\mathbf{y}}(z^{L^{1/3}})$ are replaced by $P_{\mathbf{x}^{\prime}}(z)$ and $P_{\mathbf{y}^{\prime}}(z)$ . Clearly, $\mathbf{x}^{\prime},\mathbf{y}^{\prime}\in S$ since any two 1’s are separated by at least $L^{1/3}-1$ many 0’s. Therefore, they enjoy the properties in Lemma 3.4, and Lemma 3.1 follows with a weaker bound.\footnote{We thank an anonymous reviewer for pointing this observation out to us.}
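The dilation step in this remark can be checked numerically on a toy instance. The following is a minimal sketch, not part of the original argument: the helper names `poly`, `dilate`, and `min_gap_between_ones` and the sample strings are hypothetical, and `d` plays the role of $L^{1/3}$ . With the 0-indexed definition of $P_{\mathbf{x}}$ , inserting $d-1$ zeros before every bit places bit $j$ at index $jd+d-1$ , so $P_{\mathbf{x}^{\prime}}(z)$ and $P_{\mathbf{x}}(z^{d})$ agree up to the factor $z^{d-1}$ , which has modulus 1 on the unit circle.

```python
import cmath


def poly(bits, z):
    """Evaluate P_x(z) = sum_j x_j z^j with 0-indexed coefficients."""
    return sum(b * z**j for j, b in enumerate(bits))


def dilate(bits, d):
    """Insert d-1 zeros before every bit, so bit j lands at index j*d + d-1."""
    out = []
    for b in bits:
        out.extend([0] * (d - 1))
        out.append(b)
    return out


def min_gap_between_ones(bits):
    """Smallest number of 0s between consecutive 1s (None if fewer than two 1s)."""
    ones = [i for i, b in enumerate(bits) if b == 1]
    gaps = [j - i - 1 for i, j in zip(ones, ones[1:])]
    return min(gaps) if gaps else None


# Toy strings (assumed for illustration only).
x = [1, 0, 1, 1]
y = [1, 1, 0, 1]
d = 3
xp, yp = dilate(x, d), dilate(y, d)

# A point on the unit circle near 1.
z = cmath.exp(0.1j)
lhs = abs(poly(xp, z) - poly(yp, z))
rhs = abs(poly(x, z**d) - poly(y, z**d))
```

Here `lhs` and `rhs` coincide up to floating-point error, and every pair of 1's in `xp` and `yp` is separated by at least `d - 1` zeros.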

4 Optimality of the Maximum Likelihood Estimation

Proof of Theorem 3.

For $1\leq i\leq m$ define $S_{i}\coloneqq\set{x\in\Omega\colon D_{0}(x)>D_{i}(x)}$ , and let $S\coloneqq S_{1}\cap S_{2}\cap\dots\cap S_{m}$ . By definition of the total variation distance, we have

\displaystyle 1-\varepsilon\leq d_{\mathrm{TV}}\left(D_{0},D_{i}\right)=D_{0}(S_{i})-D_{i}(S_{i})\leq D_{0}(S_{i}).

The Union Bound thus implies $D_{0}(S)\geq 1-m\varepsilon$ . Moreover, by Definition 3, when $x\in S$ it must hold that $\mathrm{MLE}(x;\mathcal{D})=0$ . Therefore

\displaystyle\Pr_{x\sim D_{0}}\left[\mathrm{MLE}(x;\mathcal{D})=0\right]\geq\Pr_{x\sim D_{0}}\left[x\in S\right]=D_{0}(S)\geq 1-m\varepsilon.

∎
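The argument above can be traced on a small finite example. The following is a minimal sketch under stated assumptions: the three distributions over $\Omega=\set{a,b,c}$ are made up for illustration, the helper names `tv_distance` and `mle` are hypothetical, and `mle` is assumed to return the index of a distribution maximizing the likelihood of the sample, breaking ties toward the smallest index (Definition 3 may fix ties differently). Exact rational arithmetic avoids floating-point ties.

```python
from fractions import Fraction as F


def tv_distance(p, q, omega):
    """Total variation distance between two distributions on a finite domain."""
    return sum(abs(p[x] - q[x]) for x in omega) / 2


def mle(x, dists):
    """Index of a likelihood maximizer for sample x; ties go to the smallest index."""
    return max(range(len(dists)), key=lambda i: (dists[i][x], -i))


omega = ["a", "b", "c"]
dists = [
    {"a": F(9, 10), "b": F(1, 20), "c": F(1, 20)},  # D_0
    {"a": F(1, 20), "b": F(9, 10), "c": F(1, 20)},  # D_1
    {"a": F(1, 20), "b": F(1, 20), "c": F(9, 10)},  # D_2
]
m = len(dists) - 1

# Smallest eps with d_TV(D_0, D_i) >= 1 - eps for all i.
eps = 1 - min(tv_distance(dists[0], dists[i], omega) for i in range(1, m + 1))

# S = intersection of the sets S_i = {x : D_0(x) > D_i(x)}.
S = [x for x in omega if all(dists[0][x] > dists[i][x] for i in range(1, m + 1))]
mass_S = sum(dists[0][x] for x in S)
```

In this toy family `eps` is $3/20$ , the set `S` is $\set{a}$ , and `mass_S` is $9/10\geq 1-m\varepsilon=7/10$ , matching the union-bound step; on every element of `S` the estimator returns index 0.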

Proof of Corollary 1.1.

The Chernoff bound implies that if we repeat the purported reconstruction algorithm $8\ln(1/\varepsilon)n$ times and output the majority, it succeeds with probability at least $1-\varepsilon/2^{n+1}$ .

Let $A$ be the resulting (deterministic) reconstruction algorithm, which uses $T^{\prime}=8\ln(1/\varepsilon)\cdot nT$ traces and outputs the source string $x$ with probability at least $1-\varepsilon/2^{n+1}$ . Formally, for any source string $x\in\set{0,1}^{n}$ , it holds that

\displaystyle\Pr_{\widetilde{x}_{1},\dots,\widetilde{x}_{T^{\prime}}\sim D_{x}}\left[A(\widetilde{x}_{1},\dots,\widetilde{x}_{T^{\prime}})=x\right]\geq 1-\varepsilon/2^{n+1}.

Let $R_{x}\subseteq\left(\set{0,1}^{\leq n}\right)^{T^{\prime}}$ be exactly the collection of $T^{\prime}$ -tuples of strings on which $A$ outputs $x$ . We thus have

\displaystyle\forall x\in\set{0,1}^{n},\quad D_{x}^{\otimes T^{\prime}}(R_{x})\geq 1-\varepsilon/2^{n+1},

where $D_{x}^{\otimes T^{\prime}}$ denotes the $T^{\prime}$ -fold product of $D_{x}$ with itself, capturing the distribution of $\left(\widetilde{x}_{1},\dots,\widetilde{x}_{T^{\prime}}\right)$ . On the other hand, for distinct strings $x$ and $y$ we have $R_{x}\cap R_{y}=\varnothing$ (by definition, $A$ cannot output both $x$ and $y$ on the same input), and hence the bound

\displaystyle D_{y}^{\otimes T^{\prime}}(R_{x})\leq 1-D_{y}^{\otimes T^{\prime}}(R_{y})\leq\varepsilon/2^{n+1}.

This implies

\displaystyle d_{\mathrm{TV}}\left(D_{x}^{\otimes T^{\prime}},D_{y}^{\otimes T^{\prime}}\right)=\sup_{S}\mathinner{\!\left\lvert D_{x}^{\otimes T^{\prime}}(S)-D_{y}^{\otimes T^{\prime}}(S)\right\rvert}\geq D_{x}^{\otimes T^{\prime}}(R_{x})-D_{y}^{\otimes T^{\prime}}(R_{x})\geq 1-2\varepsilon/2^{n+1}=1-\varepsilon/2^{n}.

We stress that the above bound holds for any pair of distinct strings $x,y\in\set{0,1}^{n}$ . Applying Theorem 3 to $\mathcal{D}\coloneqq\set{D_{x}^{\otimes T^{\prime}}\colon x\in\set{0,1}^{n}}$ gives

\displaystyle\forall x\in\set{0,1}^{n},\quad\Pr_{\widetilde{x}_{1},\dots,\widetilde{x}_{T^{\prime}}\sim D_{x}}\left[\mathrm{MLE}(\widetilde{x}_{1},\dots,\widetilde{x}_{T^{\prime}};\mathcal{D})=x\right]\geq 1-(2^{n}-1)\cdot\varepsilon/2^{n}\geq 1-\varepsilon.

∎

Proof of Theorem 4.

The distributions are defined as follows. Let $t=\left\lfloor{n/4}\right\rfloor$ , and so $m=\binom{n}{t}$ . The domain $\Omega=\Omega_{1}\cup\Omega_{2}$ where $\Omega_{1}=\binom{[n]}{t}$ is the collection of all subsets of $[n]$ of size exactly $t$ , and $\Omega_{2}=[n]$ . We have

\displaystyle\mathinner{\!\left\lvert\Omega\right\rvert}=\binom{n}{t}+n=m+n.

We first define $D_{0}$ to be the uniform distribution over $\Omega_{2}$ , i.e., $D_{0}(\mathfrak{s})=1/n$ for any $\mathfrak{s}\in[n]$ .

For each one of the remaining $m$ distributions, we identify it with a $t$ -subset $S\in\binom{[n]}{t}$ . The precise definition of $D_{S}$ is as follows.

\displaystyle\forall S\in\binom{[n]}{t},\quad D_{S}(\mathfrak{s})=\begin{cases}2/3&\textup{if $\mathfrak{s}\in\Omega_{1}$, and $\mathfrak{s}=S$}\\ 1/(3t)&\textup{if $\mathfrak{s}\in\Omega_{2}$, and $\mathfrak{s}\in S$}\\ 0&\textup{otherwise}\end{cases}.

In other words, $\mathfrak{s}\in\Omega_{1}$ occurs with probability $2/3$ , conditioned on which $D_{S}$ is the point distribution supported on $\set{S}$ ; $\mathfrak{s}\in\Omega_{2}$ occurs with probability $1/3$ , conditioned on which $D_{S}$ is the uniform distribution over $S$ . Now we verify that $\mathcal{D}=\set{D_{0},D_{1},\dots,D_{m}}$ satisfies the two conditions.

For Condition 1, consider a distinguisher $A$ which on sample $\mathfrak{s}\in\Omega$ , outputs $S$ if $\mathfrak{s}=S\in\Omega_{1}$ , and outputs 0 if $\mathfrak{s}\in\Omega_{2}$ . We have

\displaystyle\Pr_{\mathfrak{s}\sim D_{0}}\left[A(\mathfrak{s})=0\right]=1\geq 2/3,\quad\Pr_{\mathfrak{s}\sim D_{S}}\left[A(\mathfrak{s})=S\right]=2/3.

To see Condition 2, let $\mathfrak{s}_{1},\dots,\mathfrak{s}_{T}\sim D_{0}$ be $T\leq\left\lfloor{n/4}\right\rfloor$ samples. Since $D_{0}$ is supported on $\Omega_{2}=[n]$ , the samples are all elements of $[n]$ , meaning that there is at least one $S\in\binom{[n]}{t}$ containing all samples. Calculating the likelihoods gives

\displaystyle\prod_{i=1}^{T}D_{S}(\mathfrak{s}_{i})=\left(\frac{1}{3t}\right)^{T}\geq\left(\frac{4}{3n}\right)^{T}>\left(\frac{1}{n}\right)^{T}=\prod_{i=1}^{T}D_{0}(\mathfrak{s}_{i}).

Therefore, the output of the Maximum Likelihood Estimation on $\mathfrak{s}_{1},\dots,\mathfrak{s}_{T}\sim D_{0}$ will never be $0$ .

∎
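For small $n$ the construction can be verified exhaustively. The following is a minimal sketch, assuming $n=8$ (so $t=2$ ); the function names `build_family`, `likelihood`, and `mle_never_outputs_zero` are hypothetical. Elements of $\Omega_{1}$ are represented as frozensets and elements of $\Omega_{2}$ as integers, and exact rationals are used so that the strict inequality between likelihoods is checked without rounding.

```python
from fractions import Fraction
from itertools import combinations, product


def build_family(n):
    """Return t, D_0, and the distributions D_S indexed by t-subsets S of [n]."""
    t = n // 4
    d0 = {i: Fraction(1, n) for i in range(1, n + 1)}  # uniform on Omega_2 = [n]
    family = {}
    for c in combinations(range(1, n + 1), t):
        s = frozenset(c)
        dist = {s: Fraction(2, 3)}  # point mass on S inside Omega_1
        dist.update({i: Fraction(1, 3 * t) for i in s})  # uniform on S inside Omega_2
        family[s] = dist
    return t, d0, family


def likelihood(dist, samples):
    """Product of the probabilities that dist assigns to the samples."""
    prob = Fraction(1)
    for s in samples:
        prob *= dist.get(s, Fraction(0))
    return prob


def mle_never_outputs_zero(n):
    """Check that for every 1 <= T <= t and every sample tuple from the support
    of D_0, some D_S has strictly larger likelihood than D_0."""
    t, d0, family = build_family(n)
    for T in range(1, t + 1):
        for samples in product(range(1, n + 1), repeat=T):
            best_alt = max(likelihood(d, samples) for d in family.values())
            if best_alt <= likelihood(d0, samples):
                return False
    return True


t, d0, family = build_family(8)
result = mle_never_outputs_zero(8)
```

Each $D_{S}$ sums to 1, and `result` confirms on this toy instance that $D_{0}$ is never a likelihood maximizer for $T\leq t$ samples, even though a single sample suffices for the distinguisher in Condition 1.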

5 Acknowledgements

We are thankful to several anonymous reviewers for their valuable suggestions and comments.

References

[BCF ${}^{+}$ 19] Frank Ban, Xi Chen, Adam Freilich, Rocco A Servedio, and Sandip Sinha. Beyond trace reconstruction: Population recovery from the deletion channel. In 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, pages 745–768. IEEE, 2019.
[BE97] Peter Borwein and Tamás Erdélyi. Littlewood-type problems on subarcs of the unit circle. Indiana University Mathematics Journal, pages 1323–1346, 1997.
[BEK99] Peter Borwein, Tamás Erdélyi, and Géza Kós. Littlewood-type problems on $[0,1]$ . Proceedings of the London Mathematical Society, 79(1):22–46, 1999.
[BKKM04] Tugkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing strings from random traces. In J. Ian Munro, editor, Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11-14, 2004, pages 910–918. SIAM, 2004.
[BLS20] Joshua Brakensiek, Ray Li, and Bruce Spang. Coded trace reconstruction in a constant number of traces. In 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, pages 482–493. IEEE, 2020.
[CDK21] Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Approximate trace reconstruction via median string (in average-case). In 41st IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2021, volume 213 of LIPIcs, pages 11:1–11:23, 2021.
[CDL ${}^{+}$ 21a] Xi Chen, Anindya De, Chin Ho Lee, Rocco A Servedio, and Sandip Sinha. Near-optimal average-case approximate trace reconstruction from few traces. arXiv preprint arXiv:2107.11530, 2021. (To appear in SODA 2022).
[CDL ${}^{+}$ 21b] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the smoothed complexity model. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, pages 54–73. SIAM, 2021.
[CDRV21] Mahdi Cheraghchi, Joseph Downs, João L. Ribeiro, and Alexandra Veliche. Mean-based trace reconstruction over practically any replication-insertion channel. In IEEE International Symposium on Information Theory, ISIT 2021, pages 2459–2464. IEEE, 2021.
[CGMR20] Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic, and João Ribeiro. Coded trace reconstruction. IEEE Transactions on Information Theory, 66(10):6084–6103, 2020.
[Cha21a] Zachary Chase. New lower bounds for trace reconstruction. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 57, pages 627–643. Institut Henri Poincaré, 2021.
[Cha21b] Zachary Chase. Separating words and trace reconstruction. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, pages 21–31. ACM, 2021.
[CP21] Zachary Chase and Yuval Peres. Approximate trace reconstruction of random strings from a constant number of traces. arXiv preprint arXiv:2107.06454, 2021.
[DOS19] Anindya De, Ryan O’Donnell, and Rocco A Servedio. Optimal mean-based algorithms for trace reconstruction. The Annals of Applied Probability, 29(2):851–874, 2019.
[DRSR21] Sami Davies, Miklós Z Rácz, Benjamin G Schiffer, and Cyrus Rashtchian. Approximate trace reconstruction: Algorithms. In IEEE International Symposium on Information Theory, ISIT 2021, pages 2525–2530. IEEE, 2021.
[DS03] Miroslav Dudík and Leonard J Schulman. Reconstruction from subsequences. Journal of Combinatorial Theory, Series A, 103(2):337–348, 2003.
[GM17] Ryan Gabrys and Olgica Milenkovic. The hybrid k-deck problem: Reconstructing sequences from short and long traces. In IEEE International Symposium on Information Theory, ISIT 2017, pages 1306–1310. IEEE, 2017.
[GM19] Ryan Gabrys and Olgica Milenkovic. Unique reconstruction of coded strings from multiset substring spectra. IEEE Transactions on Information Theory, 65(12):7682–7696, 2019.
[GSZ22] Elena Grigorescu, Madhu Sudan, and Minshen Zhu. Limitations of mean-based algorithms for trace reconstruction at small edit distance. IEEE Trans. Inf. Theory, 68(10):6790–6801, 2022.
[HHP18] Lisa Hartung, Nina Holden, and Yuval Peres. Trace reconstruction with varying deletion probabilities. In Proceedings of the Fifteenth Workshop on Analytic Algorithmics and Combinatorics, ANALCO 2018, pages 54–61. SIAM, 2018.
[HL20] Nina Holden and Russell Lyons. Lower bounds for trace reconstruction. The Annals of Applied Probability, 30(2):503–525, 2020.
[HMPW08] Thomas Holenstein, Michael Mitzenmacher, Rina Panigrahy, and Udi Wieder. Trace reconstruction with constant deletion probability and related results. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, pages 389–398. SIAM, 2008.
[HPP18] Nina Holden, Robin Pemantle, and Yuval Peres. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, volume 75 of Proceedings of Machine Learning Research, pages 1799–1840. PMLR, 2018.
[KM05] Sampath Kannan and Andrew McGregor. More on reconstructing strings from random traces: insertions and deletions. In IEEE International Symposium on Information Theory, ISIT 2005, pages 297–301. IEEE, 2005.
[KMMP21] Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace reconstruction: Generalized and parameterized. IEEE Transactions on Information Theory, 67(6):3233–3250, 2021.
[KR97] Ilia Krasikov and Yehuda Roditty. On a reconstruction problem for sequences. J. Comb. Theory, Ser. A, 77(2):344–348, 1997.
[Lan13] Serge Lang. Complex analysis, volume 103. Springer Science & Business Media, 2013.
[Lev01a] Vladimir I. Levenshtein. Efficient reconstruction of sequences. IEEE Transactions on Information Theory, 47(1):2–22, 2001.
[Lev01b] Vladimir I. Levenshtein. Efficient reconstruction of sequences from their subsequences or supersequences. J. Comb. Theory, Ser. A, 93(2):310–332, 2001.
[Mas80] John C Mason. Near-best multivariate approximation by Fourier series, Chebyshev series and Chebyshev interpolation. Journal of Approximation Theory, 28(4):349–358, 1980.
[MPV14] Andrew McGregor, Eric Price, and Sofya Vorotnikova. Trace reconstruction revisited. In 22nd Annual European Symposium on Algorithms, ESA 2014, volume 8737 of Lecture Notes in Computer Science, pages 689–700. Springer, 2014.
[MS22] Kayvon Mazooji and Ilan Shomorony. Substring density estimation from traces. CoRR, abs/2210.10917, 2022.
[NP17] Fedor Nazarov and Yuval Peres. Trace reconstruction with $\exp(O(n^{1/3}))$ samples. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 1042–1046. ACM, 2017.
[NR21] Shyam Narayanan and Michael Ren. Circular trace reconstruction. In 12th Innovations in Theoretical Computer Science Conference (ITCS 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2021.
[PZ17] Yuval Peres and Alex Zhai. Average-case reconstruction for the deletion channel: Subpolynomially many traces suffice. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, pages 228–239. IEEE Computer Society, 2017.
[Rub23] Ittai Rubinstein. Average-case to (shifted) worst-case reduction for the trace reconstruction problem. In Kousha Etessami, Uriel Feige, and Gabriele Puppis, editors, 50th International Colloquium on Automata, Languages, and Programming, ICALP 2023, July 10-14, 2023, Paderborn, Germany, volume 261 of LIPIcs, pages 102:1–102:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023.
[SB21] Jin Sima and Jehoshua Bruck. Trace reconstruction with bounded edit distance. In IEEE International Symposium on Information Theory, ISIT 2021, pages 2519–2524. IEEE, 2021.
[Sco97] Alex D Scott. Reconstructing sequences. Discrete Mathematics, 175(1-3):231–238, 1997.
[Tre12] Lloyd N. Trefethen. Approximation Theory and Approximation Practice. SIAM, 2012.
[Tre17] Lloyd Trefethen. Multivariate polynomial approximation in the hypercube. Proceedings of the American Mathematical Society, 145(11):4837–4844, 2017.
[VS08] Krishnamurthy Viswanathan and Ram Swaminathan. Improved string reconstruction over insertion-deletion channels. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, pages 399–408. SIAM, 2008.
[YGM17] S. M. Hossein Tabatabaei Yazdi, Ryan Gabrys, and Olgica Milenkovic. Portable and error-free DNA-based data storage. Scientific Reports, 7:2045–2322, 2017.
