A faster algorithm for the construction of optimal factoring automata

Thomas Erlebach    Kleitos Papadopoulos
Abstract

The problem of constructing optimal factoring automata arises in the context of unification factoring for the efficient execution of logic programs. Given an ordered set of $n$ strings of length $m$, the problem is to construct a trie-like tree structure of minimum size in which the leaves in left-to-right order represent the input strings in the given order. Contrary to standard tries, the order in which the characters of a string are encountered can be different on different root-to-leaf paths. Dawson et al. [ACM Trans. Program. Lang. Syst. 18(5):528–563, 1996] gave an algorithm that solves the problem in time $O(n^2 m(n+m))$. In this paper, we present an improved algorithm with running-time $O(n^2 m)$.

1 Introduction

The execution of programs written in a logic programming language such as Prolog relies on unification as the basic computational mechanism. A Prolog program consists of a set of rules, and the system needs to match the head of the goal with the head of each of the rules that can be unified with the goal. Therefore, preprocessing the rule heads in order to speed up the unification process, called unification factoring, is important. Dawson et al. [2] describe how this preprocessing problem translates into the problem of constructing an optimal factoring automaton. This problem can be viewed as the purely combinatorial problem of computing a certain trie-like tree structure of minimum size for a given ordered set of strings. We focus on this combinatorial problem in the remainder of the paper and refer to [2] for the relevant background on logic programming and for the details of how the preprocessing problem for the rule heads translates into the problem of constructing optimal factoring automata. Unification factoring has been successfully implemented in at least one Prolog system, namely XSB [4].

[Tree diagram not reproduced. Non-leaf nodes and their positions: $r$: 1, $u$: 2, $v$: 2, $w$: 3, $x$: 3, $y$: 3, $z$: 2.]
Figure 1: Optimal FA for the string tuple $\mathcal{S}=(\textsf{aaa},\textsf{bbc},\textsf{aab},\textsf{acb})$. For each non-leaf node $q$, the position $p(q)$ is shown to the right of the node.

The trie-like tree structures of interest can be defined as follows. Given an ordered set $\mathcal{S}$ of $n$ strings $(S_1,S_2,\ldots,S_n)$ of equal length $m\geq 1$ over some alphabet $\Sigma$, a factoring automaton (FA) is a rooted tree $T=(V,E)$ with $n$ leaves, all of which have depth $m$. Each non-leaf node $v$ is labeled with a position $p(v)\in[m]$, and each edge is labeled with a character from $\Sigma$. For an edge $e$ between a non-leaf node $v$ and a child of $v$, we say that $v$ is the parent node of $e$. On each root-to-leaf path, every position in $[m]$ occurs as the label of a non-leaf node exactly once. For every non-leaf node, the edges to its children are ordered, and the labels of any two consecutive such edges are different (but it is possible that the labels of two edges that are not consecutive are the same). The path from the root to the $i$-th leaf, for $i\in[n]$, must produce the string $S_i$. Here, a path $P$ produces the string of length $m$ whose character in position $j$, for $j\in[m]$, is equal to the label of the edge between the node $v$ on $P$ that has label $p(v)=j$ and the child node of $v$ on $P$. We say that the $i$-th leaf represents the string $S_i$, and we may refer to the $i$-th leaf simply as the leaf $S_i$. Figure 1 shows an example of a FA for the ordered set of strings, or string tuple, $\mathcal{S}=(\textsf{aaa},\textsf{bbc},\textsf{aab},\textsf{acb})$.

The size of a FA is the number of edges. For a given ordered set of strings of equal length, the goal of the optimal factoring automaton (OFA) problem is to compute a FA of minimum size. The FA shown in Fig. 1 has size $10$ and is in fact optimal for the string tuple of that example. Dawson et al. [2] show that the OFA problem can be solved in $O(n^2 m(n+m))$ time using dynamic programming. The problem and solution approach are also featured in Skiena’s algorithms textbook as a ‘War Story’ illustrating how dynamic programming can solve practical problems [5, Section 10.1]. In this paper, we present an improved algorithm that solves the problem in $O(n^2 m)$ time, which is better than the previously known time bound by a factor of $n+m$. Furthermore, we show that our algorithm can be implemented using $O(n^2+nm)$ space, which is the same as the space used by the algorithm by Dawson et al. [2].

Dawson et al. [2] also consider a weighted variant of the OFA problem in which the cost of a non-leaf node with label $k$ that has at least two children is $c_{\mathrm{choice}}(k)$ and the cost of an edge with label $c$ whose parent node $v$ has label $p(v)=k$ is $c_{\mathrm{unify}}(k,c)$, where $c_{\mathrm{choice}}$ and $c_{\mathrm{unify}}$ are two non-negative cost functions given as part of the input. The goal of the weighted OFA problem is to compute a FA of minimum total cost. Dawson et al. [2] show that their algorithm for the unweighted OFA problem extends to the weighted case. We show that the same holds for our faster algorithm.

We remark that the variant of the OFA problem where there is no restriction on the order in which the given strings are represented by the leaves (and where the labels of the downward edges incident with a non-leaf node must be pairwise different) has been shown to be NP-complete by Comer and Sethi [1].

The remainder of the paper is organized as follows. Section 2 introduces relevant notation and definitions. Then, in Section 3, we briefly recall the original algorithm by Dawson et al. [2]. In Section 4, we present our faster algorithm for the OFA problem and its adaptation to the weighted OFA problem. We present our conclusions in Section 5.

2 Preliminaries

For any natural number $x$ we write $[x]$ for the set $\{1,2,3,\ldots,x\}$. Let $\mathcal{S}=(S_1,S_2,\ldots,S_n)$ be a string tuple with $n$ strings of length $m$. For $1\leq i\leq n$ and $1\leq j\leq m$, we use $S_i[j]$ to denote the character in position $j$ of the string $S_i$. For $1\leq i\leq i'\leq n$, we write $\mathcal{S}[i,i']$ for $(S_i,S_{i+1},\ldots,S_{i'})$, the tuple of consecutive strings in $\mathcal{S}$ starting with the $i$-th and ending with the $i'$-th. We call any such $\mathcal{S}[i,i']$ a subtuple of $\mathcal{S}$.

For $1\leq i\leq i'\leq n$, we denote by $\mathit{com}(i,i')$ the set of positions with the property that all strings in $\mathcal{S}[i,i']$ have the same character in that position. For our example from Fig. 1 with $\mathcal{S}=(\textsf{aaa},\textsf{bbc},\textsf{aab},\textsf{acb})$, we have $\mathit{com}(1,4)=\emptyset$ and $\mathit{com}(3,4)=\{1,3\}$. We denote the complement of $\mathit{com}(i,i')$ by $\mathit{unc}(i,i')=[m]\setminus\mathit{com}(i,i')$. The function name $\mathit{com}$ is motivated by the fact that $\mathit{com}(i,i')$ represents the positions where all strings in $\mathcal{S}[i,i']$ have the same letter in common, and the name $\mathit{unc}$ is short for uncommon and represents the complement. For $1\leq i\leq j\leq j'\leq i'\leq n$, define $\Delta(i,i',j,j')=\mathit{com}(j,j')\setminus\mathit{com}(i,i')$. Intuitively, $\Delta(i,i',j,j')$ denotes the positions where all strings in $\mathcal{S}[j,j']$ have the same character but not all strings in $\mathcal{S}[i,i']$ have the same character. Observe that $|\Delta(i,i',j,j')|=|\mathit{com}(j,j')|-|\mathit{com}(i,i')|$ as $\mathit{com}(i,i')\subseteq\mathit{com}(j,j')$.
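The three set-valued functions can be written down directly from their definitions. The sketch below is a naive transcription for the running example, with hypothetical Python names and 1-based indices as in the text; it makes no attempt at efficiency.

```python
S = ["aaa", "bbc", "aab", "acb"]  # S_1, ..., S_4 from Fig. 1
m = len(S[0])

def com(i, ip):
    """Positions in which all strings S_i, ..., S_ip carry the same character."""
    return {j for j in range(1, m + 1)
            if len({S[t - 1][j - 1] for t in range(i, ip + 1)}) == 1}

def unc(i, ip):
    """Complement of com(i, ip) within [m]."""
    return set(range(1, m + 1)) - com(i, ip)

def delta(i, ip, j, jp):
    """Positions common to S[j, j'] but not common to the enclosing S[i, i']."""
    assert i <= j <= jp <= ip
    return com(j, jp) - com(i, ip)

print(com(1, 4), com(3, 4), unc(3, 4), delta(1, 4, 3, 4))
```

Since $\mathit{com}(i,i')\subseteq\mathit{com}(j,j')$ for nested ranges, the size of `delta(i, ip, j, jp)` is simply the difference of the two set sizes.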

Let $T=(V,E)$ be a FA for $\mathcal{S}$. Each node $q\in V$ is (implicitly) associated with the subtuple $\tau(q)=\mathcal{S}[\ell_q,r_q]$ of the strings that are represented by leaf descendants of $q$. For the root node $r$ of $T$ we have $\tau(r)=\mathcal{S}$. For the FA shown in Fig. 1, we have $\tau(r)=\mathcal{S}=\mathcal{S}[1,4]$ and $\tau(w)=\tau(z)=\mathcal{S}[3,4]$, for example.

For a node $q\in V$, let $\alpha(q)$ be the set of the positions $p(s)$ that appear as labels of the strict ancestors $s$ of $q$ (and hence the empty set if $q$ is the root of $T$). For the FA shown in Fig. 1, we have $\alpha(r)=\emptyset$, $\alpha(w)=\{1\}$, and $\alpha(z)=\{1,3\}$, for example. Let $\beta(q)=[m]\setminus\alpha(q)$ and note that $p(q)\in\beta(q)$ for every non-leaf node $q$, as each position in $[m]$ occurs only once on each root-to-leaf path in $T$.

For a non-leaf node $q\in V$ with $\tau(q)=\mathcal{S}[\ell_q,r_q]$ and any position $j\in\beta(q)$, the subtuple $\tau(q)$ can be partitioned into subtuples in such a way that the strings in each subtuple have the same character in position $j$, while the strings of consecutive subtuples differ in position $j$. We also refer to these subtuples as runs. Let $\mathit{part}(\ell_q,r_q,j)$ denote the ordered set of these runs, each represented as a triple $(i,i',c)$ with $\ell_q\leq i\leq i'\leq r_q$ and $c\in\Sigma$, meaning that the run is $\mathcal{S}[i,i']$ and all strings in the run have the character $c$ in position $j$. Intuitively, the runs in $\mathit{part}(\ell_q,r_q,j)$ are the subtuples associated with the children of $q$ if $p(q)$ is set to $j$. Let $\mathit{last}(\ell_q,r_q,j)$ denote the last run in $\mathit{part}(\ell_q,r_q,j)$, and let $\mathit{lastj}(\ell_q,r_q,j)$ denote the first component of the triple representing that last run. For the FA shown in Fig. 1, we have $\mathit{part}(1,4,1)=((1,1,\textsf{a}),(2,2,\textsf{b}),(3,4,\textsf{a}))$, $\mathit{last}(1,4,1)=(3,4,\textsf{a})$ and $\mathit{lastj}(1,4,1)=3$, for example.
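A direct way to compute the runs is a single left-to-right scan over the subtuple. The following sketch uses hypothetical Python names and 1-based indices as in the text, and takes $O(r-\ell+1)$ time per call.

```python
S = ["aaa", "bbc", "aab", "acb"]  # S_1, ..., S_4 from Fig. 1

def part(l, r, j):
    """Ordered list of runs (i, i', c) of S[l, r] with respect to position j."""
    runs, start = [], l
    for t in range(l + 1, r + 2):
        # Close the current run at the end of the range or when the
        # character in position j changes.
        if t > r or S[t - 1][j - 1] != S[start - 1][j - 1]:
            runs.append((start, t - 1, S[start - 1][j - 1]))
            start = t
    return runs

def last(l, r, j):
    """The last run of part(l, r, j)."""
    return part(l, r, j)[-1]

def lastj(l, r, j):
    """Index of the first string in the last run."""
    return last(l, r, j)[0]

print(part(1, 4, 1), last(1, 4, 1), lastj(1, 4, 1))
```

Two non-adjacent runs may carry the same character, as the first and third run of `part(1, 4, 1)` do. This mirrors the rule that only consecutive edges of a node need distinct labels.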

Dawson et al. show that in an optimal FA it holds that, for any node $v$ associated with a subtuple $\tau(v)=\mathcal{S}[i,i']$ such that $\mathit{com}(i,i')\cap\beta(v)\neq\emptyset$, the part of the FA below $v$ will start with a path with $|\mathit{com}(i,i')\cap\beta(v)|$ edges whose labels are the letters that all the strings of $\mathcal{S}[i,i']$ have in the positions in $\mathit{com}(i,i')\cap\beta(v)$, in arbitrary order [2, Property 2].

3 The algorithm by Dawson et al.

To aid the understanding of our improved algorithm, we explain in this section the algorithm by Dawson et al. [2] for solving the OFA problem, but using the notation and terminology of our paper. We refer to their algorithm as the DRSS algorithm.

The DRSS algorithm is based on dynamic programming. For $1\leq i\leq i'\leq n$, we define a FA for the uncommon positions of $\mathcal{S}[i,i']$ to be a FA $T'$ for $\mathcal{S}[i,i']$ rooted at a node $v$ but with the assumption that $T'$ is part of a larger FA $T$ and the positions associated with the strict ancestors of $v$ in $T$ are exactly those in $\mathit{com}(i,i')$ in arbitrary order, i.e., $\alpha(v)=\mathit{com}(i,i')$. Let $D(i,i')$ denote the minimum size of a FA for the uncommon positions of $\mathcal{S}[i,i']$. For example, for the instance of Fig. 1, we have $D(3,4)=2$ because the subtree rooted at $z$ in that figure has $2$ edges and is an optimal FA for the uncommon positions (in this case the only uncommon position is position $2$) of $\mathcal{S}[3,4]=(\textsf{aab},\textsf{acb})$.

The key observation by Dawson et al. is that the values $D(i,i')$ for $1\leq i<i'\leq n$ can be computed by dynamic programming via the following equation:

\[ D(i,i')=\min_{k\in\mathit{unc}(i,i')}\sum_{(j,j',c)\in\mathit{part}(i,i',k)}\left(|\Delta(i,i',j,j')|+D(j,j')\right) \tag{1} \]

The base case is $D(i,i)=0$ for all $1\leq i\leq n$. The size of an optimal FA for $\mathcal{S}$ can be calculated as $|\mathit{com}(1,n)|+D(1,n)$. This corresponds to a FA with the following structure: Starting from its root, there is a path consisting of $|\mathit{com}(1,n)|+1$ nodes, the first $|\mathit{com}(1,n)|$ of which are associated with distinct positions in $\mathit{com}(1,n)$. The bottom node of that path is the root of an optimal FA for the uncommon positions of $\mathcal{S}=\mathcal{S}[1,n]$. Equation (1) is correct because it takes the minimum, over all possible choices of the position $k\in\mathit{unc}(i,i')$ that could be used as the label of the root of the FA for the uncommon positions of $\mathcal{S}[i,i']$, of the size of the resulting FA for the uncommon positions of $\mathcal{S}[i,i']$: The root of that FA will have $|\mathit{part}(i,i',k)|$ children, one for each triple in $\mathit{part}(i,i',k)$. For each triple $(j,j',c)\in\mathit{part}(i,i',k)$, the FA will contain a path starting from the root with $|\Delta(i,i',j,j')|$ edges (to be labeled with the characters in the positions in $\Delta(i,i',j,j')=\mathit{com}(j,j')\setminus\mathit{com}(i,i')$), and the bottom node of that path will be the root of an optimal FA for the uncommon positions of $\mathcal{S}[j,j']$. In the example of Fig. 1, the value of $k$ that minimizes the expression for $D(1,4)$ is $k=1$, and the size of the resulting FA is

\begin{align*}
D(1,4) &= \sum_{(j,j',c)\in\mathit{part}(1,4,1)}\left(|\Delta(1,4,j,j')|+D(j,j')\right)\\
&= \left(|\Delta(1,4,1,1)|+D(1,1)\right)+\left(|\Delta(1,4,2,2)|+D(2,2)\right)\\
&\phantom{=}{}+\left(|\Delta(1,4,3,4)|+D(3,4)\right)\\
&= (3+0)+(3+0)+(2+D(3,4))=10
\end{align*}
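Equation (1) can be checked on the running example with a short memoized recursion. This is a naive evaluation of the recurrence, written as a sketch with hypothetical function names. It recomputes $\mathit{com}$ and $\mathit{part}$ from scratch, so it does not attain the DRSS time bound and is not the faster algorithm of this paper.

```python
from functools import lru_cache

S = ["aaa", "bbc", "aab", "acb"]  # example tuple from Fig. 1
n, m = len(S), len(S[0])

def com(i, ip):
    """Positions where all of S_i, ..., S_ip agree (1-based indices)."""
    return frozenset(j for j in range(1, m + 1)
                     if len({S[t - 1][j - 1] for t in range(i, ip + 1)}) == 1)

def part(i, ip, k):
    """Runs (j, j', c) of S[i, i'] with respect to position k."""
    runs, start = [], i
    for t in range(i + 1, ip + 2):
        if t > ip or S[t - 1][k - 1] != S[start - 1][k - 1]:
            runs.append((start, t - 1, S[start - 1][k - 1]))
            start = t
    return runs

@lru_cache(maxsize=None)
def D(i, ip):
    """Direct evaluation of Equation (1) with base case D(i, i) = 0."""
    if i == ip:
        return 0
    common = com(i, ip)
    # Assumes consecutive input strings differ, so unc(i, i') is non-empty.
    return min(
        sum(len(com(j, jp) - common) + D(j, jp) for j, jp, _ in part(i, ip, k))
        for k in range(1, m + 1) if k not in common
    )

optimal_size = len(com(1, n)) + D(1, n)
print(D(3, 4), D(1, 4), optimal_size)
```

On this instance the choice $k=3$ attains the same value as $k=1$, so the minimizing position need not be unique.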

Dawson et al. analyze the running-time of algorithm DRSS as follows. The algorithm computes $O(n^2)$ values $D(i,i')$. Each such value is calculated as the minimum of at most $m$ expressions, one for each value of $k\in\mathit{unc}(i,i')$. For each such expression, $\mathit{part}(i,i',k)$ can be determined in $O(n)$ time. Dawson et al. state that the values $|\Delta(i,i',j,j')|$ can be determined for all $(j,j',c)\in\mathit{part}(i,i',k)$ together in $O(m)$ time, after a preprocessing that requires $O(mn)$ time and space to compute a matrix that allows one to check in $O(1)$ time for given $d\in[m]$ and $1\leq j\leq j'\leq n$ whether all strings in $\mathcal{S}[j,j']$ have the same character in position $d$. This gives a bound of $O(n^2 m(n+m))$ on the running-time of the algorithm. The space usage of their algorithm is $O(n^2+nm)$, as storing the values $D(i,i')$ requires $O(n^2)$ space and the matrix computed during the preprocessing takes $O(nm)$ space.
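The layout of the preprocessing matrix is not spelled out above. One possible realization, given here purely as an assumption, stores for each position $d$ and each index $t$ the first index of the run of equal characters in position $d$ that contains $S_t$. The names `build_run_start` and `same_char` are hypothetical.

```python
S = ["aaa", "bbc", "aab", "acb"]  # example tuple from Fig. 1

def build_run_start(strings):
    """rs[d][t] = smallest s such that S_s, ..., S_t agree in position d.

    Indices are 1-based (row and column 0 are unused); the table has
    O(mn) entries and is filled in O(mn) time.
    """
    n, m = len(strings), len(strings[0])
    rs = [[0] * (n + 1) for _ in range(m + 1)]
    for d in range(1, m + 1):
        for t in range(1, n + 1):
            if t > 1 and strings[t - 1][d - 1] == strings[t - 2][d - 1]:
                rs[d][t] = rs[d][t - 1]   # S_t continues the current run
            else:
                rs[d][t] = t              # S_t starts a new run
    return rs

def same_char(rs, d, j, jp):
    """O(1) test: do all strings in S[j, j'] agree in position d?"""
    return rs[d][jp] <= j

rs = build_run_start(S)
print(same_char(rs, 1, 3, 4), same_char(rs, 2, 3, 4))
```

With such a table, $|\mathit{com}(j,j')|$ for a single pair $(j,j')$ can be obtained in $O(m)$ time by testing each position once.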

4 A faster algorithm for optimal factoring automata

We first describe our algorithm for the OFA problem in Section 4.1. Then, in Section 4.2, we explain how the algorithm can be adapted to the weighted OFA problem. In Section 4.3, we describe the preprocessing step to construct a data structure that is used by the main parts of our algorithms to look up values such as $|\mathit{com}(i,i')|$ for $1\leq i\leq i'\leq n$ efficiently.

4.1 Algorithm for the OFA problem

The key idea of our faster algorithm for the OFA problem is to reuse information from the computation of $D(i,i'-1)$ when calculating $D(i,i')$ in such a way that $D(i,i')$ can be determined in $O(m)$ time. More precisely, we want to evaluate the expression

\[ D(i,i',k)=\sum_{(j,j',c)\in\mathit{part}(i,i',k)}\left(|\Delta(i,i',j,j')|+D(j,j')\right) \tag{2} \]

for each $k\in\mathit{unc}(i,i')$ in $O(1)$ time, so that $D(i,i')=\min_{k\in\mathit{unc}(i,i')}D(i,i',k)$ (cf. Equation (1)) can be obtained in $O(m)$ time.

When going from $D(i,i'-1,k)$ to $D(i,i',k)$, the relevant changes in Equation (2) are as follows. The runs in $\mathit{part}(i,i',k)$ differ from the runs in $\mathit{part}(i,i'-1,k)$, but not by much: $\mathit{part}(i,i',k)$ can be obtained from $\mathit{part}(i,i'-1,k)$ either by adding the string $S_{i'}$ to the last run in $\mathit{part}(i,i'-1,k)$, or by adding a new run consisting only of $S_{i'}$.
Furthermore, the terms $|\Delta(i,i'-1,j,j')|$ in the sum change to $|\Delta(i,i',j,j')|$. To handle the latter change without having to process all terms of the sum separately, we use the identity $|\Delta(i,i',j,j')|=|\mathit{com}(j,j')|-|\mathit{com}(i,i')|$ that we noted earlier to rewrite Equation (2) as follows:

$$\begin{split}D(i,i',k)={}&\left(\sum_{(j,j',c)\in\mathit{part}(i,i',k)}\left(|\mathit{com}(j,j')|+D(j,j')\right)\right)\\&-|\mathit{part}(i,i',k)|\cdot|\mathit{com}(i,i')|\end{split} \tag{3}$$
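The step from Equation (2) to Equation (3) can be spelled out as follows. This is only a worked restatement of the substitution described above: the identity for $|\Delta(i,i',j,j')|$ is inserted into each term, and the constant $|\mathit{com}(i,i')|$ is then subtracted once per run.

```latex
\begin{align*}
D(i,i',k)
  &= \sum_{(j,j',c)\in\mathit{part}(i,i',k)}
     \bigl(|\mathit{com}(j,j')| - |\mathit{com}(i,i')| + D(j,j')\bigr) \\
  &= \sum_{(j,j',c)\in\mathit{part}(i,i',k)}
     \bigl(|\mathit{com}(j,j')| + D(j,j')\bigr)
     \;-\; |\mathit{part}(i,i',k)|\cdot|\mathit{com}(i,i')|
\end{align*}
```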

In this way, the term $|\mathit{com}(i,i')|$ that changes when going from $D(i,i'-1,k)$ to $D(i,i',k)$ is moved outside the sum, and so a large part of the sum (i.e., all terms except possibly the one corresponding to the last run) can be reused when determining $D(i,i',k)$. We now split the expression given for $D(i,i',k)$ in Equation (3) into two separate parts as follows:

$$\begin{aligned}A(i,i',k)&=\sum_{(j,j',c)\in\mathit{part}(i,i',k)}\left(|\mathit{com}(j,j')|+D(j,j')\right)\\B(i,i',k)&=|\mathit{part}(i,i',k)|\cdot|\mathit{com}(i,i')|\end{aligned}$$

Note that $D(i,i')=\min_{k\in\mathit{unc}(i,i')}(A(i,i',k)-B(i,i',k))$. With suitable bookkeeping and preprocessing, the two factors in $B(i,i',k)$ can be computed in constant time, as we will show later. The more challenging task is to compute $A(i,i',k)$ in constant time, which we tackle next.

Regarding $\mathit{part}(i,i',k)$, there are two cases for how it can be obtained from $\mathit{part}(i,i'-1,k)$:

  • Case 1: $S_{i'-1}[k]=S_{i'}[k]$. Let $(j,i'-1,c)=\mathit{last}(i,i'-1,k)$. Then $\mathit{part}(i,i',k)$ is obtained from $\mathit{part}(i,i'-1,k)$ simply by extending the last run, i.e., by changing the run $(j,i'-1,c)$ to $(j,i',c)$. We have $\mathit{last}(i,i',k)=(j,i',c)$ and $\mathit{lastj}(i,i',k)=\mathit{lastj}(i,i'-1,k)=j$.

  • Case 2: $S_{i'-1}[k]\neq S_{i'}[k]$. In this case, $\mathit{part}(i,i',k)$ is obtained by taking $\mathit{part}(i,i'-1,k)$ and appending the new run $(i',i',S_{i'}[k])$. We have $\mathit{last}(i,i',k)=(i',i',S_{i'}[k])$ and $\mathit{lastj}(i,i',k)=i'$.
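The two cases can be mirrored in a short Python sketch. It is meant as an illustration only: the function names `part` and `extend`, the tuple representation of runs, and the 0-based inclusive indices are assumptions made here and are not part of the original formulation (which uses 1-based indices).

```python
def part(S, i, i2, k):
    """Runs (j, j2, c) of equal characters at position k among S[i..i2].

    Indices are 0-based and inclusive (the text uses 1-based indices).
    """
    runs = []
    for t in range(i, i2 + 1):
        c = S[t][k]
        if runs and runs[-1][2] == c:
            runs[-1] = (runs[-1][0], t, c)   # same character: extend the last run
        else:
            runs.append((t, t, c))           # different character: start a new run
    return runs


def extend(runs, S, i2, k):
    """Given runs == part(S, i, i2 - 1, k), return part(S, i, i2, k)."""
    j, _, c = runs[-1]
    if S[i2 - 1][k] == S[i2][k]:
        return runs[:-1] + [(j, i2, c)]      # Case 1: the last run grows by one string
    return runs + [(i2, i2, S[i2][k])]       # Case 2: append a singleton run


S = ["aab", "abb", "abc", "bbc"]
print(part(S, 0, 2, 0))                      # [(0, 2, 'a')]
print(extend(part(S, 0, 2, 0), S, 3, 0))     # [(0, 2, 'a'), (3, 3, 'b')]
```

Only the last run is touched in either case, which is what allows the sum over runs to be updated in constant time.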

In Case 1, we can compute $A(i,i',k)$ from $A(i,i'-1,k)$ as follows:

$$\begin{split}A(i,i',k)={}&A(i,i'-1,k)-(|\mathit{com}(j_l,i'-1)|+D(j_l,i'-1))\\&+(|\mathit{com}(j_l,i')|+D(j_l,i'))\end{split}$$

where $j_l=\mathit{lastj}(i,i'-1,k)$. This is correct because all terms of the sum except the final one are the same in $A(i,i'-1,k)$ and $A(i,i',k)$, so it suffices to subtract the final term of the sum for $A(i,i'-1,k)$ and add the final term of the sum for $A(i,i',k)$. In Case 2, the formula becomes:

$$A(i,i',k)=A(i,i'-1,k)+(|\mathit{com}(i',i')|+D(i',i'))=A(i,i'-1,k)+m$$
Data: $n$ strings $S_1,S_2,\ldots,S_n$ of equal length $m$
Result: $D(i,i')$ for $1\leq i\leq i'\leq n$
 1  for $i\leftarrow n$ downto $1$ do
 2      $D(i,i)\leftarrow 0$;
 3      for $k\leftarrow 1$ to $m$ do
            $p(k)\leftarrow 1$;    /* $p(k)$ is $|\mathit{part}(i,i,k)|$ */
            $l(k)\leftarrow i$;    /* $l(k)$ is $\mathit{lastj}(i,i,k)$ */
            $a(k)\leftarrow m$;    /* $a(k)$ is $A(i,i,k)=|\mathit{com}(i,i)|+D(i,i)=m$ */
 5      end for
 6      for $i'\leftarrow i+1$ to $n$ do
 7          foreach $k\in\mathit{unc}(i,i')$ do
 8              if $S_{i'-1}[k]=S_{i'}[k]$ then    /* Case 1; $p(k)$ and $l(k)$ remain unchanged */
 9                  $a(k)\leftarrow a(k)-(|\mathit{com}(l(k),i'-1)|+D(l(k),i'-1))+(|\mathit{com}(l(k),i')|+D(l(k),i'))$;    /* $a(k)$ is now $A(i,i',k)$ */
11              else    /* Case 2 */
                    $p(k)\leftarrow p(k)+1$;    /* $p(k)$ is $|\mathit{part}(i,i',k)|$ */
                    $l(k)\leftarrow i'$;    /* $l(k)$ is $\mathit{lastj}(i,i',k)$ */
                    $a(k)\leftarrow a(k)+m$;    /* $a(k)$ is $A(i,i',k)$ */
13              end if
15          end foreach
16          $D(i,i')\leftarrow\min_{k\in\mathit{unc}(i,i')}\left(a(k)-p(k)\cdot|\mathit{com}(i,i')|\right)$;
17          $k^*(i,i')\leftarrow$ the value of $k$ that yielded the minimum for $D(i,i')$;
18          foreach $k\in\mathit{com}(i,i')$ do    /* $p(k)$ and $l(k)=i$ remain unchanged */
19              $a(k)\leftarrow a(k)-(|\mathit{com}(i,i'-1)|+D(i,i'-1))+(|\mathit{com}(i,i')|+D(i,i'))$;
21          end foreach
23      end for
25  end for
Algorithm 1: Algorithm for the OFA problem
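Algorithm 1 can be transcribed into Python roughly as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the name `ofa_tables` and the Boolean array `common` are introduced here, indices are 0-based, the sizes $|\mathit{com}(i,i')|$ are maintained incrementally inside the main loop instead of coming from the data structure of Section 4.3, and consecutive input strings are assumed to be distinct so that $\mathit{unc}(i,i')$ is never empty. The array `a` is initialised to $m$, the value of $A(i,i,k)=|\mathit{com}(i,i)|+D(i,i)$.

```python
def ofa_tables(S):
    """Return (D, kstar, com) for a list S of distinct, equal-length strings.

    D[i][i2] is D(i, i2), kstar[i][i2] is the minimising position k, and
    com[i][i2] is |com(i, i2)|, all with 0-based inclusive indices.
    """
    n, m = len(S), len(S[0])
    D = [[0] * n for _ in range(n)]
    kstar = [[None] * n for _ in range(n)]
    com = [[m] * n for _ in range(n)]      # com[i][i] = m; other entries set below
    for i in range(n - 1, -1, -1):
        p = [1] * m                         # p[k] = |part(i, i2, k)|
        l = [i] * m                         # l[k] = start index of the last run
        a = [m] * m                         # a[k] = A(i, i, k) = m
        common = [True] * m                 # common[k]: is k in com(i, i2)?
        for i2 in range(i + 1, n):
            for k in range(m):
                if S[i2 - 1][k] != S[i2][k]:
                    common[k] = False
            c = com[i][i2] = sum(common)
            best = None
            for k in range(m):              # positions in unc(i, i2) first
                if common[k]:
                    continue
                if S[i2 - 1][k] == S[i2][k]:    # Case 1: last run is extended
                    j = l[k]
                    a[k] += (com[j][i2] + D[j][i2]) - (com[j][i2 - 1] + D[j][i2 - 1])
                else:                           # Case 2: new singleton run
                    p[k] += 1
                    l[k] = i2
                    a[k] += m
                value = a[k] - p[k] * c
                if best is None or value < best:
                    best, kstar[i][i2] = value, k
            D[i][i2] = best
            for k in range(m):              # then positions in com(i, i2)
                if common[k]:
                    a[k] += (c + D[i][i2]) - (com[i][i2 - 1] + D[i][i2 - 1])
    return D, kstar, com


S = ["aab", "abb", "abc", "bbc"]
D, kstar, com = ofa_tables(S)
print(com[0][3] + D[0][3])   # size of an optimal FA for S: 8
```

For these four strings, branching first on the last position splits the input into two runs of two strings each, and the resulting tree has 8 edges.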

For an efficient implementation, we first create a data structure that allows us to look up any value $|\mathit{com}(i,i')|$ for $1\leq i\leq i'\leq n$ in constant time and any set $\mathit{com}(i,i')$ or $\mathit{unc}(i,i')$ in $O(m)$ time. The data structure also allows us to determine any $\mathit{part}(i,i',k)$ in $O(|\mathit{part}(i,i',k)|)$ time. The preprocessing carried out to create this data structure using $O(n^2m)$ time and $O(n^2+nm)$ space is described in Section 4.3.
Furthermore, we maintain arrays $a$, $p$ and $l$ of size $m$ that satisfy the following invariant: At the time when the algorithm considers the indices $i$ and $i'$, the entries in those arrays satisfy $a(k)=A(i,i',k)$, $p(k)=|\mathit{part}(i,i',k)|$ and $l(k)=\mathit{lastj}(i,i',k)$ for all $1\leq k\leq m$. When progressing from a pair $(i,i'-1)$ to a pair $(i,i')$, the $O(m)$ values in these three arrays can be updated in constant time per entry, along the lines discussed above. The resulting algorithm for computing $D(i,i')$ for all $1\leq i\leq i'\leq n$ in $O(n^2m)$ time is shown as pseudocode in Algorithm 1.
In addition to the space $O(nm+n^2)$ used for the preprocessing, the algorithm uses $O(m)$ space for the three arrays $p,l,a$ and $O(n^2)$ space for $D$, so the total space usage is $O(mn+n^2)$. Note that for each pair $(i,i')$ with $i'>i$ the algorithm processes the values of $k$ in $\mathit{unc}(i,i')$ before those in $\mathit{com}(i,i')$. This is important because the formula for updating $a(k)$ for $k\in\mathit{com}(i,i')$ uses the value $D(i,i')$ (see Line 19), but that value can only be calculated (see Line 16) once $a(k)$ has been determined for all $k\in\mathit{unc}(i,i')$.
Note that $l(k)=\mathit{lastj}(i,i',k)>i$ in Line 9 as $\mathit{part}(i,i',k)$ contains at least two runs if $k\in\mathit{unc}(i,i')$, so the values $D(l(k),i'-1)$ and $D(l(k),i')$ accessed in Line 9 have both been computed already (as the for-loop in Line 1 iterates through the values of $i$ in decreasing order).
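To make the stated lookup bounds concrete, the following Python sketch shows one possible data structure that meets them. It is an assumption made for illustration and not necessarily the structure of Section 4.3: the table `reach` and the function names are hypothetical, and indices are 0-based. Here `reach[i][k]` is the largest index $r\geq i$ such that $S_i,\ldots,S_r$ all agree in position $k$, so $k\in\mathit{com}(i,i')$ exactly when `reach[i][k] >= i2`.

```python
def preprocess(S):
    """Build reach (O(nm) space) and com_size (O(n^2) space) in O(n^2 m) time."""
    n, m = len(S), len(S[0])
    reach = [[n - 1] * m for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for k in range(m):
            # The block of equal characters starting at i continues into i + 1,
            # or it ends at i itself.
            reach[i][k] = reach[i + 1][k] if S[i][k] == S[i + 1][k] else i
    com_size = [[sum(1 for k in range(m) if reach[i][k] >= i2) if i2 >= i else 0
                 for i2 in range(n)] for i in range(n)]
    return reach, com_size


def com(reach, i, i2):
    """Positions in which S[i..i2] all agree, in O(m) time."""
    return [k for k in range(len(reach[i])) if reach[i][k] >= i2]


def unc(reach, i, i2):
    """Positions in which S[i..i2] do not all agree, in O(m) time."""
    return [k for k in range(len(reach[i])) if reach[i][k] < i2]


def part(S, reach, i, i2, k):
    """Runs of part(i, i2, k), one jump per run, so O(|part(i, i2, k)|) time."""
    runs, j = [], i
    while j <= i2:
        j2 = min(reach[j][k], i2)
        runs.append((j, j2, S[j][k]))
        j = j2 + 1
    return runs


S = ["aab", "abb", "abc", "bbc"]
reach, com_size = preprocess(S)
print(com_size[0][1], com(reach, 0, 1))   # 2 [0, 2]
print(part(S, reach, 0, 3, 2))            # [(0, 1, 'b'), (2, 3, 'c')]
```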

When the values $D(i,i')$ (together with the values $k^*(i,i')$ that represent the choice of $k$ that yields the minimum in the formula for $D(i,i')$, see Line 17) have been calculated using Algorithm 1, the size of the optimal FA is given by $|\mathit{com}(1,n)|+D(1,n)$. To construct the optimal FA itself, we proceed as follows: Create a root node $v_1$. Let $x=|\mathit{com}(1,n)|$. If $x>0$, create a path $(v_1,v_2,v_3,\ldots,v_{x+1})$ such that the nodes $v_1,\ldots,v_x$ are labeled with the positions in $\mathit{com}(1,n)$ and the downward edges of these nodes are labeled with the characters in those positions of the strings in $\mathcal{S}$.
Label $v_{x+1}$ with $p(v_{x+1})=k^*(1,n)$. All the nodes $v_j$ with $1\leq j\leq x+1$ have $\tau(v_j)=\mathcal{S}$. Create a downward edge from $v_{x+1}$ with label $c$ for each run $(j,j',c)$ in $\mathit{part}(1,n,k^*(1,n))$. Note that the child node $w$ at the bottom of a downward edge for run $(j,j',c)$ has $\tau(w)=\mathcal{S}[j,j']$.
For any child node $w$ with $\tau(w)=\mathcal{S}[j,j']$ for $j<j'$ of a parent node $v$ with $\tau(v)=\mathcal{S}[i,i']$, construct the FA rooted at $w$ analogously: Start with a path of $|\Delta(i,i',j,j')|$ edges (the first of which is the edge from $v$ to $w$ and is given label $S_j[k^*(i,i')]$), assign label $p(z)=k^*(j,j')$ to the node $z$ at the end of that path, and create a downward edge from $z$ with label $c$ for each run $(r,r',c)$ in $\mathit{part}(j,j',k^*(j,j'))$.
The recursion stops at depth m𝑚mitalic_m, i.e., when the nodes created are leaves of the optimal FA.
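The recursive construction can be illustrated with a small Python sketch. It is not the authors' implementation: the function names `build_fa` and `leaf_strings`, the dictionary-based node representation, and the 0-indexed table `k_star` (mapping a pair `(j, j2)` to the chosen position) are assumptions made for illustration. The sketch also assumes that adjacent input strings are distinct, that a child representing a single string is completed by chaining its remaining positions down to a leaf, and that the positions on a non-branching path may appear in any order. For brevity it recomputes common positions naively instead of using the preprocessing data structure.

```python
def build_fa(strings, k_star):
    """Build a factoring automaton as nested dicts (illustrative sketch).

    strings: list of n equal-length strings (0-indexed here).
    k_star:  hypothetical dict mapping (j, j2) to the branching position chosen
             for strings[j..j2]; assumed to have been computed beforehand.
    """
    n, m = len(strings), len(strings[0])

    def common(j, j2):
        # Positions on which strings[j..j2] all agree (naive O(nm) check).
        return [h for h in range(m)
                if all(strings[r][h] == strings[j][h] for r in range(j, j2 + 1))]

    def runs(j, j2, k):
        # Maximal runs (start, end, char) of equal characters at position k.
        out, start = [], j
        for r in range(j + 1, j2 + 2):
            if r > j2 or strings[r][k] != strings[start][k]:
                out.append((start, r - 1, strings[start][k]))
                start = r
        return out

    def subtree(j, j2, done):
        # `done` holds the positions already consumed on the path from the root.
        pending = [h for h in common(j, j2) if h not in done]
        return chain(j, j2, pending, done | set(pending))

    def chain(j, j2, pending, done):
        if pending:  # non-branching node: exactly one child
            h = pending[0]
            return {"pos": h, "edges": [(strings[j][h], chain(j, j2, pending[1:], done))]}
        if j == j2:  # all m positions consumed: leaf for strings[j]
            return {"leaf": j}
        k = k_star[(j, j2)]  # branching node: one edge per run
        return {"pos": k,
                "edges": [(c, subtree(r, r2, done | {k})) for r, r2, c in runs(j, j2, k)]}

    return subtree(0, n - 1, frozenset())


def leaf_strings(node, m, assigned=None):
    """Return (leaf index, string spelled by its root-to-leaf path), left to right."""
    assigned = dict(assigned or {})
    if "leaf" in node:
        return [(node["leaf"], "".join(assigned[h] for h in range(m)))]
    out = []
    for ch, child in node["edges"]:
        out.extend(leaf_strings(child, m, {**assigned, node["pos"]: ch}))
    return out
```

Edges are stored as a list rather than a dictionary keyed by character, because two non-adjacent runs at the same position may carry the same character. Calling `leaf_strings` on the result is a simple sanity check that the leaves spell the input strings in the given order.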

The data structure described in Section 4.3 also allows us to iterate over the runs in $\mathit{part}(i,i',k)$, for any given values $i$, $i'$, and $k$, in time $O(|\mathit{part}(i,i',k)|)$. For a node $v$ in the optimal FA that is a leaf or has more than one child, let $w$ be the node of maximum depth among all strict ancestors of $v$ that have more than one child (or the root if no strict ancestor of $v$ has more than one child). Assume that $\tau(v) = \mathcal{S}[j,j']$ and $\tau(w) = \mathcal{S}[i,i']$. In our construction of the optimal FA, it takes $O(m)$ time to determine $\Delta(i,i',j,j')$ and create the path from $w$ to $v$.
If $v$ is not a leaf, it takes $O(|\mathit{part}(j,j',k^*(j,j'))|)$ time to create the $|\mathit{part}(j,j',k^*(j,j'))|$ children of $v$. Adding up these times over all the $O(n)$ nodes $v$ that have more than one child or are leaves, the total time for constructing paths is $O(mn)$ and the total time for creating children of nodes with more than one child is $O(n)$. Thus, once the values $D(i,i')$ and $k^*(i,i')$ have been computed using Algorithm 1, the optimal FA can be constructed in $O(nm)$ time.

Thus, we obtain the following theorem.

Theorem 1.

The OFA problem can be solved in $O(n^2 m)$ time using $O(nm + n^2)$ space.

4.2 Adaptation to the weighted OFA problem

Recall that, in the weighted OFA problem, the cost of a non-leaf node with label $k$ that has at least two children is $c_{\mathrm{choice}}(k)$ and the cost of an edge with label $c$ whose parent node $v$ has label $p(v) = k$ is $c_{\mathrm{unify}}(k,c)$. With $D_w(i,i')$ denoting the minimum cost of a FA for the uncommon positions of $\mathcal{S}[i,i']$, Equation (1) can be adapted to the weighted case as follows [2]:

$$D_w(i,i') = \min_{k \in \mathit{unc}(i,i')} c_{\mathrm{choice}}(k) + \sum_{(j,j',c) \in \mathit{part}(i,i',k)} \bigl( \Delta_w(i,i',j,j') + D_w(j,j') \bigr)\,,$$

where $\Delta_w(i,i',j,j') = \sum_{h \in \Delta(i,i',j,j')} c_{\mathrm{unify}}(h, S_j[h])$. We again consider the terms for each value of $k$ separately:

$$D_w(i,i',k) = c_{\mathrm{choice}}(k) + \sum_{(j,j',c) \in \mathit{part}(i,i',k)} \bigl( \Delta_w(i,i',j,j') + D_w(j,j') \bigr) \tag{4}$$

If we let $\mathit{com}_w(i,i') = \sum_{h \in \mathit{com}(i,i')} c_{\mathrm{unify}}(h, S_i[h])$, we have $\Delta_w(i,i',j,j') = \mathit{com}_w(j,j') - \mathit{com}_w(i,i')$ and can rewrite (4) as follows (analogously to (3)):

$$\begin{aligned}
D_w(i,i',k) ={}& c_{\mathrm{choice}}(k) + \Bigl( \sum_{(j,j',c) \in \mathit{part}(i,i',k)} \bigl( \mathit{com}_w(j,j') + D_w(j,j') \bigr) \Bigr) \\
& - |\mathit{part}(i,i',k)| \cdot \mathit{com}_w(i,i')
\end{aligned}$$
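The rewriting step can be spelled out explicitly. It uses only the identity $\Delta_w(i,i',j,j') = \mathit{com}_w(j,j') - \mathit{com}_w(i,i')$ stated above and the observation that $\mathit{com}_w(i,i')$ does not depend on the run, so it is subtracted once per run in $\mathit{part}(i,i',k)$. The identity itself presumably rests on $\Delta(i,i',j,j')$ consisting of the positions that are common to $\mathcal{S}[j,j']$ but not to $\mathcal{S}[i,i']$; that reading is an assumption here, since the definition of $\Delta$ is given earlier in the paper.

```latex
\begin{align*}
\sum_{(j,j',c) \in \mathit{part}(i,i',k)} \bigl( \Delta_w(i,i',j,j') + D_w(j,j') \bigr)
  &= \sum_{(j,j',c) \in \mathit{part}(i,i',k)}
     \bigl( \mathit{com}_w(j,j') - \mathit{com}_w(i,i') + D_w(j,j') \bigr) \\
  &= \Bigl( \sum_{(j,j',c) \in \mathit{part}(i,i',k)}
     \bigl( \mathit{com}_w(j,j') + D_w(j,j') \bigr) \Bigr)
     - |\mathit{part}(i,i',k)| \cdot \mathit{com}_w(i,i')
\end{align*}
```

Adding $c_{\mathrm{choice}}(k)$ to both sides gives the rewritten form of (4).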

We define

$$A_w(i,i',k) = \sum_{(j,j',c) \in \mathit{part}(i,i',k)} \bigl( \mathit{com}_w(j,j') + D_w(j,j') \bigr)$$

and can compute these values analogously: In Algorithm 1, we change Line 1 to

$$a(k) \leftarrow a(k) - \bigl( \mathit{com}_w(l(k), i'-1) + D_w(l(k), i'-1) \bigr) + \bigl( \mathit{com}_w(l(k), i') + D_w(l(k), i') \bigr)$$

and Line 1 to

$$a(k) \leftarrow a(k) + \mathit{com}_w(i',i')$$

and the computation of $D(i,i')$ in Line 1 becomes:

$$D_w(i,i') = \min_{k \in \mathit{unc}(i,i')} \Bigl( c_{\mathrm{choice}}(k) + a(k) - |\mathit{part}(i,i',k)| \cdot \mathit{com}_w(i,i') \Bigr)$$

As also shown in the following section, the data structure constructed in the preprocessing can be extended (without affecting the asymptotic time and space bounds) so that it can be used to determine $\mathit{com}_w(i,i')$ in constant time for any $1 \le i \le i' \le n$ as well.
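For cross-checking an implementation on small inputs, the weighted recurrence can also be evaluated directly by memoized recursion. The Python sketch below does exactly that, in the rewritten form with $\mathit{com}_w$. It is a naive reference evaluation, not the incremental Algorithm 1, and it does not achieve the $O(n^2 m)$ bound. The function name `weighted_cost`, the 0-indexed positions, the cost functions passed as callables, and the base case (cost 0 when no uncommon position remains) are assumptions made for illustration.

```python
from functools import lru_cache


def weighted_cost(strings, c_choice, c_unify):
    """Naively evaluate D_w(0, n-1) for 0-indexed strings (reference sketch only).

    c_choice(k):     cost of a branching node labelled with position k.
    c_unify(k, ch):  cost of an edge labelled ch below a node labelled k.
    """
    n, m = len(strings), len(strings[0])

    def is_common(i, i2, h):
        return all(strings[r][h] == strings[i][h] for r in range(i, i2 + 1))

    def com_w(i, i2):
        # Weighted count of the positions common to strings[i..i2].
        return sum(c_unify(h, strings[i][h]) for h in range(m) if is_common(i, i2, h))

    def runs(i, i2, k):
        # Maximal runs (start, end, char) of equal characters at position k.
        out, start = [], i
        for r in range(i + 1, i2 + 2):
            if r > i2 or strings[r][k] != strings[start][k]:
                out.append((start, r - 1, strings[start][k]))
                start = r
        return out

    @lru_cache(maxsize=None)
    def D(i, i2):
        unc = [k for k in range(m) if not is_common(i, i2, k)]
        if not unc:  # assumed base case: nothing left to distinguish
            return 0
        base = com_w(i, i2)
        best = None
        for k in unc:
            part = runs(i, i2, k)
            a_k = sum(com_w(j, j2) + D(j, j2) for j, j2, _ in part)
            cost = c_choice(k) + a_k - len(part) * base
            best = cost if best is None else min(best, cost)
        return best

    return D(0, n - 1)
```

With unit costs, `weighted_cost(["ab", "ac"], lambda k: 1, lambda k, ch: 1)` evaluates to 3: one branching node plus the two edges below it. The shared first character is not counted, since $D_w$ covers only the uncommon positions.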

4.3 Preprocessing

For a given string tuple $\mathcal{S} = (S_1, \ldots, S_n)$ with $n$ strings of length $m$ each, we want to compute a data structure that allows us to determine $|\mathit{com}(i,i')|$ or $\mathit{com}_w(i,i')$ in constant time, to compute $\mathit{com}(i,i')$ or $\mathit{unc}(i,i')$ in $O(m)$ time, and to compute $\mathit{part}(i,i',k)$ in $O(|\mathit{part}(i,i',k)|)$ time.

First, we create an $n \times m$ matrix $R$ such that $R(i,k)$ is the number of consecutive strings in $\mathcal{S}$, starting with $S_i$, that have the same character in position $k$. Formally, $R(i,k) = \max\{\ell \mid S_i[k] = S_{i+1}[k] = \cdots = S_{i+\ell-1}[k]\}$. $R$ can be computed in $O(nm)$ time by setting $R(n,k) = 1$ for all $k$ and using the equation

$$R(i,k) = \begin{cases} R(i+1,k) + 1 & \text{if } S_i[k] = S_{i+1}[k] \\ 1 & \text{if } S_i[k] \neq S_{i+1}[k] \end{cases}$$

for $1 \le i \le n-1$ (in order of decreasing $i$) and $1 \le k \le m$.
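The recurrence for $R$ translates almost directly into code. The following Python sketch fills the matrix bottom-up; the function name `run_length_matrix` is chosen for illustration, and indices are 0-based, so the paper's $R(i,k)$ corresponds to `R[i-1][k-1]`.

```python
def run_length_matrix(strings):
    """R[i][k] = number of consecutive strings, starting at strings[i],
    that share the character at position k (0-indexed sketch)."""
    n = len(strings)
    m = len(strings[0]) if n else 0
    # The last row is all ones; every other entry also starts at 1.
    R = [[1] * m for _ in range(n)]
    # Fill rows in order of decreasing i, as in the recurrence.
    for i in range(n - 2, -1, -1):
        for k in range(m):
            if strings[i][k] == strings[i + 1][k]:
                R[i][k] = R[i + 1][k] + 1
    return R
```

Each entry is computed in constant time from the row below it, which gives the $O(nm)$ bound.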

Now, compute an $n \times n$ matrix $C$ by setting $C(i,i') = |\{k \in [m] \mid R(i,k) \ge i'-i+1\}|$ for $1 \le i \le i' \le n$. $C(i,i')$ contains the number of positions $k$ such that there are at least $i'-i+1$ consecutive strings, starting with $S_i$, in $\mathcal{S}$ that have the same character in position $k$. This shows that $C(i,i') = |\mathit{com}(i,i')|$. The computation of $C$ takes time $O(n^2 m)$.

If we want to handle the weighted OFA problem, we additionally (or instead of $C$) compute an $n \times n$ matrix $C_w$ by setting

$$C_w(i,i') = \sum_{\substack{k \in [m] \\ R(i,k) \ge i'-i+1}} c_{\mathrm{unify}}(k, S_i[k])\,.$$

Each entry of $C_w$ can be computed in $O(m)$ time, and we have $C_w(i,i') = \mathit{com}_w(i,i')$.
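Both tables can be filled in a single pass over all pairs $i \le i'$. The Python sketch below takes the matrix $R$ as input and returns $C$ and, if a cost function is supplied, $C_w$. The function name `common_tables`, the 0-based indexing, and the representation of $c_{\mathrm{unify}}$ as a callable are assumptions made for illustration. Entries below the diagonal are unused and left at 0.

```python
def common_tables(strings, R, c_unify=None):
    """Return (C, Cw) for 0-indexed strings and run-length matrix R.

    C[i][i2]  = number of positions common to strings[i..i2].
    Cw[i][i2] = sum of c_unify(k, strings[i][k]) over those positions,
                or None if no cost function is given.
    """
    n = len(strings)
    m = len(strings[0]) if n else 0
    C = [[0] * n for _ in range(n)]
    Cw = [[0] * n for _ in range(n)] if c_unify is not None else None
    for i in range(n):
        for i2 in range(i, n):
            span = i2 - i + 1  # number of strings in strings[i..i2]
            for k in range(m):
                if R[i][k] >= span:  # position k is common to the whole range
                    C[i][i2] += 1
                    if Cw is not None:
                        Cw[i][i2] += c_unify(k, strings[i][k])
    return C, Cw
```

The three nested loops make the $O(n^2 m)$ running time and the $O(n^2)$ space for the tables directly visible.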

The computation of $R$, $C$ and/or $C_w$ takes $O(n^2 m)$ time and $O(nm + n^2)$ space.

Once $R$ and $C$ and/or $C_w$ have been computed, queries can be answered as follows: To determine $|\mathit{com}(i,i')|$ in constant time, we return $C(i,i')$. To determine $\mathit{com}_w(i,i')$ in constant time, we return $C_w(i,i')$. To list the positions in $\mathit{com}(i,i')$ in $O(m)$ time, we check for each position $k \in [m]$ whether it satisfies $R(i,k) \ge i'-i+1$, and return the positions that meet this condition. For $\mathit{unc}(i,i')$, we change the condition to $R(i,k) < i'-i+1$.
To list the runs in $\mathit{part}(i,i',k)$ in $O(|\mathit{part}(i,i',k)|)$ time, we proceed as follows: The first run is $(i, \min\{i + R(i,k) - 1, i'\}, S_i[k])$. Once a run $(j,j',c)$ with $j' < i'$ has been determined, the run following it is $(j'+1, \min\{j' + R(j'+1,k), i'\}, S_{j'+1}[k])$. Therefore, each run in $\mathit{part}(i,i',k)$ can be determined in constant time.
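The three list queries can be sketched in Python as follows. The function names `common_positions`, `uncommon_positions` and `list_runs` are chosen for illustration, and all indices are 0-based, so a run `(j, j2, c)` covers `strings[j..j2]` inclusive.

```python
def common_positions(R, i, i2):
    """Positions k on which strings[i..i2] all agree, in O(m) time."""
    span = i2 - i + 1
    return [k for k in range(len(R[i])) if R[i][k] >= span]


def uncommon_positions(R, i, i2):
    """Positions k on which strings[i..i2] do not all agree, in O(m) time."""
    span = i2 - i + 1
    return [k for k in range(len(R[i])) if R[i][k] < span]


def list_runs(strings, R, i, i2, k):
    """Runs (start, end, char) of part(i, i2, k), one constant-time step per run."""
    out = []
    j = i
    while j <= i2:
        # The run starting at j extends R[j][k] strings, clipped to the range end.
        j_end = min(j + R[j][k] - 1, i2)
        out.append((j, j_end, strings[j][k]))
        j = j_end + 1
    return out
```

Because each loop iteration jumps directly to the start of the next run, the number of iterations equals the number of runs, independent of how long the runs are.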

5 Conclusions

In this paper, we have given an algorithm that solves the OFA problem for $n$ given strings of equal length $m$ in $O(n^2 m)$ time and $O(n(n+m))$ space. The algorithm can be adapted to the weighted OFA problem in the same time and space bounds. The running-time of our algorithm is better than that of the previously known algorithm by Dawson et al. [2] by a factor of $n+m$. The main idea leading to the improvement is reusing information from previously computed entries of the dynamic programming table when computing new entries.

Our algorithms may be parallelizable using methods such as exploiting table-parallelism [3], which has been used to parallelize the previously known algorithm for the OFA problem.

References

  • [1] Douglas Comer and Ravi Sethi. The complexity of trie index construction. Journal of the ACM, 24(3):428–440, 1977.
  • [2] Steven Dawson, C. R. Ramakrishnan, Steven Skiena, and Terrance Swift. Principles and practice of unification factoring. ACM Trans. Program. Lang. Syst., 18(5):528–563, 1996.
  • [3] Juliana Freire, Rui Hu, Terrance Swift, and David S. Warren. Exploiting parallelism in tabled evaluations. In 7th International Symposium on Programming Languages: Implementations, Logics and Programs (PLILP’95), LNCS 982, pages 115–132. Springer, 1995.
  • [4] Prasad Rao, Konstantinos Sagonas, Terrance Swift, David S. Warren, and Juliana Freire. XSB: A system for efficiently computing well-founded semantics. In Jürgen Dix, Ulrich Furbach, and Anil Nerode, editors, 4th International Conference on Logic Programming And Nonmonotonic Reasoning (LPNMR’97), LNAI 1265, pages 430–440. Springer, 1997.
  • [5] Steven Skiena. The Algorithm Design Manual, Third Edition. Texts in Computer Science. Springer, 2020.