Unveiling the connection between the Lyndon factorization and the canonical inverse Lyndon factorization via a border property

Paola Bonizzoni, Brian Riccardi
Dip. di Informatica, Sistemistica e Comunicazione
University of Milano-Bicocca
viale Sarca 336, 20126 Milan, Italy
{paola.bonizzoni,[email protected]}unimib.it
Clelia De Felice, Rocco Zaccagnino, Rosalba Zizza
Dip. di Informatica
University of Salerno
via Giovanni Paolo II 132, 84084 Fisciano, Italy
{cdefelice,rzaccagnino,rzizza}email@email
Abstract

The notions of Lyndon word and Lyndon factorization have been shown to have unexpected applications in theory as well as in developing novel algorithms on words. A counterpart to these notions are those of inverse Lyndon word and inverse Lyndon factorization. Unlike Lyndon words, inverse Lyndon words may be bordered. The relationship between the two factorizations is related to the inverse lexicographic ordering, and has only recently been explored. More precisely, a main open question is how to get an inverse Lyndon factorization from a classical Lyndon factorization under the inverse lexicographic ordering, named $\operatorname{CFL}_{in}$. In this paper we reveal a strong connection between these two factorizations where the border plays a relevant role. More precisely, we show two main results. We say that a factorization has the border property if a nonempty border of a factor cannot be a prefix of the next factor. First we show that there exists a unique inverse Lyndon factorization having the border property. Then we show that this unique factorization with the border property is the so-called canonical inverse Lyndon factorization, named $\operatorname{ICFL}$. By showing that $\operatorname{ICFL}$ is obtained by compacting factors of the Lyndon factorization over the inverse lexicographic ordering, we provide a linear time algorithm for computing $\operatorname{ICFL}$ from $\operatorname{CFL}_{in}$.

Keywords Lyndon words · Lyndon factorization · Combinatorial algorithms on words

1 Introduction

The theoretical investigation of combinatorial properties of well-known word factorizations is a research topic that has recently attracted special interest, especially for improving the efficiency of algorithms. Among these, the Lyndon factorization introduced by Chen, Fox and Lyndon in [1], named $\operatorname{CFL}$, undoubtedly stands out. Any word $w$ admits a unique factorization $\operatorname{CFL}(w)$, that is, a lexicographically non-increasing sequence of factors which are Lyndon words. A Lyndon word $w$ is strictly lexicographically smaller than each of its proper cyclic shifts, or, equivalently, than each of its nonempty proper suffixes [2]. Interesting applications of the Lyndon factorization and Lyndon words are the development of the bijective Burrows-Wheeler Transforms [3, 4, 5] and a novel algorithm for sorting suffixes [6]. In particular, the notion of a Lyndon word has been rediscovered various times as a theoretical tool to locate short motifs [7] and relevant k-mers in bioinformatics applications [8]. In this line of research, Lyndon-based word factorizations have been explored to define a novel feature representation for biological sequences, based on theoretical combinatorial properties proved to capture sequence similarities [9].

The notion of a Lyndon word has a counterpart, namely the notion of an inverse Lyndon word, i.e., a word lexicographically greater than each of its nonempty proper suffixes. Inverting the relation between a word and its suffixes, as between Lyndon words and inverse Lyndon words, leads to different properties. Indeed, a word may admit more than one inverse Lyndon factorization, that is, a factorization into a nonincreasing product of inverse Lyndon words. Among them, the Canonical Inverse Lyndon Factorization, named $\operatorname{ICFL}$, was introduced in [10]. $\operatorname{ICFL}$ maintains the main properties of $\operatorname{CFL}$: it is unique and can be computed in linear time. In addition, it maintains a similar Compatibility Property, used for obtaining the sorting of the suffixes of $w$ (“global suffixes”) by using the sorting of the suffixes of each factor of $\operatorname{CFL}(w)$ (“local suffixes”) [11]. Most notably, $\operatorname{ICFL}(w)$ has another interesting property [10, 12, 13]: we can provide an upper bound on the length of the longest common prefix of two substrings of a word $w$ starting from different positions.

A relationship between $\operatorname{ICFL}(w)$ and $\operatorname{CFL}(w)$ has been proved by using the notion of grouping [10]. Let $\operatorname{CFL}_{in}(w)$ be the Lyndon factorization of $w$ with respect to the inverse lexicographic order. It is proved that $\operatorname{ICFL}(w)$ is obtained by concatenating the factors of a non-increasing maximal chain with respect to the prefix order, denoted by $\mathcal{PMC}$, in $\operatorname{CFL}_{in}(w)$ (see Section 6). Despite this result, the connection between $\operatorname{CFL}_{in}(w)$ and the inverse Lyndon factorization has remained obscure, mainly because a word may have multiple inverse Lyndon factorizations.

In this paper, we explore this connection between $\operatorname{CFL}_{in}$ and the inverse Lyndon factorizations. Our first main contribution consists in showing that there is a unique inverse Lyndon factorization of a word that has the border property. The border property states that any nonempty border of a factor cannot be a prefix of the next factor. We further highlight the aforementioned connection by proving that the inverse Lyndon factorization with the border property is a compact factorization (Definition 12), i.e., each inverse Lyndon factor is the concatenation of compact factors, each obtained by concatenating the longest sequence of identical words in a $\mathcal{PMC}$. We then show the second contribution of this paper: this unique factorization is $\operatorname{ICFL}$ itself, and we provide a simpler linear time algorithm for computing $\operatorname{ICFL}$. Our algorithm is based on a new property that characterizes $\operatorname{ICFL}(w)$: the last factor in an inverse Lyndon factorization of $w$ with the border property is the longest suffix of $w$ that is an inverse Lyndon word. Recall that the Lyndon factorization of $w$ has a similar property: the last factor is the longest suffix of $w$ that is a Lyndon word.

2 Words

Throughout this paper we follow [14, 15, 16, 17, 18] for the notation. We denote by $\Sigma^{*}$ the free monoid generated by a finite alphabet $\Sigma$ and we set $\Sigma^{+} = \Sigma^{*} \setminus 1$, where $1$ is the empty word. For a word $w \in \Sigma^{*}$, we denote by $|w|$ its length. A word $x \in \Sigma^{*}$ is a factor of $w \in \Sigma^{*}$ if there are $u_1, u_2 \in \Sigma^{*}$ such that $w = u_1 x u_2$. If $u_1 = 1$ (resp. $u_2 = 1$), then $x$ is a prefix (resp. suffix) of $w$. A factor (resp. prefix, suffix) $x$ of $w$ is proper if $x \neq w$. Two words $x, y$ are incomparable for the prefix order, denoted as $x \Join y$, if neither $x$ is a prefix of $y$ nor $y$ is a prefix of $x$. Otherwise, $x, y$ are comparable for the prefix order. We write $x \leq_p y$ if $x$ is a prefix of $y$ and $x \geq_p y$ if $y$ is a prefix of $x$. The notion of a pair of words comparable (or incomparable) for the suffix order is defined symmetrically.

We recall that, given a nonempty word $w$, a border of $w$ is a word which is both a proper prefix and a suffix of $w$ [19]. The longest proper prefix of $w$ which is a suffix of $w$ is also called the border of $w$ [19, 17]. A word $w \in \Sigma^{+}$ is bordered if it has a nonempty border. Otherwise, $w$ is unbordered. A nonempty word $w$ is primitive if $w = x^k$ implies $k = 1$. An unbordered word is primitive. A sesquipower of a word $x$ is a word $w = x^n p$ where $p$ is a proper prefix of $x$ and $n \geq 1$. Two words $x, y$ are called conjugate if there exist words $u, v$ such that $x = uv$, $y = vu$. The conjugacy relation is an equivalence relation. A conjugacy class is a class of this equivalence relation.
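The notions of border, primitive word and sesquipower can be made concrete with a small sketch. The function names below are illustrative choices, not notation from the paper, and words are represented as Python strings.

```python
def borders(w: str) -> list[str]:
    """Return every border of w, shortest first (the empty word included)."""
    # A border is a proper prefix of w that is also a suffix of w.
    return [w[:k] for k in range(len(w)) if w.endswith(w[:k])]


def is_bordered(w: str) -> bool:
    """True if w has at least one nonempty border."""
    return any(b for b in borders(w))


def is_primitive(w: str) -> bool:
    """True if w is nonempty and w = x^k forces k = 1."""
    n = len(w)
    # Only lengths d dividing n can be the length of a root x with w = x^(n/d).
    return n > 0 and all(w != w[:d] * (n // d) for d in range(1, n) if n % d == 0)


def is_sesquipower_of(w: str, x: str) -> bool:
    """True if w = x^n p with n >= 1 and p a proper prefix of x."""
    if not x or len(w) < len(x):
        return False
    n, r = divmod(len(w), len(x))
    return w == x * n + x[:r]
```

For instance, `borders("dabda")` returns `["", "da"]`, so `dabda` is bordered, while `dab` has only the empty border.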

Definition 1.

Let $(\Sigma, <)$ be a totally ordered alphabet. The lexicographic (or alphabetic) order $\prec$ on $(\Sigma^{*}, <)$ is defined by setting $x \prec y$ if

  • $x$ is a proper prefix of $y$, or

  • $x = ras$, $y = rbt$, $a < b$, for $a, b \in \Sigma$ and $r, s, t \in \Sigma^{*}$.

In the rest of the paper we will implicitly refer to totally ordered alphabets. For two nonempty words $x, y$, we write $x \ll y$ if $x \prec y$ and $x$ is not a proper prefix of $y$ [20]. We also write $y \succ x$ if $x \prec y$. Basic properties of the lexicographic order are recalled below.

Lemma 1.

For $x, y \in \Sigma^{+}$, the following properties hold.

  • (1) $x \prec y$ if and only if $zx \prec zy$, for every word $z$.

  • (2) If $x \ll y$, then $xu \ll yv$ for all words $u, v$.

  • (3) If $x \prec y \prec xz$ for a word $z$, then $y = xy'$ for some word $y'$ such that $y' \prec z$.

  • (4) If $x \ll y$ and $y \ll z$, then $x \ll z$.
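The two relations $\prec$ and $\ll$ and the four items of Lemma 1 can be checked mechanically on short words. The following is a minimal sketch, assuming the alphabet order coincides with Python's character order, so that `x < y` on strings is exactly $x \prec y$. The names `prec`, `ll` and `lemma1_holds` are illustrative.

```python
from itertools import product


def prec(x: str, y: str) -> bool:
    """x ≺ y: lexicographic order, where a proper prefix is smaller."""
    return x < y


def ll(x: str, y: str) -> bool:
    """x ≪ y: x ≺ y and x is not a proper prefix of y."""
    return x < y and not y.startswith(x)


def lemma1_holds(words: list[str]) -> bool:
    """Brute-force check of items (1)-(4) of Lemma 1 on the given nonempty words."""
    ext = [""] + words  # auxiliary words z, u, v may be empty
    for x in words:
        for y in words:
            # (1) x ≺ y iff zx ≺ zy
            if any(prec(x, y) != prec(z + x, z + y) for z in ext):
                return False
            # (2) x ≪ y implies xu ≪ yv
            if ll(x, y) and not all(ll(x + u, y + v) for u in ext for v in ext):
                return False
            # (3) x ≺ y ≺ xz implies y = xy' with y' ≺ z
            for z in ext:
                if prec(x, y) and prec(y, x + z):
                    if not (y.startswith(x) and prec(y[len(x):], z)):
                        return False
            # (4) transitivity of ≪
            if any(ll(x, y) and ll(y, z) and not ll(x, z) for z in words):
                return False
    return True


# All nonempty words over {a, b} of length at most 3.
WORDS = ["".join(p) for n in range(1, 4) for p in product("ab", repeat=n)]
```

Calling `lemma1_holds(WORDS)` exercises every item on this small set. Note that `prec("dab", "dabd")` holds while `ll("dab", "dabd")` does not, since `dab` is a proper prefix of `dabd`.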

Let $\mathcal{S}_1, \ldots, \mathcal{S}_t$ be sequences, with $\mathcal{S}_j = (s_{j,1}, \ldots, s_{j,r_j})$. For abbreviation, we let $(\mathcal{S}_1, \ldots, \mathcal{S}_t)$ stand for the sequence $(s_{1,1}, \ldots, s_{1,r_1}, \ldots, s_{t,1}, \ldots, s_{t,r_t})$.

3 Lyndon words

Definition 2.

A Lyndon word $w \in \Sigma^{+}$ is a word which is primitive and the smallest one in its conjugacy class for the lexicographic order.

Example 1.

Let $\Sigma = \{a, b\}$ with $a < b$. The words $a$, $b$, $aaab$, $abbb$, $aabab$ and $aababaabb$ are Lyndon words. On the contrary, $abab$, $aba$ and $abaab$ are not Lyndon words.
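Definition 2 translates directly into a test on cyclic shifts: a word is primitive and minimal in its conjugacy class exactly when it is strictly smaller than each of its proper rotations. The sketch below assumes that the alphabet order is Python's character order; `is_lyndon` is an illustrative name.

```python
def is_lyndon(w: str) -> bool:
    """True if w is strictly smaller than each of its proper cyclic shifts."""
    # A non-primitive word equals one of its proper rotations, so the strict
    # inequality also rules out powers such as "abab".
    return len(w) > 0 and all(w < w[i:] + w[:i] for i in range(1, len(w)))
```

Applied to Example 1, the function accepts `a`, `b`, `aaab`, `abbb`, `aabab`, `aababaabb` and rejects `abab`, `aba`, `abaab`.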

Proposition 1.

Each Lyndon word $w$ is unbordered.

A class of conjugacy is also called a necklace and often identified with the minimal word for the lexicographic order in it. We will adopt this terminology. Then a word is a necklace if and only if it is a power of a Lyndon word. A prenecklace is a prefix of a necklace. Then clearly any nonempty prenecklace $w$ has the form $w = (uv)^k u$, where $uv$ is a Lyndon word, $u \in \Sigma^{*}$, $v \in \Sigma^{+}$, $k \geq 1$, that is, $w$ is a sesquipower of a Lyndon word $uv$. The following result has been proved in [21]. It shows that the nonempty prefixes of Lyndon words are exactly the nonempty prefixes of the powers of Lyndon words with the exclusion of $c^k$, where $c$ is the maximal letter and $k \geq 2$.

Proposition 2.

A word is a nonempty prefix of a Lyndon word if and only if it is a sesquipower of a Lyndon word distinct from $c^k$, where $c$ is the maximal letter and $k \geq 2$.

In the following, $L = L_{(\Sigma^{*}, <)}$ will be the set of Lyndon words, totally ordered by the relation $\prec$ on $(\Sigma^{*}, <)$.

Theorem 3.1.

Any word $w \in \Sigma^{+}$ can be written in a unique way as a nonincreasing product $w = \ell_1 \ell_2 \cdots \ell_h$ of Lyndon words, i.e., in the form

$$w = \ell_1 \ell_2 \cdots \ell_h, \quad \text{with } \ell_j \in L \text{ and } \ell_1 \succeq \ell_2 \succeq \ldots \succeq \ell_h \tag{3.1}$$

The sequence $\operatorname{CFL}(w) = (\ell_1, \ldots, \ell_h)$ in Eq. (3.1) is called the Lyndon decomposition (or Lyndon factorization) of $w$. It is denoted by $\operatorname{CFL}(w)$ because Theorem 3.1 is usually credited to Chen, Fox and Lyndon [1]. The following result, proved in [21], is necessary for our aims.

Corollary 1.

Let $w \in \Sigma^{+}$, let $\ell_1$ be its longest prefix which is a Lyndon word and let $w'$ be such that $w = \ell_1 w'$. If $w' \neq 1$, then $\operatorname{CFL}(w) = (\ell_1, \operatorname{CFL}(w'))$.
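Corollary 1 suggests a direct, though inefficient, way to compute $\operatorname{CFL}(w)$: repeatedly strip the longest prefix that is a Lyndon word. The sketch below is for illustration only and runs in polynomial rather than linear time. It assumes the alphabet order is Python's character order and uses the suffix characterization of Lyndon words recalled in the Introduction. The names `is_lyndon` and `cfl_naive` are illustrative.

```python
def is_lyndon(w: str) -> bool:
    """True if the nonempty word w is strictly smaller than each nonempty proper suffix."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))


def cfl_naive(w: str) -> list[str]:
    """Lyndon factorization obtained by iterating Corollary 1."""
    factors = []
    while w:
        # Longest prefix of w that is a Lyndon word; a single letter always qualifies.
        k = max(k for k in range(1, len(w) + 1) if is_lyndon(w[:k]))
        factors.append(w[:k])
        w = w[k:]
    return factors
```

For example, `cfl_naive("abaab")` yields `["ab", "aab"]`, a nonincreasing product of Lyndon words as required by Theorem 3.1.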

Sometimes we need to emphasize consecutive equal factors in $\operatorname{CFL}$. We write $\operatorname{CFL}(w) = (\ell_1^{n_1}, \ldots, \ell_r^{n_r})$ to denote a tuple of $n_1 + \ldots + n_r$ Lyndon words, where $r > 0$, $n_1, \ldots, n_r \geq 1$. Precisely, $\ell_1 \succ \ldots \succ \ell_r$ are Lyndon words, also named Lyndon factors of $w$. There is a linear time algorithm to compute the pair $(\ell_1, n_1)$ and thus, by iteration, the Lyndon factorization of $w$ [22, 17]. Linear time algorithms may also be found in [21] and in the more recent paper [23].
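A linear-time computation of this kind can be sketched with Duval's algorithm, a standard method that finds a Lyndon factor together with its number of consecutive repetitions in each round. Whether it matches the cited algorithms line by line is an assumption here. The alphabet order is assumed to be Python's character order, and `cfl` and `cfl_powers` are illustrative names.

```python
from itertools import groupby


def cfl(w: str) -> list[str]:
    """Lyndon factorization of w by Duval's algorithm."""
    n, i, factors = len(w), 0, []
    while i < n:
        # w[i:j] is a prenecklace whose Lyndon root has length j - k.
        j, k = i + 1, i
        while j < n and w[k] <= w[j]:
            k = i if w[k] < w[j] else k + 1
            j += 1
        # Emit every full copy of the Lyndon root found in this round.
        while i <= k:
            factors.append(w[i:i + j - k])
            i += j - k
    return factors


def cfl_powers(w: str) -> list[tuple[str, int]]:
    """Factorization written as pairs (l_i, n_i), grouping equal consecutive factors."""
    return [(f, len(list(g))) for f, g in groupby(cfl(w))]
```

With this sketch, `cfl_powers("banana")` returns `[("b", 1), ("an", 2), ("a", 1)]`, which is the form $(\ell_1^{n_1}, \ldots, \ell_r^{n_r})$ with $\ell_1 \succ \ell_2 \succ \ell_3$.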

4 Inverse Lyndon words

For the material in this section see [10, 12, 13]. Inverse Lyndon words are related to the inverse alphabetic order. Its definition is recalled below.

Definition 3.

Let $(\Sigma, <)$ be a totally ordered alphabet. The inverse $<_{in}$ of $<$ is defined by

$$\forall a, b \in \Sigma \quad b <_{in} a \Leftrightarrow a < b$$

The inverse lexicographic or inverse alphabetic order on $(\Sigma^{*}, <)$, denoted $\prec_{in}$, is the lexicographic order on $(\Sigma^{*}, <_{in})$.

Example 2.

Let $\Sigma = \{a, b, c, d\}$ with $a < b < c < d$. Then $dab \prec dabd$ and $dabda \prec dac$. We have $d <_{in} c <_{in} b <_{in} a$. Therefore $dab \prec_{in} dabd$ and $dac \prec_{in} dabda$.

Of course for all $x, y \in \Sigma^{*}$ such that $x \Join y$,

$$y \prec_{in} x \Leftrightarrow x \prec y.$$

Moreover, in this case $x \ll y$. This justifies the adopted terminology.

From now on, $L_{in} = L_{(\Sigma^{*}, <_{in})}$ denotes the set of the Lyndon words on $\Sigma^{*}$ with respect to the inverse lexicographic order. Following [24], a word $w \in L_{in}$ will be named an anti-Lyndon word. Correspondingly, an anti-prenecklace will be a prefix of an anti-necklace, which in turn will be a necklace with respect to the inverse lexicographic order.

In the following, we denote by $\operatorname{CFL}_{in}(w)$ the Lyndon factorization of $w$ with respect to the inverse order $<_{in}$.
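Both $\prec_{in}$ and $\operatorname{CFL}_{in}$ can be sketched by reversing the letter comparisons. The code below assumes that $<$ is Python's character order. `prec_in` compares two words under $\prec_{in}$, and `cfl_in` is a Duval-style scan with reversed letter comparisons; using Duval's algorithm here is an assumption, since the text does not fix a particular method. Both names are illustrative.

```python
def prec_in(x: str, y: str) -> bool:
    """x ≺_in y: lexicographic order with respect to the inverse letter order."""
    for a, b in zip(x, y):
        if a != b:
            return a > b  # a <_in b holds exactly when b < a
    return len(x) < len(y)  # a proper prefix is still the smaller word


def cfl_in(w: str) -> list[str]:
    """Lyndon factorization of w with respect to the inverse order <_in."""
    n, i, factors = len(w), 0, []
    while i < n:
        j, k = i + 1, i
        # Same scan as Duval's algorithm, with every letter comparison reversed.
        while j < n and w[k] >= w[j]:
            k = i if w[k] > w[j] else k + 1
            j += 1
        while i <= k:
            factors.append(w[i:i + j - k])
            i += j - k
    return factors
```

On the words of Example 2, `prec_in("dab", "dabd")` and `prec_in("dac", "dabda")` both hold. For the word `dabdadacddbdc`, this sketch returns the three factors `dab`, `dadac`, `ddbdc`.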

Definition 4.

A word $w \in \Sigma^{+}$ is an inverse Lyndon word if $s \prec w$, for each nonempty proper suffix $s$ of $w$.

Example 3.

The words $a$, $b$, $aaaaa$, $bbba$, $baaab$, $bbaba$ and $bbababbaa$ are inverse Lyndon words on $\{a, b\}$, with $a < b$. On the contrary, $aaba$ is not an inverse Lyndon word since $aaba \prec ba$. Analogously, $aabba \prec ba$ and thus $aabba$ is not an inverse Lyndon word.
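Definition 4 can be tested directly by comparing a word with each of its nonempty proper suffixes. A minimal sketch, assuming the alphabet order is Python's character order and using the illustrative name `is_inverse_lyndon`:

```python
def is_inverse_lyndon(w: str) -> bool:
    """True if s ≺ w for each nonempty proper suffix s of the nonempty word w."""
    return len(w) > 0 and all(w[i:] < w for i in range(1, len(w)))
```

On Example 3 the function accepts `a`, `b`, `aaaaa`, `bbba`, `baaab`, `bbaba`, `bbababbaa` and rejects `aaba` and `aabba`, in both cases because of the suffix `ba`. Note that `baaab` is accepted although it is bordered.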

The following result, proved in [10, 13], and also in [25], summarizes some properties of the inverse Lyndon words.

Proposition 3.

Let $w \in \Sigma^{+}$. Then we have

  1. The word $w$ is an anti-Lyndon word if and only if it is an unbordered inverse Lyndon word.

  2. The word $w$ is an inverse Lyndon word if and only if $w$ is a nonempty anti-prenecklace.

  3. If $w$ is an inverse Lyndon word, then any nonempty prefix of $w$ is an inverse Lyndon word.
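Items 1 and 3 of Proposition 3 lend themselves to a brute-force sanity check on short words. Item 2 is left out of this sketch. The code assumes the alphabet order is Python's character order; all function names are illustrative. Since rotations of a word have the same length, a word is anti-Lyndon exactly when it is strictly greater, in the usual order, than each of its proper rotations.

```python
from itertools import product


def is_inverse_lyndon(w: str) -> bool:
    """True if every nonempty proper suffix of the nonempty word w is smaller than w."""
    return len(w) > 0 and all(w[i:] < w for i in range(1, len(w)))


def is_bordered(w: str) -> bool:
    """True if some nonempty proper suffix of w is also a prefix of w."""
    return any(w.startswith(w[-k:]) for k in range(1, len(w)))


def is_anti_lyndon(w: str) -> bool:
    """True if w is a Lyndon word with respect to the inverse order."""
    return len(w) > 0 and all(w > w[i:] + w[:i] for i in range(1, len(w)))


def check_proposition3(alphabet: str = "abc", max_len: int = 6) -> bool:
    """Check items 1 and 3 on every word over `alphabet` up to length `max_len`."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            inv = is_inverse_lyndon(w)
            # Item 1: anti-Lyndon iff unbordered inverse Lyndon.
            if is_anti_lyndon(w) != (inv and not is_bordered(w)):
                return False
            # Item 3: nonempty prefixes of inverse Lyndon words are inverse Lyndon.
            if inv and not all(is_inverse_lyndon(w[:k]) for k in range(1, n)):
                return False
    return True
```

For instance, `dabda` is an inverse Lyndon word but not an anti-Lyndon word, because it has the nonempty border `da`.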

Definition 5.

An inverse Lyndon factorization of a word $w \in \Sigma^{+}$ is a sequence $(m_1, \ldots, m_k)$ of inverse Lyndon words such that $m_1 \cdots m_k = w$ and $m_i \ll m_{i+1}$, $1 \leq i \leq k-1$.

As the following example in [10] shows, a word may have different inverse Lyndon factorizations.

Example 4.

Let $\Sigma = \{a, b, c, d\}$ with $a < b < c < d$, $z = dabdadacddbdc$. It is easy to see that $(dab, dadacd, db, dc)$, $(dabda, dac, ddbdc)$, $(dab, dadac, ddbdc)$ are all inverse Lyndon factorizations of $z$.
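The three factorizations of Example 4 can be verified against Definition 5 with a short check. This is a sketch assuming that the alphabet order is Python's character order; the function names are illustrative.

```python
def is_inverse_lyndon(w: str) -> bool:
    """True if every nonempty proper suffix of the nonempty word w is smaller than w."""
    return len(w) > 0 and all(w[i:] < w for i in range(1, len(w)))


def ll(x: str, y: str) -> bool:
    """x ≪ y: x ≺ y and x is not a proper prefix of y."""
    return x < y and not y.startswith(x)


def is_inverse_lyndon_factorization(w: str, factors: list[str]) -> bool:
    """Check the three conditions of Definition 5 for the given sequence of factors."""
    return (
        "".join(factors) == w
        and all(is_inverse_lyndon(m) for m in factors)
        and all(ll(a, b) for a, b in zip(factors, factors[1:]))
    )
```

All three sequences of Example 4 pass this check for `z = "dabdadacddbdc"`, whereas a split such as `["dabdadac", "ddbdc"]` fails because `dabdadac` is smaller than its suffix `dadac`.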

5 The border property

In this section we prove the main result of this paper, namely, for any nonempty word $w$, there exists a unique inverse Lyndon factorization of $w$ which has a special property, named the border property.

Definition 6 (Border property).

Let $w \in \Sigma^{+}$. A factorization $(m_1, \ldots, m_k)$ of $w$ has the border property if each nonempty border $z$ of $m_i$ is not a prefix of $m_{i+1}$, $1 \leq i \leq k-1$.
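Definition 6 amounts to a simple test on consecutive factors. The sketch below uses illustrative function names and represents a factorization as a list of strings.

```python
def nonempty_borders(w: str) -> list[str]:
    """Nonempty proper prefixes of w that are also suffixes of w."""
    return [w[:k] for k in range(1, len(w)) if w.endswith(w[:k])]


def has_border_property(factors: list[str]) -> bool:
    """True if no nonempty border of a factor is a prefix of the next factor."""
    return all(
        not nxt.startswith(z)
        for cur, nxt in zip(factors, factors[1:])
        for z in nonempty_borders(cur)
    )
```

Applied to the three inverse Lyndon factorizations of $z = dabdadacddbdc$ listed in Example 4, this check rejects $(dab, dadacd, db, dc)$, since the border $d$ of $dadacd$ is a prefix of $db$, and rejects $(dabda, dac, ddbdc)$, since the border $da$ of $dabda$ is a prefix of $dac$. It accepts $(dab, dadac, ddbdc)$.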

We first prove a fundamental property of the inverse Lyndon factorizations of w𝑤witalic_w which have the border property.

Lemma 2.

Let $w \in \Sigma^{+}$, let $(m_1, \ldots, m_k)$ be an inverse Lyndon factorization of $w$ having the border property. If $\alpha$ is a nonempty border of $m_j$, $1 \leq j \leq k-1$, then there exists a nonempty prefix $\beta$ of $m_{j+1}$ such that $|\beta| \leq |\alpha|$ and $\alpha \ll \beta$.

Proof.

Let wΣ+𝑤superscriptΣw\in\Sigma^{+}italic_w ∈ roman_Σ start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT, let (m1,,mk)subscript𝑚1subscript𝑚𝑘(m_{1},\ldots,m_{k})( italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) be an inverse Lyndon factorization of w𝑤witalic_w having the border property, let α𝛼\alphaitalic_α be a nonempty border of mjsubscript𝑚𝑗m_{j}italic_m start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, 1jk11𝑗𝑘11\leq j\leq k-11 ≤ italic_j ≤ italic_k - 1. We distinguish two cases: either |mj+1|<|α|subscript𝑚𝑗1𝛼|m_{j+1}|<|\alpha|| italic_m start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT | < | italic_α | or |mj+1||α|subscript𝑚𝑗1𝛼|m_{j+1}|\geq|\alpha|| italic_m start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT | ≥ | italic_α |.

Assume |mj+1|<|α|subscript𝑚𝑗1𝛼|m_{j+1}|<|\alpha|| italic_m start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT | < | italic_α |. By hypothesis (m1,,mk)subscript𝑚1subscript𝑚𝑘(m_{1},\ldots,m_{k})( italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) is an inverse Lyndon factorization, hence mjmj+1much-less-thansubscript𝑚𝑗subscript𝑚𝑗1m_{j}\ll m_{j+1}italic_m start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ≪ italic_m start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT, that is, there are r,s,tΣ𝑟𝑠𝑡superscriptΣr,s,t\in\Sigma^{*}italic_r , italic_s , italic_t ∈ roman_Σ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, a,bΣ𝑎𝑏Σa,b\in\Sigmaitalic_a , italic_b ∈ roman_Σ, such that a<b𝑎𝑏a<bitalic_a < italic_b and mj=rassubscript𝑚𝑗𝑟𝑎𝑠m_{j}=rasitalic_m start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_r italic_a italic_s, mj+1=rbtsubscript𝑚𝑗1𝑟𝑏𝑡m_{j+1}=rbtitalic_m start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT = italic_r italic_b italic_t. Obviously |ra||mj+1|<|α|𝑟𝑎subscript𝑚𝑗1𝛼|ra|\leq|m_{j+1}|<|\alpha|| italic_r italic_a | ≤ | italic_m start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT | < | italic_α |, thus there is sΣsuperscript𝑠superscriptΣs^{\prime}\in\Sigma^{*}italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ roman_Σ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT such that α=ras𝛼𝑟𝑎superscript𝑠\alpha=ras^{\prime}italic_α = italic_r italic_a italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. Consequently, α=rasrbt=mj+1𝛼𝑟𝑎superscript𝑠much-less-than𝑟𝑏𝑡subscript𝑚𝑗1\alpha=ras^{\prime}\ll rbt=m_{j+1}italic_α = italic_r italic_a italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ≪ italic_r italic_b italic_t = italic_m start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT and our claim holds with β=mj+1𝛽subscript𝑚𝑗1\beta=m_{j+1}italic_β = italic_m start_POSTSUBSCRIPT italic_j + 1 end_POSTSUBSCRIPT.

Assume $|m_{j+1}| \geq |\alpha|$. Let $\beta$ be the nonempty prefix of $m_{j+1}$ such that $|\beta| = |\alpha|$. Clearly $\beta \neq \alpha$ because $(m_1, \ldots, m_k)$ has the border property. Since $\alpha$ and $\beta$ are two different nonempty words of the same length, either $\beta \ll \alpha$ or $\alpha \ll \beta$. The first case leads to a contradiction because if $\beta \ll \alpha$ then $m_{j+1} \ll m_j$ by Lemma 1 and this contradicts the fact that $(m_1, \ldots, m_k)$ is an inverse Lyndon factorization. Thus, $\alpha \ll \beta$ and the proof is complete. ∎

Proposition 4.

For each $w \in \Sigma^{+}$, there exists a unique inverse Lyndon factorization of $w$ having the border property.

Proof.

The proof is by induction on $|w|$. If $|w| = 1$, then $F_1(w) = F_2(w) = (w)$ and the statement clearly holds. Thus assume $|w| > 1$. Let $F_1(w) = (f_1, \ldots, f_k)$ and $F_2(w) = (f'_1, \ldots, f'_v)$ be two inverse Lyndon factorizations of $w$ having the border property. Thus

$$f_1 \cdots f_k = f'_1 \cdots f'_v = w \tag{5.1}$$

If $|f_k| = |f'_v|$ and $v = 1$ or $k = 1$, clearly $f_k = f'_v$ and $F_1(w) = F_2(w)$. Analogously, if $|f_k| = |f'_v|$, $v > 1$ and $k > 1$, then $f_k = f'_v$ and $F'_1(w') = (f_1, \ldots, f_{k-1})$, $F'_2(w') = (f'_1, \ldots, f'_{v-1})$ would be two inverse Lyndon factorizations of $w'$ having the border property, where $w'$ is such that $w = w' f_k$. Of course, $|w'| < |w|$. By induction hypothesis, $F'_1(w') = F'_2(w')$, hence $F_1(w) = F_2(w)$.

By contradiction, let $|f_k| \neq |f'_v|$. Assume $|f_k| < |f'_v|$ (similar arguments apply if $|f_k| > |f'_v|$). The word $f_k$ is a proper suffix of $f'_v$. Clearly $k > 1$. Let $g$ be the smallest integer such that $f_{g+1} \cdots f_k$ is a proper suffix of $f'_v$, $1 \leq g \leq k-1$, that is,

$$f'_v = \alpha f_{g+1} \cdots f_k \tag{5.2}$$

where $\alpha \in \Sigma^{+}$ is a suffix of $f_g$.

Notice that

$$\alpha \not\ll f_{g+1} \tag{5.3}$$

Indeed, if $\alpha \ll f_{g+1}$, then, by Eq. (5.2), we would have $f'_v = \alpha f_{g+1} \cdots f_k \ll f_{g+1} \cdots f_k$, which is impossible because $f'_v$ is an inverse Lyndon word.

The word $\alpha$ is a nonempty proper suffix of $f_g$ since otherwise we would have $\alpha = f_g \ll f_{g+1}$, contrary to Eq. (5.3). Since $f_g$ is an inverse Lyndon word and $\alpha$ is a nonempty proper suffix of $f_g$, either $\alpha \leq_p f_g$ or $\alpha \ll f_g$.

If $\alpha \leq_p f_g$, then $\alpha$ is a nonempty border of $f_g$ and, by Lemma 2, there exists a nonempty prefix $\beta$ of $f_{g+1}$ such that $|\beta| \leq |\alpha|$ and $\alpha \ll \beta$. Thus, $\alpha \ll f_{g+1}$, which contradicts Eq. (5.3). Assume $\alpha \ll f_g$. Since $f_g \ll f_{g+1}$, by Lemma 1 we have $\alpha \ll f_{g+1}$, which once again contradicts Eq. (5.3). This finishes the proof. ∎
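On short words, Proposition 4 can be checked by exhaustive search. The following Python sketch is only an illustration and plays no role in the proof. It assumes the usual definition of an inverse Lyndon word (a word strictly greater than each of its nonempty proper suffixes), takes $\ll$ as in the proof of Lemma 2, and reads the border property as stated in the abstract. All function names are hypothetical.

```python
def ll(x, y):
    """x << y: at the first position where x and y differ, x has the smaller letter."""
    for a, b in zip(x, y):
        if a != b:
            return a < b
    return False  # no mismatch: one word is a prefix of the other


def is_inverse_lyndon(w):
    """Assumed definition: w is strictly greater than each nonempty proper suffix."""
    return all(w[i:] < w for i in range(1, len(w)))


def borders(w):
    """Nonempty proper prefixes of w that are also suffixes of w."""
    return [w[:i] for i in range(1, len(w)) if w[:i] == w[-i:]]


def is_inverse_lyndon_factorization(f):
    """Every factor is inverse Lyndon and consecutive factors satisfy m_j << m_{j+1}."""
    return all(is_inverse_lyndon(m) for m in f) and all(
        ll(x, y) for x, y in zip(f, f[1:])
    )


def has_border_property(f):
    """No nonempty border of a factor is a prefix of the next factor."""
    return not any(
        nxt.startswith(b) for cur, nxt in zip(f, f[1:]) for b in borders(cur)
    )


def all_factorizations(w):
    """Every way of cutting w into nonempty consecutive factors (2^(|w|-1) of them)."""
    n = len(w)
    for mask in range(1 << (n - 1)):
        cuts = [0] + [i + 1 for i in range(n - 1) if (mask >> i) & 1] + [n]
        yield tuple(w[a:b] for a, b in zip(cuts, cuts[1:]))


def border_property_factorizations(w):
    """Brute-force list of inverse Lyndon factorizations of w with the border property."""
    return [
        f
        for f in all_factorizations(w)
        if is_inverse_lyndon_factorization(f) and has_border_property(f)
    ]


# By Proposition 4 the returned list should contain exactly one factorization.
print(border_property_factorizations("dabadabdabdadac"))
```

The search is exponential in $|w|$, so it is meant only as a sanity check on small inputs.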

6 Groupings and compact factorizations

In this section we prove a structural property of an inverse Lyndon factorization having the border property, namely that it is a compact factorization. This result is crucial to characterize the relationship between $\operatorname{CFL}_{in}(w)$ and the factorization into inverse Lyndon words of $w$. First we recall the notion of grouping given in [10]. We refer to [10, 13] for a detailed and complete discussion on this topic.

Let $\operatorname{CFL}_{in}(w) = (\ell_1, \ldots, \ell_h)$, where $\ell_1 \succeq_{in} \ell_2 \succeq_{in} \ldots \succeq_{in} \ell_h$. Consider the partial order $\geq_p$, where $x \geq_p y$ if $y$ is a prefix of $x$. Recall that a chain is a set of pairwise comparable elements. We say that a chain is maximal if it is not strictly contained in any other chain. A non-increasing (maximal) chain in $\operatorname{CFL}_{in}(w)$ is the sequence corresponding to a (maximal) chain in the multiset $\{\ell_1, \ldots, \ell_h\}$ with respect to $\geq_p$. We denote by $\mathcal{PMC}$ a non-increasing maximal chain in $\operatorname{CFL}_{in}(w)$.
Looking at the definition of the (inverse) lexicographic order, it is easy to see that a $\mathcal{PMC}$ is a sequence of consecutive factors in $\operatorname{CFL}_{in}(w)$. Moreover $\operatorname{CFL}_{in}(w)$ is the concatenation of its $\mathcal{PMC}$. The formal definitions are given below.

Definition 7.

Let $w \in \Sigma^{+}$, let $\operatorname{CFL}_{in}(w) = (\ell_1, \ldots, \ell_h)$ and let $1 \leq r < s \leq h$. We say that $\ell_r, \ell_{r+1}, \ldots, \ell_s$ is a non-increasing maximal chain for the prefix order in $\operatorname{CFL}_{in}(w)$, abbreviated $\mathcal{PMC}$, if $\ell_r \geq_p \ell_{r+1} \geq_p \ldots \geq_p \ell_s$.
Moreover, if $r > 1$, then $\ell_{r-1} \not\geq_p \ell_r$, if $s < h$, then $\ell_s \not\geq_p \ell_{s+1}$. Two $\mathcal{PMC}$ $\mathcal{C}_1 = \ell_r, \ell_{r+1}, \ldots, \ell_s$, $\mathcal{C}_2 = \ell_{r'}, \ell_{r'+1}, \ldots, \ell_{s'}$ are consecutive if $r' = s+1$ (or $r = s'+1$).

Definition 8.

Let $w \in \Sigma^{+}$, let $\operatorname{CFL}_{in}(w) = (\ell_1, \ldots, \ell_h)$. We say that $(\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_s)$ is the decomposition of $\operatorname{CFL}_{in}(w)$ into its non-increasing maximal chains for the prefix order if the following holds

  • (1)

    Each $\mathcal{C}_j$ is a non-increasing maximal chain in $\operatorname{CFL}_{in}(w)$.

  • (2)

    $\mathcal{C}_j$ and $\mathcal{C}_{j+1}$ are consecutive, $1 \leq j \leq s-1$.

  • (3)

    $\operatorname{CFL}_{in}(w)$ is the concatenation of the sequences $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_s$.
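The decomposition of Definition 8 can be computed by a single left-to-right scan of $\operatorname{CFL}_{in}(w)$. The Python sketch below is an illustration under assumptions: it obtains $\operatorname{CFL}_{in}(w)$ with a Duval-style scan in which the letter comparisons are flipped (a standard technique, not one taken from this paper), and then cuts the list of factors wherever $\ell_i \not\geq_p \ell_{i+1}$. The names `cfl_in` and `pmc_decomposition` are hypothetical.

```python
def cfl_in(w):
    """Lyndon factorization of w with respect to the inverse alphabet order."""
    n, i, factors = len(w), 0, []
    while i < n:
        j, k = i + 1, i
        # Comparisons are flipped with respect to the usual Duval scan,
        # because in the inverse order larger letters come first.
        while j < n and w[k] >= w[j]:
            k = i if w[k] > w[j] else k + 1
            j += 1
        period = j - k
        # Emit every full repetition of the factor found by the scan.
        while i <= k:
            factors.append(w[i:i + period])
            i += period
    return factors


def pmc_decomposition(factors):
    """Split the factors into maximal runs l_r >=_p l_{r+1} >=_p ... >=_p l_s."""
    chains = []
    for f in factors:
        # x >=_p y means that y is a prefix of x; extend the run while this holds.
        if chains and chains[-1][-1].startswith(f):
            chains[-1].append(f)
        else:
            chains.append([f])
    return chains


factors = cfl_in("dabadabdabdadac")
print(factors)
print(pmc_decomposition(factors))
```

Comparing only adjacent factors suffices here because, within a run, each factor is a prefix of the one before it.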

Definition 9.

Let $w \in \Sigma^{+}$. We say that $(m_1, \ldots, m_k)$ is a grouping of $\operatorname{CFL}_{in}(w)$ if the following holds

  • (1)

    $(m_1, \ldots, m_k)$ is an inverse Lyndon factorization of $w$

  • (2)

    Each factor $m_j$ is the product of consecutive factors in a $\mathcal{PMC}$ in $\operatorname{CFL}_{in}(w)$.

Example 5.

Let $\Sigma = \{a, b, c, d\}$, $a < b < c < d$, and $w = dabadabdabdadac$. We have $\operatorname{CFL}_{in}(w) = (daba, dab, dab, dadac)$. The decomposition of $\operatorname{CFL}_{in}(w)$ into its $\mathcal{PMC}$ is $((daba, dab, dab), (dadac))$. Moreover, $(daba, dabdab, dadac)$ is a grouping of $\operatorname{CFL}_{in}(w)$ but for the inverse Lyndon factorization $(dabadab, dabda, dac)$ this is no longer true.

Next, let $y = dabadabdabdabdadac$. We have $\operatorname{CFL}_{in}(y) = (daba, dab, dab, dab, dadac)$. The decomposition of $\operatorname{CFL}_{in}(y)$ into its $\mathcal{PMC}$ is $((daba, dab, dab, dab), (dadac))$. Moreover, $(daba, (dab)^3, dadac)$ and $(dabadab, (dab)^2, dadac)$ are two groupings of $\operatorname{CFL}_{in}(y)$.
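The claims of Example 5 can be checked mechanically. The following Python sketch tests the two conditions of Definition 9 on the factorizations above. It is an illustration only: the chains are hard-coded from the example, an inverse Lyndon word is assumed to be a word strictly greater than each of its nonempty proper suffixes, and the function names are hypothetical.

```python
def ll(x, y):
    """x << y: at the first position where x and y differ, x has the smaller letter."""
    for a, b in zip(x, y):
        if a != b:
            return a < b
    return False  # one word is a prefix of the other


def is_inverse_lyndon(w):
    """Assumed definition: w is strictly greater than each nonempty proper suffix."""
    return all(w[i:] < w for i in range(1, len(w)))


def is_inverse_lyndon_factorization(f):
    return all(is_inverse_lyndon(m) for m in f) and all(
        ll(x, y) for x, y in zip(f, f[1:])
    )


def is_grouping(f, chains):
    """f: candidate factorization; chains: the PMC decomposition of CFL_in(w)."""
    if "".join(f) != "".join("".join(c) for c in chains):
        return False  # f does not factorize the same word
    if not is_inverse_lyndon_factorization(f):
        return False  # condition (1) of Definition 9 fails
    # Tag every CFL_in factor with the index of the chain it belongs to.
    tagged = [(x, c) for c, chain in enumerate(chains) for x in chain]
    pos = 0
    for m in f:
        built, used = "", set()
        # Consume consecutive CFL_in factors until the length of m is reached.
        while len(built) < len(m) and pos < len(tagged):
            built += tagged[pos][0]
            used.add(tagged[pos][1])
            pos += 1
        # Condition (2): m is exactly a product of factors from a single chain.
        if built != m or len(used) != 1:
            return False
    return True


chains_w = [["daba", "dab", "dab"], ["dadac"]]
chains_y = [["daba", "dab", "dab", "dab"], ["dadac"]]
print(is_grouping(("daba", "dabdab", "dadac"), chains_w))
print(is_grouping(("dabadab", "dabda", "dac"), chains_w))
print(is_grouping(("daba", "dabdabdab", "dadac"), chains_y))
print(is_grouping(("dabadab", "dabdab", "dadac"), chains_y))
```

In the second call the factor $dabda$ ends inside the factor $dadac$ of $\operatorname{CFL}_{in}(w)$, which is why that factorization is not a grouping.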

For our aims, we need to consider the words that are concatenations of equal factors in $\operatorname{CFL}_{in}$. This approach leads to a refinement of the partition of $\operatorname{CFL}_{in}$ into non-increasing maximal chains for the prefix order, as defined below.

Definition 10 (Compact sequences).

Let $\mathcal{C} = (\ell_1, \ldots, \ell_h)$ be a non-increasing maximal chain for the prefix order in $\operatorname{CFL}_{in}(w)$. The decomposition of $\mathcal{C}$ into maximal compact sequences is the sequence $(\mathcal{G}_1, \ldots, \mathcal{G}_n)$ such that

  • (1)

    $\mathcal{C} = (\mathcal{G}_1, \ldots, \mathcal{G}_n)$

  • (2)

    For every $i$, $1 \leq i \leq n$, $\mathcal{G}_i$ consists of the longest sequence of consecutive identical elements in $\mathcal{C}$

Let $(\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_s)$ be the decomposition of $\operatorname{CFL}_{in}(w)$ into its non-increasing maximal chains for the prefix order. The decomposition of $\operatorname{CFL}_{in}(w)$ into its maximal compact sequences is obtained by replacing each $\mathcal{C}_j$ in $(\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_s)$ with its decomposition into maximal compact sequences.

Definition 11 (Compact factor).

Let $(\mathcal{G}_1, \ldots, \mathcal{G}_n)$ be the decomposition of $\operatorname{CFL}_{in}(w)$ into its maximal compact sequences. For every $i$, $1 \leq i \leq n$, the concatenation $g_i$ of the elements in $\mathcal{G}_i$ is a compact factor in $\operatorname{CFL}_{in}(w)$.

Definition 12 (Compact factorization).

Let $w \in \Sigma^{+}$. We say that $(m_1, \ldots, m_k)$ is a compact factorization of $w$ if $(m_1, \ldots, m_k)$ is an inverse Lyndon factorization of $w$ and each $m_j$, $1 \leq j \leq k$, is a concatenation of compact factors in $\operatorname{CFL}_{in}(w)$.

Example 6.

Consider again $y = dabadabdabdabdadac$ over $\Sigma = \{a, b, c, d\}$, $a < b < c < d$, as in Example 5. The decomposition of $\operatorname{CFL}_{in}(y) = (daba, dab, dab, dab, dadac)$ into its maximal compact sequences is $((daba), (dab, dab, dab), (dadac))$. The compact factors in $\operatorname{CFL}_{in}(y)$ are $daba$, $(dab)^3$, $dadac$.
Moreover, $(daba, (dab)^3, dadac)$ is a compact factorization whereas $(dabadab, (dab)^2, dadac)$ is a grouping of $\operatorname{CFL}_{in}(y)$ which is not a compact factorization.
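Definitions 10 to 12 translate directly into a short computation, sketched below in Python on the word $y$ of this example. The sketch is an illustration under assumptions: the chains are hard-coded, an inverse Lyndon word is assumed to be a word strictly greater than each of its nonempty proper suffixes, and the function names are hypothetical. Compact factors are obtained by merging runs of identical consecutive factors inside each chain. A factorization is then compact when it is an inverse Lyndon factorization whose cut positions all fall on compact factor boundaries.

```python
from itertools import groupby


def ll(x, y):
    """x << y: at the first position where x and y differ, x has the smaller letter."""
    for a, b in zip(x, y):
        if a != b:
            return a < b
    return False  # one word is a prefix of the other


def is_inverse_lyndon(w):
    """Assumed definition: w is strictly greater than each nonempty proper suffix."""
    return all(w[i:] < w for i in range(1, len(w)))


def is_inverse_lyndon_factorization(f):
    return all(is_inverse_lyndon(m) for m in f) and all(
        ll(x, y) for x, y in zip(f, f[1:])
    )


def compact_factors(chains):
    """Concatenate each maximal run of identical consecutive factors in each chain."""
    return ["".join(run) for chain in chains for _, run in groupby(chain)]


def is_compact_factorization(f, factors):
    """f: candidate factorization; factors: compact factors of CFL_in(w), in order."""
    if "".join(f) != "".join(factors) or not is_inverse_lyndon_factorization(f):
        return False
    # Positions in w at which some compact factor ends.
    ends, total = set(), 0
    for g in factors:
        total += len(g)
        ends.add(total)
    # Every factor of f must end on a compact factor boundary.
    total = 0
    for m in f:
        total += len(m)
        if total not in ends:
            return False
    return True


chains_y = [["daba", "dab", "dab", "dab"], ["dadac"]]
g = compact_factors(chains_y)
print(g)
print(is_compact_factorization(("daba", "dabdabdab", "dadac"), g))
print(is_compact_factorization(("dabadab", "dabdab", "dadac"), g))
```

The second factorization fails because its first factor $dabadab$ ends inside the compact factor $(dab)^3$.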

Proposition 5.

Let $w \in \Sigma^{+}$. If $(m_1, \ldots, m_k)$ is an inverse Lyndon factorization of $w$ having the border property, then $(m_1, \ldots, m_k)$ is a compact factorization of $w$.

Proof.

Let $w \in \Sigma^{+}$, let $(m_1, \ldots, m_k)$ be an inverse Lyndon factorization of $w$ having the border property. Let $\operatorname{CFL}_{in}(w) = (\ell_1, \ldots, \ell_h)$, where $\ell_1 \succeq_{in} \ell_2 \succeq_{in} \ldots \succeq_{in} \ell_h$ and $\ell_1, \ldots, \ell_h$ are anti-Lyndon words. First we prove that $(m_1, \ldots, m_k)$ is a grouping of $\operatorname{CFL}_{in}(w)$ by induction on $|w|$. If $|w| = 1$ the statement clearly holds, thus assume $|w| > 1$.

The words $m_1$ and $\ell_1$ are comparable for the prefix order, hence either $m_1$ is a proper prefix of $\ell_1$ or $\ell_1$ is a prefix of $m_1$. Suppose that $m_1$ is a proper prefix of $\ell_1$. Thus, there are $j$, $1 < j \leq k$, and $x, y \in \Sigma^{*}$, $x \neq 1$, such that $m_j = xy$ and $\ell_1 = m_1 \cdots m_{j-1} x$. Necessarily $j = 2$, because otherwise $m_1 \ll m_{j-1}$, hence, by Lemma 1, $\ell_1 \ll m_{j-1} x$ and this contradicts the fact that $\ell_1$ is an anti-Lyndon word. In conclusion, $\ell_1 = m_1 x$ and $m_2 = xy$. We know that $m_1 \ll m_2$, that is, there are $r, s, t \in \Sigma^{*}$, $a, b \in \Sigma$, such that $a < b$ and $m_1 = ras$, $m_2 = rbt = xy$. If $|x| \leq |r|$, then $x$ is a prefix of $r$ and hence a nonempty border of $\ell_1 = rasx$. If $|x| > |r|$, then there is a word $t'$ such that $x = rbt'$, which implies $\ell_1 \ll x$. Both cases again contradict the fact that $\ell_1$ is an anti-Lyndon word.

Therefore, $\ell_1$ is a prefix of $m_1$. Let $i$ be the largest integer such that $m_1 = \ell_1 \cdots \ell_{i-1} x$, $x, y \in \Sigma^{*}$, $\ell_i = xy$, $1 < i \leq h$, $y \neq 1$. Let $(\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_s)$ be the decomposition of $\operatorname{CFL}_{in}(w)$ into its non-increasing maximal chains for the prefix order. We claim that $\ell_1 \cdots \ell_{i-1}$ is a prefix of the concatenation of the elements of $\mathcal{C}_1$, thus $(\ell_1, \ldots, \ell_{i-1})$ is a chain for the prefix order. If $i = 1$ we are done. Let $i > 1$. By contradiction, assume that there is $j$, $1 < j < i$, such that $\ell_j \notin \mathcal{C}_1$. Therefore, $\ell_1 \ll \ell_j$, which implies $m_1 \ll \ell_j \cdots \ell_{i-1} x$, and this contradicts the fact that $m_1$ is an inverse Lyndon word.

We now prove that $x = 1$. Assume $x \neq 1$. As a preliminary step, we prove that there is no nonempty prefix $\beta$ of $m_2$ such that $|\beta| \leq |x|$ and $x \ll \beta$. In fact, if such a prefix existed, there would be $r, s, t \in \Sigma^{*}$, $a, b \in \Sigma$, such that $a < b$ and $x = ras$, $\beta = rbt$. If $|\ell_i| \leq |xr|$, then $\ell_i = xr' = rasr'$, where $r'$ would be a nonempty prefix of $r$, thus a nonempty border of $\ell_i$ (recall that $\ell_i = xy$ with $y \neq 1$). If $|\ell_i| > |xr|$, then there would be a word $t'$ such that $\ell_i = rasrbt'$, which would imply $\ell_i \ll rbt'$. Both cases contradict the fact that $\ell_i$ is an anti-Lyndon word.

If $x \neq 1$, then either $\ell_i$ is a prefix of $\ell_1$ or $\ell_1 \ll \ell_i$. If $\ell_i$ were a prefix of $\ell_1$, then $x$ would be a nonempty border of $m_1$. By Lemma 2 there would exist a nonempty prefix $\beta$ of $m_2$ such that $|\beta| \leq |x|$ and $x \ll \beta$, which contradicts our preliminary step.

If it were true that $\ell_1 \ll \ell_i$, then there would be $r, s, t \in \Sigma^{*}$, $a, b \in \Sigma$, such that $a < b$ and $\ell_1 = ras$, $\ell_i = rbt = xy$. If $|x| > |r|$, then there would be a word $t'$ such that $x = rbt'$, which would imply $m_1 \ll x$, and this contradicts the fact that $m_1$ is an inverse Lyndon word. If $|x| \leq |r|$, then $x$ is a prefix of $r$ and is a nonempty border of $m_1$. By Lemma 2 again, there would exist a nonempty prefix $\beta$ of $m_2$ such that $|\beta| \leq |x|$ and $x \ll \beta$, which again contradicts our preliminary step.

Let $w' \in \Sigma^{*}$ be such that $w = m_1 w'$. If $w' = 1$ we are done. Assume $w' \neq 1$. Clearly $|w'| < |w|$. Of course $(m_2, \ldots, m_k)$ is an inverse Lyndon factorization of $w'$ having the border property. Moreover, by Corollary 1, $\operatorname{CFL}_{in}(w') = (\ell_i, \ldots, \ell_h)$ and $(\mathcal{C}'_1, \mathcal{C}_2, \ldots, \mathcal{C}_s)$ is the decomposition of $\operatorname{CFL}_{in}(w')$ into its non-increasing maximal chains for the prefix order, where $\mathcal{C}'_1$ is defined by $\mathcal{C}_1 = (\ell_1, \ldots, \ell_{i-1}, \mathcal{C}'_1)$. By induction hypothesis, $(m_2, \ldots, m_k)$ is a grouping of $\operatorname{CFL}_{in}(w')$ and consequently $(m_1, \ldots, m_k)$ is a grouping of $\operatorname{CFL}_{in}(w)$.

Finally, to obtain a contradiction, suppose that $(m_1, \ldots, m_k)$ is a grouping of $\operatorname{CFL}_{in}(w)$ having the border property such that $(m_1, \ldots, m_k)$ is not a compact factorization of $w$. To adapt the notation to the proof, set $\operatorname{CFL}_{in}(w) = (\ell_1^{n_1}, \ldots, \ell_r^{n_r})$, where $r > 0$, $n_1, \ldots, n_r \geq 1$ and $\ell_1, \ldots, \ell_r$ are anti-Lyndon words. By Definitions 9 and 12, there exist integers $j, h, p_h, q_h$, $1 \leq j \leq k-1$, $1 \leq h \leq r$, $p_h \geq 1$, $q_h \geq 1$, $p_h + q_h \leq n_h$, such that $m_j$ ends with $\ell_h^{p_h}$ and $m_{j+1}$ starts with $\ell_h^{q_h}$. Thus, by Definition 9, $\ell_h$ is a prefix of $m_j$. Moreover, $\ell_h$ is a proper prefix of $m_j$. Indeed, otherwise $\ell_h = m_j \leq_p m_{j+1}$, which is impossible because $m_j \ll m_{j+1}$ (recall that $(m_1, \ldots, m_k)$ is an inverse Lyndon factorization). Thus $\ell_h$ is a nonempty border of $m_j$. The word $\ell_h$ is also a prefix of $m_{j+1}$, and this contradicts the fact that $(m_1, \ldots, m_k)$ has the border property. ∎
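The border property used throughout this proof can be checked directly from its definition. The following Python sketch is illustrative only: the helper names `nonempty_borders` and `has_border_property` are hypothetical, and the sample factorizations are merely two ways of splitting one word, chosen for illustration. The check does not verify that the factors are inverse Lyndon words.

```python
def nonempty_borders(w):
    """Return all nonempty borders of w (proper prefixes that are also suffixes),
    listed by increasing length."""
    return [w[:k] for k in range(1, len(w)) if w[:k] == w[-k:]]


def has_border_property(factors):
    """Check that no nonempty border of a factor is a prefix of the next factor."""
    for current, following in zip(factors, factors[1:]):
        if any(following.startswith(b) for b in nonempty_borders(current)):
            return False
    return True


# Two splittings of the same word, used only as sample input.
print(has_border_property(["daba", "dabdab", "dadac"]))
print(has_border_property(["dabadab", "dabdadac"]))
```

In the second splitting, `dab` is a border of the first factor and a prefix of the second one, so the check fails there.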

7 The canonical inverse Lyndon factorization: the algorithm

In this section we state another relevant result of the paper related to the main one stated in Section 5. We have shown that a nonempty word $w$ can have more than one inverse Lyndon factorization but $w$ has a unique inverse Lyndon factorization with the border property (Example 4, Proposition 4). Below we highlight that this unique factorization is the canonical one defined in [10, 13].

This special inverse Lyndon factorization is denoted by $\operatorname{ICFL}$ because it is the counterpart of the Lyndon factorization $\operatorname{CFL}$ of $w$, when we use (I)nverse Lyndon words as factors. Indeed, in [10] it has been proved that $\operatorname{ICFL}(w)$ can be computed in linear time and it is uniquely determined for a word $w$. See Section A for definitions of $\operatorname{ICFL}$ and all related notions. Since $\operatorname{ICFL}(w)$ is the unique inverse Lyndon factorization with the border property, from now on these two notions will be synonymous.

Below we show another interesting property of $\operatorname{ICFL}$: the last factor of the factorization is the longest suffix that is an inverse Lyndon word. Based on this result we provide a new, simpler linear-time algorithm for computing $\operatorname{ICFL}$.
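The notion of "longest suffix that is an inverse Lyndon word" can be explored with a brute-force sketch. It assumes the usual definition, not restated in this section, that a nonempty word is an inverse Lyndon word when it is strictly greater, in the lexicographic order, than each of its nonempty proper suffixes. The function names are hypothetical, the alphabet order is taken to be the character order of Python strings, and the running time is far from linear; this is not the algorithm of the paper.

```python
def is_inverse_lyndon(w):
    """True if w is nonempty and strictly greater than every nonempty proper suffix.
    Python compares strings lexicographically, a proper prefix being smaller."""
    return len(w) > 0 and all(w[i:] < w for i in range(1, len(w)))


def longest_inverse_lyndon_suffix(w):
    """Scan suffixes from longest to shortest and return the first inverse Lyndon one.
    A single letter is always an inverse Lyndon word, so a nonempty w always succeeds."""
    for i in range(len(w)):
        if is_inverse_lyndon(w[i:]):
            return w[i:]
    return ""


print(is_inverse_lyndon("dabdab"))    # a bordered word
print(is_inverse_lyndon("dabdadac"))
print(longest_inverse_lyndon_suffix("dabadabdabdadac"))
```

The first call shows that, unlike Lyndon words, an inverse Lyndon word may be bordered.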

We begin by recalling previously proved results on $\operatorname{ICFL}$, namely Proposition 7.7 in [10] and Proposition 9.5 in [13]. They are merged into Proposition 6.

Proposition 6.

For any $w \in \Sigma^{+}$, $\operatorname{ICFL}(w)$ is a grouping of $\operatorname{CFL}_{in}(w)$. Moreover, $\operatorname{ICFL}(w)$ has the border property.

Corollary 2 is a direct consequence of Propositions 4, 5 and 6.

Corollary 2.

For each $w \in \Sigma^{+}$, $\operatorname{ICFL}(w)$ is a compact factorization and it is the unique inverse Lyndon factorization of $w$ having the border property.

We end the section with a result which has been proved in [25] and which will be used in the next section.

Proposition 7.

Let $w \in \Sigma^{+}$, let $\operatorname{CFL}_{in}(w) = (\ell_1, \ldots, \ell_h)$ and let $(\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_s)$ be the decomposition of $\operatorname{CFL}_{in}(w)$ into its non-increasing maximal chains for the prefix order. Let $w_1, \ldots, w_s$ be words such that $\operatorname{CFL}_{in}(w_j) = \mathcal{C}_j$, $1 \leq j \leq s$. Then $\operatorname{ICFL}(w)$ is the concatenation of the sequences $\operatorname{ICFL}(w_1), \ldots, \operatorname{ICFL}(w_s)$, that is,

$$\operatorname{ICFL}(w) = (\operatorname{ICFL}(w_1), \ldots, \operatorname{ICFL}(w_s)) \tag{7.1}$$
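The objects appearing in Proposition 7 can be computed with a short sketch. It assumes that $\operatorname{CFL}_{in}(w)$ is obtained by running Duval's classical Lyndon factorization algorithm with every letter comparison reversed, and it reads "non-increasing maximal chains for the prefix order" as maximal runs of consecutive factors in which each factor is a prefix of the preceding one. The names `cfl_in` and `maximal_prefix_chains` are hypothetical, and the alphabet order is the character order of Python strings.

```python
def cfl_in(w):
    """Lyndon factorization of w for the inverse lexicographic order.
    This is Duval's scheme with the letter comparisons reversed."""
    factors, i, n = [], 0, len(w)
    while i < n:
        j, k = i + 1, i
        # Scan forward while w[j] does not exceed w[k] in the usual order.
        while j < n and w[k] >= w[j]:
            # A strictly smaller letter restarts the period; an equal one extends it.
            k = i if w[k] > w[j] else k + 1
            j += 1
        period = j - k
        # Emit every complete repetition of the factor of length `period`.
        while i <= k:
            factors.append(w[i:i + period])
            i += period
    return factors


def maximal_prefix_chains(factors):
    """Split the factor list into maximal runs where each factor is a prefix
    of the factor immediately before it."""
    chains = []
    for f in factors:
        if chains and chains[-1][-1].startswith(f):
            chains[-1].append(f)
        else:
            chains.append([f])
    return chains


factors = cfl_in("dabadabdabdadac")
print(factors)
print(maximal_prefix_chains(factors))
```

Concatenating the factors of each chain gives the words $w_1, \ldots, w_s$ of Proposition 7, so that (7.1) reduces the computation to one chain at a time.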

We can now state some results useful to prove the correctness of our algorithm. First we observe that, thanks to Corollary 2 and Proposition 7, to compute $\operatorname{ICFL}$ we can limit ourselves to the case in which $\operatorname{CFL}_{in}$ is a chain with respect to the prefix order.

Lemma 3.

Let $\ell_1, \ldots, \ell_h$ be anti-Lyndon words over $\Sigma$ that form a non-increasing chain for the prefix order, that is, $\ell_1 \geq_p \ell_2 \geq_p \ldots \geq_p \ell_h$. If $\ell_1 \neq \ell_2$, then $\ell_1 \not<_p \ell_2 \cdots \ell_h$.

Proof.

By contradiction, assume that $\ell_1$ is a prefix of $\ell_2 \cdots \ell_h$. Then, $\ell_1 = \ell_2 \cdots \ell_t z$ where either $z = 1$ and $2 < t \leq h$ or $z$ is a nonempty prefix of $\ell_{t+1}$, $2 \leq t < h$. Thus either $\ell_t$ or $z$ is a nonempty border of $\ell_1$, a contradiction in both cases. ∎

Remark 1.

[13] Let $x, y$ be two different borders of the same word $w \in \Sigma^{+}$. If $x$ is shorter than $y$, then $x$ is a border of $y$.
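Remark 1 can be checked on concrete words with a few lines of Python. This is only an illustrative sketch: the helper names are hypothetical and the sample word is arbitrary.

```python
def nonempty_borders(w):
    """Nonempty borders of w, listed by increasing length."""
    return [w[:k] for k in range(1, len(w)) if w[:k] == w[-k:]]


def is_border(x, y):
    """True if x is a nonempty proper prefix of y that is also a suffix of y."""
    return 0 < len(x) < len(y) and y.startswith(x) and y.endswith(x)


def shorter_borders_nest(w):
    """Check that every border of w is a border of each longer border of w."""
    bs = nonempty_borders(w)
    return all(is_border(x, y) for i, x in enumerate(bs) for y in bs[i + 1:])


print(nonempty_borders("aabaabaa"))
print(shorter_borders_nest("aabaabaa"))
```

In particular, the shortest nonempty border of a word is a prefix of every other nonempty border, which is how the remark is used in the proof of Proposition 8.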

Proposition 8.

Let $w \in \Sigma^{+}$ and assume that $\operatorname{CFL}_{in}(w)$ forms a non-increasing chain for the prefix order. If $(m_1, \ldots, m_k)$ is a factorization of $w$ such that each $m_j$, $1 \leq j \leq k$, is a concatenation of compact factors in $\operatorname{CFL}_{in}(w)$, then $(m_1, \ldots, m_k)$ has the border property.

Proof.

Let $w \in \Sigma^{+}$ and assume that $\operatorname{CFL}_{in}(w)$ forms a non-increasing chain for the prefix order. Let $(m_1, \ldots, m_k)$ be a factorization of $w$ such that each $m_j$, $1 \leq j \leq k$, is a concatenation of compact factors in $\operatorname{CFL}_{in}(w)$. The proof is by induction on $k$. If $k = 1$, then the conclusion follows immediately. Assume $k > 1$.

Let $w' \in \Sigma^{+}$ be such that $w = m_1 w'$. It is clear that $(m_2, \ldots, m_k)$ is a factorization of $w'$ such that each $m_j$, $2 \leq j \leq k$, is a concatenation of compact factors in $\operatorname{CFL}_{in}(w')$. Thus, by induction hypothesis, $(m_2, \ldots, m_k)$ has the border property. It remains to prove that each nonempty border of $m_1$ is not a prefix of $m_2$. The proof is straightforward if $m_1$ is unbordered, thus assume that $m_1$ is bordered.

Let $\operatorname{CFL}_{in}(w) = (\ell_1^{n_1}, \ldots, \ell_r^{n_r})$, where $\ell_1^{n_1}, \ldots, \ell_r^{n_r}$ are the compact factors in $\operatorname{CFL}_{in}(w)$, that is, $\ell_1, \ldots, \ell_r$ are anti-Lyndon words such that $\ell_1 \geq_p \ldots \geq_p \ell_r$. Since $m_1$ is a concatenation of compact factors in $\operatorname{CFL}_{in}(w)$, there is $h$, $1 \leq h < r$, such that

$$m_1 = \ell_1^{n_1} \cdots \ell_h^{n_h}$$

Notice that $\ell_{h}$ is a nonempty border of $m_{1}$. Furthermore, since $\ell_{h}$ is unbordered, $\ell_{h}$ is the shortest nonempty border of $m_{1}$.

If there were a word $z$ which is a nonempty border of $m_{1}$ and also a prefix of $m_{2}$, then, by Remark 1, $\ell_{h}$ would be a prefix of $m_{2}$. Therefore, $\ell_{h}$ would be a prefix of the word $\ell_{h+1}^{n_{h+1}}\cdots\ell_{r}^{n_{r}}$, which contradicts Lemma 3. ∎
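The border property invoked in this argument can be checked mechanically on small inputs. The following Python sketch is only illustrative: the helper names `nonempty_borders` and `has_border_property` are ours, and a border is assumed to be a nonempty proper prefix of a word that is also a suffix of it.

```python
def nonempty_borders(m: str) -> list[str]:
    """Nonempty proper prefixes of m that are also suffixes of m."""
    return [m[:n] for n in range(1, len(m)) if m[:n] == m[-n:]]


def has_border_property(factors: list[str]) -> bool:
    """True iff no nonempty border of a factor is a prefix of the next factor."""
    return all(
        not nxt.startswith(z)
        for m, nxt in zip(factors, factors[1:])
        for z in nonempty_borders(m)
    )
```

For instance, the sequence `["daba", "dabdab", "dadac"]` passes the check (the only nonempty border is `dab`, which is not a prefix of `dadac`), whereas `["dabda", "dab"]` fails because the border `da` of `dabda` is a prefix of `dab`.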

Proposition 9.

Let $w\in\Sigma^{+}$ and let $\operatorname{ICFL}(w)=(m_{1},\ldots,m_{k})$ be the unique inverse Lyndon factorization of $w$ having the border property. Then $m_{k}$ is the longest suffix of $w$ which is an inverse Lyndon word.

Proof.

Let $w\in\Sigma^{+}$ and let $(m_{1},\dots,m_{k})$ be the unique inverse Lyndon factorization of $w$ having the border property. If $k=1$ we are done. Thus suppose $k>1$. By contradiction, suppose that $m_{k}$ is not the longest suffix of $w$ that is an inverse Lyndon word. Let $s$ be this longest suffix. Thus, there exists a nonempty suffix $x$ of $m_{j}$, $1\leq j<k$, such that $s=xm_{j+1}\cdots m_{k}$. Furthermore, $x$ must be a proper suffix of $m_{j}$, or we would have $s=m_{j}\cdots m_{k}\ll m_{j+1}\cdots m_{k}$, contradicting the hypothesis that $s$ is inverse Lyndon.

We claim that $x\ll m_{j+1}$. Indeed, since $m_{j}$ is an inverse Lyndon word, it holds that $x\preceq m_{j}$. Thus, if $x\ll m_{j}$ or $x=m_{j}$, it immediately follows that $x\ll m_{j+1}$. Otherwise, $x\leq_{p}m_{j}$ and $x$ is a nonempty border of $m_{j}$. By Lemma 2 applied to $(m_{1},\dots,m_{k})$, with $x=\alpha$, there must exist a prefix $\beta$ of $m_{j+1}$ such that $x\ll\beta$, hence $x\ll m_{j+1}$.

Since $x\ll m_{j+1}$, we have $s=xm_{j+1}\cdots m_{k}\ll m_{j+1}\cdots m_{k}$, contradicting the hypothesis that $s$ is an inverse Lyndon word. ∎
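Proposition 9 can be illustrated by a brute-force search for the longest suffix that is an inverse Lyndon word. The sketch below is not part of the paper: it assumes the definition of inverse Lyndon word from [10] (a word that is strictly greater than each of its nonempty proper suffixes) and assumes that the alphabet order coincides with Python's built-in string order. The function names are hypothetical.

```python
def is_inverse_lyndon(w: str) -> bool:
    """Assumed definition: every nonempty proper suffix s of w satisfies s < w.

    Python compares strings lexicographically and treats a proper prefix as
    smaller, which matches the order used for this definition.
    """
    return all(w[i:] < w for i in range(1, len(w)))


def longest_inverse_lyndon_suffix(w: str) -> str:
    """Scan suffixes from longest to shortest (naive, for illustration only)."""
    for i in range(len(w)):
        if is_inverse_lyndon(w[i:]):
            return w[i:]
    return ""  # only reached for the empty word
```

On the word `dabadabdabdadac`, the search returns `dadac`; by Proposition 9 this suffix is the last factor of the factorization with the border property.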

Proposition 10.

Let $w\in\Sigma^{+}$ be an inverse Lyndon word, and let $\ell\in\Sigma^{+}$ be an anti-Lyndon word. Then:

  1. If $\ell\ll w$, then for every $k\geq 1$, $\ell^{k}w$ is not an inverse Lyndon word.

  2. If $\ell w$ is not an inverse Lyndon word, then $\ell\ll w$. Furthermore, for every $k\geq 1$, $w$ is the longest suffix of $\ell^{k}w$ that is an inverse Lyndon word.

Proof.

By Lemma 1, the proof of item 1 is immediate. Suppose $\ell w$ is not inverse Lyndon. Then, there exists a proper suffix $s$ of $\ell w$ such that $\ell w\preceq s$, hence $\ell w\ll s$. Since $\ell$ is anti-Lyndon, for every proper suffix $x$ of $\ell$ it follows that $x\ll\ell$ and consequently $xw\ll\ell w$. Thus, $s$ must be a suffix of $w$. Since $w$ is an inverse Lyndon word, one of the following three cases holds: (1) $w=s$; (2) $s<_{p}w$; (3) $s\ll w$. By $\ell w\ll s$, in each of the three cases it is evident that $\ell w\ll w$. Thus there are $r,t,t'\in\Sigma^{*}$ and $a,b\in\Sigma$ with $a<b$ such that $\ell w=rat$, $w=rbt'$. If $|\ell|\geq|ra|$, then clearly $\ell\ll w$. Otherwise, $|\ell|\leq|r|$ and there is $r'\in\Sigma^{*}$ such that $r=\ell r'$. Consequently, $w$ starts with $r'a$. On the other hand, $r'$ is a border of $r$, hence $w=\ell r'bt'$ and $r'bt'$ is a suffix of $w$. This contradicts the fact that $w$ is an inverse Lyndon word.

For every $k\geq 1$, $w$ is a suffix of $\ell^{k}w$ that is an inverse Lyndon word. Let $x$ be a proper nonempty suffix of $\ell$. Of course $x\ll\ell$. The word $xw$ is not an inverse Lyndon word, otherwise we would have $\ell\ll w\preceq xw\ll\ell w$, a contradiction. Moreover, by Lemma 1, for any $j$, $1\leq j<k$, we have $x\ell^{j}w\ll\ell^{j}w$ and $x\ell^{j}w$ is not an inverse Lyndon word. Finally, by item 1, $\ell^{k}w$ is not an inverse Lyndon word. ∎
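As a sanity check, both items of Proposition 10 can be tested exhaustively on short words. The sketch below is ours and rests on two assumptions that are not restated in this section: an inverse Lyndon word is strictly greater than each of its nonempty proper suffixes, and an anti-Lyndon word $\ell$ satisfies $x\ll\ell$ for every nonempty proper suffix $x$ (the property used in the proof above). The alphabet, the length bound and the bound on $k$ are arbitrary choices.

```python
from itertools import product


def strictly_less(u: str, v: str) -> bool:
    """u << v: at the first position where u and v differ, u has the smaller letter."""
    for a, b in zip(u, v):
        if a != b:
            return a < b
    return False  # one word is a prefix of the other


def is_inverse_lyndon(w: str) -> bool:
    return all(w[i:] < w for i in range(1, len(w)))


def is_anti_lyndon(w: str) -> bool:
    return all(strictly_less(w[i:], w) for i in range(1, len(w)))


def longest_inverse_lyndon_suffix(w: str) -> str:
    # A single letter is inverse Lyndon, so the search succeeds for nonempty w.
    return next(w[i:] for i in range(len(w)) if is_inverse_lyndon(w[i:]))


def words(alphabet: str, max_len: int):
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def check_proposition_10(alphabet: str = "abc", max_len: int = 4, max_k: int = 3) -> bool:
    """Return False as soon as a counterexample to item 1 or item 2 is found."""
    ks = range(1, max_k + 1)
    for w in filter(is_inverse_lyndon, words(alphabet, max_len)):
        for l in filter(is_anti_lyndon, words(alphabet, max_len)):
            # Item 1: l << w implies l^k w is not inverse Lyndon.
            if strictly_less(l, w) and any(is_inverse_lyndon(l * k + w) for k in ks):
                return False
            # Item 2: if l w is not inverse Lyndon, then l << w and
            # w is the longest inverse Lyndon suffix of l^k w.
            if not is_inverse_lyndon(l + w):
                if not strictly_less(l, w):
                    return False
                if any(longest_inverse_lyndon_suffix(l * k + w) != w for k in ks):
                    return False
    return True
```

Such a check is of course no substitute for the proof; it only guards against misreading the statement when implementing it.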

Algorithm 1 Compute $\operatorname{ICFL}(w)$, the unique compact factorization of $w$ having the border property.
1: function Factorize($w$)
2:     $(\ell_{1}^{e_{1}},\dots,\ell_{n}^{e_{n}})\leftarrow$ CompactFactors($w$) ▷ Compute compact factors of $w$
3:     $\mathcal{F}\leftarrow\varnothing$
4:     $m'\leftarrow\ell_{n}^{e_{n}}$
5:     for $t=n-1$ downto $1$ do ▷ Work one compact factor at a time
6:         if $\ell_{t}\ll m'$ then ▷ Proposition 10
7:             $\mathcal{F}\leftarrow(m',\mathcal{F})$
8:             $m'\leftarrow\ell_{t}^{e_{t}}$
9:         else
10:             $m'\leftarrow\ell_{t}^{e_{t}}\cdot m'$
11:     $\mathcal{F}\leftarrow(m',\mathcal{F})$
12:     return $\mathcal{F}$

We now describe Algorithm 1. Function Factorize($w$) computes the unique compact factorization of $w$ having the border property. First, at line 2, the decomposition of $w$ into its compact factors is computed. Then, the factorization of $w$ is carried out from right to left. Specifically, in accordance with Proposition 9, the for-loop at lines 5–10 searches for the longest suffix $m'$ of $w$ that is an inverse Lyndon word. The update of $m'$ is managed by iteratively applying Proposition 10 at line 6. Once such a longest suffix is found (that is, when the condition at line 6 is true), it is added to the growing factorization $\mathcal{F}$ and a new search is started for the longest suffix of the remaining portion of the string. Otherwise (line 10), the suffix is extended. In the end, the complete factorization is returned.
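The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative transcription, not the authors' implementation: `compact_factors` is our stand-in for CompactFactors and is assumed to be Duval's algorithm run with the letter comparisons reversed, grouping equal consecutive factors; for readability it manipulates strings rather than index pairs, so this version is not linear time. The input is assumed to be nonempty and the alphabet order is assumed to match Python's character order.

```python
def strictly_less(u: str, v: str) -> bool:
    """u << v: at the first position where u and v differ, u has the smaller letter."""
    for a, b in zip(u, v):
        if a != b:
            return a < b
    return False  # one word is a prefix of the other


def compact_factors(w: str) -> list[tuple[str, int]]:
    """Pairs (l_t, e_t) obtained by Duval's scheme under the inverse letter order."""
    factors = []
    n, i = len(w), 0
    while i < n:
        j, k = i + 1, i
        while j < n and w[k] >= w[j]:        # comparisons reversed w.r.t. Duval
            k = i if w[k] > w[j] else k + 1
            j += 1
        period, start = j - k, i
        while i <= k:                        # skip all copies of the same factor
            i += period
        factors.append((w[start:start + period], (i - start) // period))
    return factors


def factorize(w: str) -> list[str]:
    """Right-to-left merging of compact factors, following Algorithm 1."""
    factors = compact_factors(w)                 # line 2
    result: list[str] = []                       # line 3
    word, exp = factors[-1]
    m = word * exp                               # line 4
    for word, exp in reversed(factors[:-1]):     # line 5
        if strictly_less(word, m):               # line 6
            result.insert(0, m)                  # line 7
            m = word * exp                       # line 8
        else:
            m = word * exp + m                   # line 10
    result.insert(0, m)                          # line 11
    return result                                # line 12
```

For example, `compact_factors("dabadabdabdadac")` yields `[("daba", 1), ("dab", 2), ("dadac", 1)]`, and `factorize("dabadabdabdadac")` returns `["daba", "dabdab", "dadac"]`: both tests at line 6 succeed, so no compact factors are merged.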

7.1 Correctness and complexity

We now prove that Algorithm 1 is correct, that is, that it computes the unique inverse Lyndon factorization of $w$ having the border property, namely $\operatorname{ICFL}(w)$. Formally:

Lemma 4.

Let $w\in\Sigma^{+}$, and let $\mathcal{F}$ be the result of Factorize($w$). Then, $\mathcal{F}=\operatorname{ICFL}(w)$.

Proof.

Let $(\ell_{1}^{e_{1}},\dots,\ell_{n}^{e_{n}})$ be the decomposition of $w$ into its compact factors, and let $L_{t}=\ell_{t}^{e_{t}}\cdots\ell_{n}^{e_{n}}$. We will denote by $m'_{t}$ (resp. $\mathcal{F}_{t}$) the value of $m'$ (resp. $\mathcal{F}$) at the end of iteration $t$. We will prove the following loop invariant: at the end of iteration $t$, the sequence $(m'_{t},\mathcal{F}_{t})$ is a compact factorization of $L_{t}$ having the border property. The claimed result will follow by Corollary 2.

Initialization.

Prior to entering the loop, $(m'_{n},\mathcal{F}_{n})=(\ell_{n}^{e_{n}})$, where the last equality follows from Proposition 9.

Maintenance.

Let $t\leq n-1$. By induction hypothesis, $\operatorname{ICFL}(L_{t+1})=(m'_{t+1},\mathcal{F}_{t+1})$.

Suppose $\ell_{t}\ll m'_{t+1}$. Then, by item 1 of Proposition 10, $\ell_{t}\cdot m'_{t+1}$ is not inverse Lyndon and $m'_{t+1}$ is the longest suffix of $\ell_{t}^{e_{t}}\cdot m'_{t+1}$ that is an inverse Lyndon word. Thus, by Proposition 9, $m'_{t+1}$ is the last factor of any compact factorization of $\ell_{t}^{e_{t}}\cdot m'_{t+1}$. Hence, $(m'_{t},\mathcal{F}_{t})=(\ell_{t}^{e_{t}},m'_{t+1},\mathcal{F}_{t+1})$ is a compact factorization of $L_{t}$ having the border property.

Now, consider the case where $\ell_{t}\not\ll m'_{t+1}$. Then, by the contrapositive of item 2 of Proposition 10, $\ell_{t}\cdot m'_{t+1}$ is inverse Lyndon and thus, again by item 2 of Proposition 10, $\ell_{t}^{e_{t}}\cdot m'_{t+1}$ is inverse Lyndon. Therefore, $(m'_{t},\mathcal{F}_{t})=(\ell_{t}^{e_{t}}\cdot m'_{t+1},\mathcal{F}_{t+1})$ is a compact factorization having the border property.

Termination.

After iteration $t=1$, the sequence $(m'_{1},\mathcal{F}_{1})=\operatorname{ICFL}(L_{1})=\operatorname{ICFL}(w)$.

Finally, line 11 sets $\mathcal{F}=(m'_{1},\mathcal{F}_{1})=\operatorname{ICFL}(w)$. ∎
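The loop invariant used in this proof can also be asserted at run time. The Python sketch below is ours and makes the following assumptions explicit: an inverse Lyndon factorization is a sequence of inverse Lyndon words $m_{1}\ll\cdots\ll m_{k}$, compact factors are computed by Duval's algorithm with reversed comparisons, and the input is nonempty. After each iteration $t$ it checks that the current sequence factorizes $L_{t}$, consists of inverse Lyndon words in $\ll$-increasing order, and has the border property.

```python
def strictly_less(u, v):
    """u << v: at the first differing position, u has the smaller letter."""
    for a, b in zip(u, v):
        if a != b:
            return a < b
    return False


def is_inverse_lyndon(w):
    return all(w[i:] < w for i in range(1, len(w)))


def has_border_property(factors):
    """No nonempty border of a factor is a prefix of the next factor."""
    return not any(
        m[:n] == m[-n:] and nxt.startswith(m[:n])
        for m, nxt in zip(factors, factors[1:])
        for n in range(1, len(m))
    )


def compact_factors(w):
    """(l_t, e_t) pairs via Duval's scheme under the inverse letter order."""
    factors, n, i = [], len(w), 0
    while i < n:
        j, k = i + 1, i
        while j < n and w[k] >= w[j]:
            k = i if w[k] > w[j] else k + 1
            j += 1
        period, start = j - k, i
        while i <= k:
            i += period
        factors.append((w[start:start + period], (i - start) // period))
    return factors


def factorize_checked(w):
    """Algorithm 1 with the loop invariant asserted after every iteration."""
    cf = compact_factors(w)
    result = []
    m = cf[-1][0] * cf[-1][1]
    for t in range(len(cf) - 2, -1, -1):
        word, exp = cf[t]
        if strictly_less(word, m):
            result.insert(0, m)
            m = word * exp
        else:
            m = word * exp + m
        current = [m] + result                       # (m'_t, F_t)
        suffix = "".join(l * e for l, e in cf[t:])   # L_t
        assert "".join(current) == suffix
        assert all(is_inverse_lyndon(f) for f in current)
        assert all(strictly_less(a, b) for a, b in zip(current, current[1:]))
        assert has_border_property(current)
    return [m] + result
```

Running `factorize_checked` on all short words over a small alphabet exercises both branches of the maintenance step.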

Function Factorize($w$) has time complexity that is linear in the length of $w$. Indeed, the sequence of compact factors obtained at line 2 can be computed in time linear in the length of $w$ by a simple modification of Duval's algorithm (see [17]). After that, each iteration $t$ of the loop at lines 5–10 can be implemented to run in time $\mathcal{O}(|\ell_{t}|)$. Indeed, the condition $\ell_{t}\ll m'$ can be checked by naively comparing $\ell_{t}$ against $m'$. Furthermore, the update of $m'$ and $\mathcal{F}$ can be done in constant time: in fact, $\ell_{t}$, $\ell_{t}^{e_{t}}$, $m'$ and $\mathcal{F}$ can all be implemented as pairs of indexes (in the case of the former three) or as a list of indexes (in the case of the latter) of $w$.
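The index-based representation mentioned above might look as follows. This is a sketch under our own naming (`compact_factor_spans`, `icfl_spans`), assuming a nonempty input and Duval's algorithm with reversed comparisons for the compact factors; it never copies substrings, and the test at line 6 of Algorithm 1 scans at most $|\ell_{t}|$ letters, so the total work after the Duval pass is bounded by the sum of the $|\ell_{t}|$.

```python
def compact_factor_spans(w: str) -> list[tuple[int, int, int]]:
    """Triples (start, |l_t|, e_t) describing the compact factors of w."""
    spans = []
    n, i = len(w), 0
    while i < n:
        j, k = i + 1, i
        while j < n and w[k] >= w[j]:        # inverse letter order
            k = i if w[k] > w[j] else k + 1
            j += 1
        period, start = j - k, i
        while i <= k:
            i += period
        spans.append((start, period, (i - start) // period))
    return spans


def icfl_spans(w: str) -> list[tuple[int, int]]:
    """Half-open index pairs (begin, end) of the factors returned by Algorithm 1."""
    spans = compact_factor_spans(w)
    cuts = [len(w)]                          # right ends of factors, right to left
    m_start, m_end = spans[-1][0], len(w)    # m' = w[m_start:m_end]
    for start, period, _ in reversed(spans[:-1]):
        less = False                         # will hold the outcome of l_t << m'
        for d in range(min(period, m_end - m_start)):
            a, b = w[start + d], w[m_start + d]
            if a != b:
                less = a < b
                break
        if less:                             # close m' and start a new factor
            cuts.append(m_start)
            m_end = m_start
        m_start = start                      # either new m' or extended m'
    cuts.append(0)
    cuts.reverse()
    return list(zip(cuts, cuts[1:]))
```

For the word `dabadabdabdadac` the function returns `[(0, 4), (4, 10), (10, 15)]`, that is, the factors `daba`, `dabdab` and `dadac`.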

8 Conclusions

We have uncovered a special connection between the Lyndon factorization under the inverse lexicographic ordering, named $\operatorname{CFL}_{in}$, and the canonical inverse Lyndon factorization, named $\operatorname{ICFL}$: there exists a unique inverse Lyndon factorization having the border property, and this unique factorization is $\operatorname{ICFL}$. Moreover, each inverse factor of $\operatorname{ICFL}$ is obtained by concatenating compact factors of $\operatorname{CFL}_{in}$. These properties give $\operatorname{ICFL}$ a constrained structure that deserves to be further explored to characterize properties of words. In particular, we believe the characterization of $\operatorname{ICFL}$ as a compact factorization, proved in the paper, could highlight novel properties related to the compression of a word, as investigated in [26]. Moreover, the number of compact factors seems to be a measure of the repetitiveness of a word that could also be used to speed up suffix sorting.

Finally, we believe that the characterization of $\operatorname{ICFL}$ in terms of $\operatorname{CFL}_{in}$ may be used to extend to $\operatorname{ICFL}$ the conservation property proved in [13] for $\operatorname{CFL}$. This property shows that the Lyndon factorization of a word $w$ preserves common factors with the factorization of a superstring of $w$. This extends the conservation of Lyndon factors explored for the product $u\cdot v$ of two words $u$ and $v$ [26, 27].

Acknowledgments

This research was supported by the European Union’s Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie grant agreement PANGAIA No. 872539, by MUR 2022YRB97K, PINC, Pangenome Informatics: from Theory to Applications, and by INdAM-GNCS Project 2023.

References

  • [1] Kuo-Tsai Chen, Ralph H. Fox, and Roger C. Lyndon. Free Differential calculus, IV. The Quotient Groups of the Lower Central Series. Ann. Math., 68:81–95, 1958.
  • [2] Roger Lyndon. On Burnside’s problem. Trans. Amer. Math. Soc., 77:202–215, 1954.
  • [3] Hideo Bannai, Juha Kärkkäinen, Dominik Köppl, and Marcin Piatkowski. Constructing and indexing the bijective and extended Burrows-Wheeler transform. Information and Computation, 297:105153, 2024.
  • [4] Elena Biagi, Davide Cenzato, Zsuzsanna Lipták, and Giuseppe Romana. On the number of equal-letter runs of the Bijective Burrows-Wheeler Transform. In CEUR Workshop Proceedings, volume 3587, pages 129–142. R. Piskac c/o Redaktion Sun SITE, Informatik V, RWTH Aachen, 2023.
  • [5] Dominik Köppl, Daiki Hashimoto, Diptarama Hendrian, and Ayumi Shinohara. In-place bijective Burrows-Wheeler Transforms. In Inge Li Gørtz and Oren Weimann, editors, 31st Annual Symposium on Combinatorial Pattern Matching, CPM 2020, June 17-19, 2020, Copenhagen, Denmark, volume 161 of LIPIcs, pages 21:1–21:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
  • [6] Nico Bertram, Jonas Ellert, and Johannes Fischer. Lyndon words accelerate suffix sorting. In Petra Mutzel, Rasmus Pagh, and Grzegorz Herman, editors, 29th Annual European Symposium on Algorithms, ESA 2021, September 6-8, 2021, Lisbon, Portugal (Virtual Conference), volume 204 of LIPIcs, pages 15:1–15:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
  • [7] Olivier Delgrange and Eric Rivals. Star: an algorithm to search for tandem approximate repeats. Bioinformatics, 20(16):2812–2820, 2004.
  • [8] Igor Martayan, Bastien Cazaux, Antoine Limasset, and Camille Marchet. Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets. bioRxiv, 2024.
  • [9] Paola Bonizzoni, Matteo Costantini, Clelia De Felice, Alessia Petescia, Yuri Pirola, Marco Previtali, Raffaella Rizzi, Jens Stoye, Rocco Zaccagnino, and Rosalba Zizza. Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches. Inf. Sci., 607:458–476, 2022.
  • [10] Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, and Rosalba Zizza. Inverse Lyndon words and inverse Lyndon factorizations of words. Adv. Appl. Math., 101:281–319, 2018.
  • [11] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. Suffix array and Lyndon factorization of a text. J. Discrete Algorithms, 28:2–8, 2014.
  • [12] Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, and Rosalba Zizza. Lyndon words versus inverse Lyndon words: Queries on suffixes and bordered words. In Alberto Leporati, Carlos Martín-Vide, Dana Shapira, and Claudio Zandron, editors, Language and Automata Theory and Applications - 14th International Conference, LATA 2020, Milan, Italy, March 4-6, 2020, Proceedings, volume 12038 of Lecture Notes in Computer Science, pages 385–396. Springer, 2020.
  • [13] Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, and Rosalba Zizza. On the longest common prefix of suffixes in an inverse Lyndon factorization and other properties. Theor. Comput. Sci., 862:24–41, 2021.
  • [14] Jean Berstel, Dominique Perrin, and Christophe Reutenauer. Codes and Automata. Encyclopedia of Mathematics and its Applications 129, Cambridge University Press, 2009.
  • [15] Christian Choffrut and Juhani Karhumäki. Combinatorics of Words. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, Vol. 1, pages 329–438. Springer-Verlag, Berlin, Heidelberg, 1997.
  • [16] M. Lothaire. Algebraic Combinatorics on Words, Encyclopedia Math. Appl., volume 90. Cambridge University Press, 1997.
  • [17] M. Lothaire. Applied Combinatorics on Words. Cambridge University Press, 2005.
  • [18] Christophe Reutenauer. Free Lie algebras. In Handbook of Algebra, London Mathematical Society Monographs. Oxford Science Publications, 1993.
  • [19] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
  • [20] Hideo Bannai, I Tomohiro, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. A new characterization of maximal repetitions by Lyndon trees. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pages 562–571, 2015.
  • [21] Jean-Pierre Duval. Factorizing Words over an Ordered Alphabet. J. Algorithms, 4(4):363–381, 1983.
  • [22] Harold Fredricksen and James Maiorana. Necklaces of beads in $k$ colors and $k$-ary de Bruijn sequences. Discrete Math., 23(3):207–210, 1978.
  • [23] Sukhpal Singh Ghuman, Emanuele Giaquinta, and Jorma Tarhio. Alternative Algorithms for Lyndon Factorization. In Proceedings of the Prague Stringology Conference 2014, Prague, Czech Republic, September 1-3, 2014, pages 169–178, 2014.
  • [24] Daniele A. Gewurz and Francesca Merola. Numeration and enumeration. Eur. J. Comb., 33(7):1547–1556, 2012.
  • [25] Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, and Rosalba Zizza. From the Lyndon factorization to the Canonical Inverse Lyndon factorization: back and forth. under submission, ArXiv, 2024.
  • [26] Faster Lyndon factorization algorithms for SLP and LZ78 compressed text. Theoretical Computer Science, 656:215–224, 2016.
  • [27] Alberto Apostolico and Maxime Crochemore. Fast parallel Lyndon factorization with applications. Mathematical systems theory, 28(2):89–108, 1995.

Appendix A The canonical inverse Lyndon factorization

In this section we summarize the relevant material on the canonical inverse Lyndon factorization; we refer to [10, 13] for a thorough discussion of this topic.

If $w$ is an inverse Lyndon word, then $\operatorname{ICFL}(w)=w$. Otherwise, $\operatorname{ICFL}(w)$ is recursively defined. The first factor of $\operatorname{ICFL}(w)$ is obtained by a special pair $(p,\overline{p})$ of words, named the canonical pair associated with $w$, which in turn is obtained by the shortest nonempty prefix $z$ of $w$ such that $z$ is not an inverse Lyndon word. Proposition 6.2 in [13] provides the following characterization of the pair $(p,\overline{p})$.

Proposition 11.

Let $w\in\Sigma^{+}$ be a word which is not an inverse Lyndon word. A pair of words $(p,\overline{p})$ is the canonical pair associated with $w$ if and only if the following conditions are satisfied.

  • (1)

    $z=p\overline{p}$ is the shortest nonempty prefix of $w$ which is not an inverse Lyndon word.

  • (2)

    $p=ras$ and $\overline{p}=rb$, where $r,s\in\Sigma^{*}$, $a,b\in\Sigma$ and $r$ is the shortest prefix of $p\overline{p}$ such that $p\overline{p}=rasrb$, with $a<b$.

  • (3)

    $\overline{p}$ is an inverse Lyndon word.

Given a word $w$ which is not an inverse Lyndon word, Proposition 11 suggests a method to identify the canonical pair $(p,\overline{p})$ associated with $w$: just find the shortest nonempty prefix $z$ of $w$ which is not an inverse Lyndon word and then a factorization $z=p\overline{p}$ such that conditions (2) and (3) in Proposition 11 are satisfied.
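As a rough illustration of this method, the following Python sketch searches for the canonical pair by brute force. It assumes the definition recalled in [13], namely that a word is an inverse Lyndon word when it is strictly greater than each of its nonempty proper suffixes, and it relies on Python's built-in string comparison, which agrees with the lexicographic order on lowercase letters ordered as a < b < c < d. The function names `is_inverse_lyndon` and `canonical_pair` are hypothetical and chosen only for this illustration; no attempt is made at efficiency.

```python
def is_inverse_lyndon(w: str) -> bool:
    """True if w is nonempty and strictly greater than every nonempty proper suffix."""
    return len(w) > 0 and all(w[i:] < w for i in range(1, len(w)))


def canonical_pair(w: str) -> tuple[str, str]:
    """Return (p, p_bar) for a word w that is not an inverse Lyndon word."""
    # Condition (1): z is the shortest nonempty prefix that is not inverse Lyndon.
    z = next((w[:n] for n in range(1, len(w) + 1)
              if not is_inverse_lyndon(w[:n])), None)
    if z is None:
        raise ValueError("w is an inverse Lyndon word: no canonical pair")
    n = len(z)
    # Try r = z[:k] for increasing k, so the first hit uses the shortest r.
    for k in range(n // 2):
        r, a, b = z[:k], z[k], z[-1]
        # Condition (2): z = r a s r b with a < b.
        if z[n - 1 - k:n - 1] == r and a < b:
            p, p_bar = z[:n - 1 - k], z[n - 1 - k:]
            # Condition (3): p_bar = r b is an inverse Lyndon word.
            if is_inverse_lyndon(p_bar):
                return p, p_bar
    raise AssertionError("no factorization satisfying (2) and (3) was found")


print(canonical_pair("dabadabdabdadac"))  # ('daba', 'dabd')
```

For $w=dabadabdabdadac$ the shortest prefix which is not an inverse Lyndon word is $z=dabadabd$, and the sketch returns $p=daba$, $\overline{p}=dabd$, that is, $r=dab$, $a=a$, $s$ empty and $b=d$.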

The canonical inverse Lyndon factorization is defined recursively as follows.

Definition 13.

Let $w\in\Sigma^{+}$.
(Basis Step) If $w$ is an inverse Lyndon word, then $\operatorname{ICFL}(w)=(w)$.
(Recursive Step) If $w$ is not an inverse Lyndon word, let $(p,\overline{p})$ be the canonical pair associated with $w$ and let $v\in\Sigma^{*}$ be such that $w=pv$. Let $\operatorname{ICFL}(v)=(m'_{1},\ldots,m'_{k})$ and let $r,s\in\Sigma^{*}$, $a,b\in\Sigma$ be such that $p=ras$, $\overline{p}=rb$ with $a<b$.

\[
\operatorname{ICFL}(w)=\begin{cases}
(p,\operatorname{ICFL}(v)) & \mbox{ if } \overline{p}=rb\leq_{p}m'_{1}\\
(pm'_{1},m'_{2},\ldots,m'_{k}) & \mbox{ if } m'_{1}\leq_{p}r
\end{cases}
\]
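The recursive definition can be transcribed almost literally. The Python sketch below is only a direct, non-optimized reading of Definition 13 and is not the linear time algorithm discussed in the paper. It assumes that an inverse Lyndon word is a word strictly greater than each of its nonempty proper suffixes, that Python string comparison models the order on $\Sigma$, and that $\leq_{p}$ denotes the prefix relation. All function names are hypothetical.

```python
def is_inverse_lyndon(w: str) -> bool:
    """True if w is nonempty and strictly greater than every nonempty proper suffix."""
    return len(w) > 0 and all(w[i:] < w for i in range(1, len(w)))


def canonical_pair(w: str) -> tuple[str, str]:
    """Brute-force search for (p, p_bar) following Proposition 11."""
    # z: shortest nonempty prefix of w that is not an inverse Lyndon word.
    z = next(w[:n] for n in range(1, len(w) + 1)
             if not is_inverse_lyndon(w[:n]))
    n = len(z)
    for k in range(n // 2):  # k = |r|, shortest r first
        if z[n - 1 - k:n - 1] == z[:k] and z[k] < z[-1]:
            p, p_bar = z[:n - 1 - k], z[n - 1 - k:]
            if is_inverse_lyndon(p_bar):
                return p, p_bar
    raise AssertionError("no canonical pair found")


def icfl(w: str) -> list[str]:
    """Factors of ICFL(w), computed by direct recursion on Definition 13."""
    if is_inverse_lyndon(w):          # Basis Step
        return [w]
    p, p_bar = canonical_pair(w)      # Recursive Step
    r = p_bar[:-1]                    # p_bar = r b
    rest = icfl(w[len(p):])           # ICFL(v), where w = p v
    m1 = rest[0]
    if m1.startswith(p_bar):          # first case: p_bar is a prefix of m'_1
        return [p] + rest
    if r.startswith(m1):              # second case: m'_1 is a prefix of r
        return [p + m1] + rest[1:]
    raise AssertionError("neither case of Definition 13 applies")


print(icfl("dabadabdabdadac"))  # ['daba', 'dabdab', 'dadac']
print(icfl("cbacbcc"))          # ['cbacb', 'cc']
```

The first call only uses the first case of the definition. The word $cbacbcc$ over $a<b<c$ is our own illustrative input, not taken from the paper: there $p=cba$, $r=cb$, $\operatorname{ICFL}(cbcc)=(cb,cc)$, and since $m'_{1}=cb\leq_{p}r$ the second case merges $p$ with $m'_{1}$.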

The following example is in [10].

Example 7.

Let $\Sigma=\{a,b,c,d\}$ with $a<b<c<d$, $w=dabadabdabdadac$. We have $\operatorname{CFL}_{in}(w)=(daba,dab,dab,dadac)$ and $\operatorname{ICFL}(w)=(daba,dabdab,dadac)$. Consider $z=dabdadacddbdc$. We have $\operatorname{ICFL}(z)=\operatorname{CFL}_{in}(z)=(dab,dadac,ddbdc)$.
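The values of $\operatorname{CFL}_{in}$ in this example can be checked with Duval's algorithm [21] run with the letter comparison reversed. The Python sketch below assumes that $\operatorname{CFL}_{in}$ is the Lyndon factorization with respect to the inverse lexicographic order, as stated in the abstract. The function `duval` and its parameter `less` are hypothetical names introduced for this illustration.

```python
from typing import Callable


def duval(s: str, less: Callable[[str, str], bool]) -> list[str]:
    """Lyndon factorization of s for the letter order given by `less` (Duval)."""
    n, i, factors = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        # Extend while s[i:j+1] is still a prefix of a power of a Lyndon word.
        while j < n and not less(s[j], s[k]):
            if less(s[k], s[j]):
                k = i        # strictly larger letter: one longer Lyndon word
            else:
                k += 1       # equal letter: continue the current period
            j += 1
        # Emit the repeated Lyndon factors of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def cfl_in(s: str) -> list[str]:
    """Lyndon factorization under the inverse order (d < c < b < a)."""
    return duval(s, lambda x, y: x > y)


print(cfl_in("dabadabdabdadac"))  # ['daba', 'dab', 'dab', 'dadac']
print(cfl_in("dabdadacddbdc"))    # ['dab', 'dadac', 'ddbdc']
```

Comparing the first output with $\operatorname{ICFL}(w)=(daba,dabdab,dadac)$ shows the two consecutive factors $dab,dab$ of $\operatorname{CFL}_{in}(w)$ grouped into the single factor $dabdab$.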