Unveiling the connection between the Lyndon factorization and the canonical inverse Lyndon factorization via a border property

Paola Bonizzoni, Brian Riccardi
Dip. di Informatica, Sistemistica e Comunicazione
University of Milano-Bicocca
viale Sarca 336, 20126 Milan, Italy
{paola.bonizzoni,[email protected]}unimib.it
Clelia De Felice, Rocco Zaccagnino, Rosalba Zizza
Dip. di Informatica
University of Salerno
via Giovanni Paolo II 132, 84084 Fisciano, Italy
{cdefelice,rzaccagnino,rzizza}email@email

Abstract

The notions of Lyndon word and Lyndon factorization have been shown to have unexpected applications in theory as well as in developing novel algorithms on words. A counterpart to these notions are those of inverse Lyndon word and inverse Lyndon factorization. Unlike Lyndon words, inverse Lyndon words may be bordered. The relationship between the two factorizations is related to the inverse lexicographic ordering, and has only recently been explored. More precisely, a main open question is how to get an inverse Lyndon factorization from a classical Lyndon factorization under the inverse lexicographic ordering, named $\operatorname{CFL}_{in}$ . In this paper we reveal a strong connection between these two factorizations where the border plays a relevant role. More precisely, we show two main results. We say that a factorization has the border property if a nonempty border of a factor cannot be a prefix of the next factor. First we show that there exists a unique inverse Lyndon factorization having the border property. Then we show that this unique factorization with the border property is the so-called canonical inverse Lyndon factorization, named $\operatorname{ICFL}$ . By showing that $\operatorname{ICFL}$ is obtained by compacting factors of the Lyndon factorization over the inverse lexicographic ordering, we provide a linear time algorithm for computing $\operatorname{ICFL}$ from $\operatorname{CFL}_{in}$ .

Keywords Lyndon words $\cdot$ Lyndon factorization $\cdot$ Combinatorial algorithms on words

1 Introduction

The theoretical investigation of combinatorial properties of well-known word factorizations is a research topic that has recently attracted special interest, especially for improving the efficiency of algorithms. Among these factorizations, the Lyndon Factorization introduced by Chen, Fox, Lyndon in [1], named $\operatorname{CFL}$ , undoubtedly stands out. Any word $w$ admits a unique factorization $\operatorname{CFL}(w)$ , that is, a lexicographically non-increasing sequence of factors which are Lyndon words. A Lyndon word $w$ is strictly lexicographically smaller than each of its proper cyclic shifts, or, equivalently, than each of its nonempty proper suffixes [2]. Interesting applications of the Lyndon factorization and Lyndon words are the development of the bijective Burrows-Wheeler Transforms [3, 4, 5] and a novel algorithm for sorting suffixes [6]. In particular, the notion of a Lyndon word has been re-discovered various times as a theoretical tool to locate short motifs [7] and relevant k-mers in bioinformatics applications [8]. In this line of research, Lyndon-based word factorizations have been explored to define a novel feature representation for biological sequences based on theoretical combinatorial properties proved to capture sequence similarities [9].

The notion of a Lyndon word has a counterpart that is the notion of an inverse Lyndon word, i.e., a word lexicographically greater than each of its nonempty proper suffixes. Inverting the relation between a word and its suffixes, as between Lyndon words and inverse Lyndon words, leads to different properties. Indeed, although a word could admit more than one inverse Lyndon factorization, that is, a factorization into a nonincreasing product of inverse Lyndon words, in [10] the Canonical Inverse Lyndon Factorization, named $\operatorname{ICFL}$ , was introduced. $\operatorname{ICFL}$ maintains the main properties of $\operatorname{CFL}$ : it is unique and can be computed in linear time. In addition, it maintains a similar Compatibility Property, used for obtaining the sorting of the suffixes of $w$ (“global suffixes”) by using the sorting of the suffixes of each factor of $\operatorname{CFL}(w)$ (“local suffixes”) [11]. Most notably, $\operatorname{ICFL}(w)$ has another interesting property [10, 12, 13]: we can provide an upper bound on the length of the longest common prefix of two substrings of a word $w$ starting from different positions.

A relationship between $\operatorname{ICFL}(w)$ and $\operatorname{CFL}(w)$ has been proved by using the notion of grouping [10]. Let $\operatorname{CFL}_{in}(w)$ be the Lyndon factorization of $w$ with respect to the inverse lexicographic order. It is proved that $\operatorname{ICFL}(w)$ is obtained by concatenating the factors of a non-increasing maximal chain with respect to the prefix order, denoted by $\mathcal{PMC}$ , in $\operatorname{CFL}_{in}(w)$ (see Section 6). Despite this result, the connection between $\operatorname{CFL}_{in}(w)$ and the inverse Lyndon factorization still remained obscure, mainly due to the fact that a word may have multiple inverse Lyndon factorizations.

In this paper, we explore this connection between $\operatorname{CFL}_{in}$ and the inverse Lyndon factorizations. Our first main contribution consists in showing that there is a unique inverse Lyndon factorization of a word that has the border property. The border property states that any nonempty border of a factor cannot be a prefix of the next factor. We further highlight the aforementioned connection by proving that the inverse Lyndon factorization with the border property is a compact factorization (Definition 12), i.e., each inverse Lyndon factor is the concatenation of compact factors, each obtained by concatenating the longest sequence of identical words in a $\mathcal{PMC}$ . We then show the second contribution of this paper: this unique factorization is $\operatorname{ICFL}$ itself, and we then provide a simpler linear time algorithm for computing $\operatorname{ICFL}$ . Our algorithm is based on a new property that characterizes $\operatorname{ICFL}(w)$ : the last factor in an inverse Lyndon factorization with the border property of $w$ is the longest suffix of $w$ that is an inverse Lyndon word. Recall that the Lyndon factorization of $w$ has a similar property: the last factor is the longest suffix of $w$ that is a Lyndon word.

2 Words

Throughout this paper we follow [14, 15, 16, 17, 18] for the notations. We denote by $\Sigma^{*}$ the free monoid generated by a finite alphabet $\Sigma$ and we set $\Sigma^{+}=\Sigma^{*}\setminus 1$ , where $1$ is the empty word. For a word $w\in\Sigma^{*}$ , we denote by $|w|$ its length. A word $x\in\Sigma^{*}$ is a factor of $w\in\Sigma^{*}$ if there are $u_{1},u_{2}\in\Sigma^{*}$ such that $w=u_{1}xu_{2}$ . If $u_{1}=1$ (resp. $u_{2}=1$ ), then $x$ is a prefix (resp. suffix) of $w$ . A factor (resp. prefix, suffix) $x$ of $w$ is proper if $x\not=w$ . Two words $x,y$ are incomparable for the prefix order, denoted as $x\Join y$ , if neither $x$ is a prefix of $y$ nor $y$ is a prefix of $x$ . Otherwise, $x,y$ are comparable for the prefix order. We write $x\leq_{p}y$ if $x$ is a prefix of $y$ and $x\geq_{p}y$ if $y$ is a prefix of $x$ . The notion of a pair of words comparable (or incomparable) for the suffix order is defined symmetrically.

We recall that, given a nonempty word $w$ , a border of $w$ is a word which is both a proper prefix and a suffix of $w$ [19]. The longest proper prefix of $w$ which is a suffix of $w$ is also called the border of $w$ [19, 17]. A word $w\in\Sigma^{+}$ is bordered if it has a nonempty border. Otherwise, $w$ is unbordered. A nonempty word $w$ is primitive if $w=x^{k}$ implies $k=1$ . An unbordered word is primitive. A sesquipower of a word $x$ is a word $w=x^{n}p$ where $p$ is a proper prefix of $x$ and $n\geq 1$ . Two words $x,y$ are called conjugate if there exist words $u,v$ such that $x=uv,y=vu$ . The conjugacy relation is an equivalence relation. A conjugacy class is a class of this equivalence relation.
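As an illustration of the notion of border recalled above, the following minimal Python sketch lists the borders of a word by direct comparison of prefixes and suffixes. The function names `borders` and `is_bordered` are ours and purely illustrative; the quadratic scan is chosen for readability, not efficiency.

```python
def borders(w: str) -> list[str]:
    """Return every border of w, shortest first.

    A border is a word that is both a proper prefix and a suffix of w,
    so the empty word is always a border of a nonempty word.
    """
    return [w[:k] for k in range(len(w)) if w.endswith(w[:k])]


def is_bordered(w: str) -> bool:
    """A nonempty word is bordered if it has a nonempty border."""
    return any(b != "" for b in borders(w))


if __name__ == "__main__":
    print(borders("dabda"))      # the empty word and "da"
    print(is_bordered("dadac"))  # no nonempty border
```

With this convention, the longest element returned by `borders(w)` is what the text calls the border of $w$ .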

Definition 1.

Let $(\Sigma,<)$ be a totally ordered alphabet. The lexicographic (or alphabetic order) $\prec$ on $(\Sigma^{*},<)$ is defined by setting $x\prec y$ if

• $x$ is a proper prefix of $y$ , or
• $x=ras$ , $y=rbt$ , $a<b$ , for $a,b\in\Sigma$ and $r,s,t\in\Sigma^{*}$ .

In the next part of the paper we will implicitly refer to totally ordered alphabets. For two nonempty words $x,y$ , we write $x\ll y$ if $x\prec y$ and $x$ is not a proper prefix of $y$ [20]. We also write $y\succ x$ if $x\prec y$ . Basic properties of the lexicographic order are recalled below.

Lemma 1.

For $x,y\in\Sigma^{+}$ , the following properties hold.

(1) $x\prec y$ if and only if $zx\prec zy$ , for every word $z$ .
(2) If $x\ll y$ , then $xu\ll yv$ for all words $u,v$ .
(3) If $x\prec y\prec xz$ for a word $z$ , then $y=xy^{\prime}$ for some word $y^{\prime}$ such that $y^{\prime}\prec z$ .
(4) If $x\ll y$ and $y\ll z$ , then $x\ll z$ .
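The relations $\prec$ and $\ll$ can be made concrete with a short Python sketch. It assumes that the alphabet order $<$ coincides with the code point order of the characters, so that Python's built-in string comparison agrees with $\prec$ ; the names `prec` and `ll` are illustrative only.

```python
def prec(x: str, y: str) -> bool:
    """x ≺ y: x is a proper prefix of y, or x = r a s and y = r b t with a < b.

    Assumption: letters are ordered by code point, so this is exactly
    Python's strict comparison on str.
    """
    return x < y


def ll(x: str, y: str) -> bool:
    """x ≪ y: x ≺ y and x is not a proper prefix of y."""
    return x < y and not y.startswith(x)


if __name__ == "__main__":
    # Lemma 1 (2) on one instance: x ≪ y survives arbitrary right extensions.
    print(ll("dab", "dad"), ll("dab" + "ddd", "dad" + "a"))
```

The check in the `__main__` part is only a spot check of item (2) of Lemma 1 on a single pair of words, not a proof.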

Let $\mathcal{S}_{1},\ldots,\mathcal{S}_{t}$ be sequences, with $\mathcal{S}_{j}=(s_{j,1},\ldots,s_{j,r_{j}})$ . For abbreviation, we let $(\mathcal{S}_{1},\ldots,\mathcal{S}_{t})$ stand for the sequence $(s_{1,1},\ldots,s_{1,r_{1}},\ldots,s_{t,1},\ldots,s_{t,r_{t}})$ .

3 Lyndon words

Definition 2.

A Lyndon word $w\in\Sigma^{+}$ is a word which is primitive and the smallest one in its conjugacy class for the lexicographic order.

Example 1.

Let $\Sigma=\{a,b\}$ with $a<b$ . The words $a$ , $b$ , $aaab$ , $abbb$ , $aabab$ and $aababaabb$ are Lyndon words. On the contrary, $abab$ , $aba$ and $abaab$ are not Lyndon words.
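The words of Example 1 can be checked mechanically. The sketch below uses the characterization recalled in the Introduction (a Lyndon word is strictly smaller than each of its nonempty proper suffixes) and assumes that the alphabet order is the code point order; `is_lyndon` is an illustrative name and the test is quadratic by design.

```python
def is_lyndon(w: str) -> bool:
    """True if w is nonempty and strictly smaller than each nonempty proper suffix."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))


if __name__ == "__main__":
    for word in ["a", "b", "aaab", "abbb", "aabab", "aababaabb", "abab", "aba", "abaab"]:
        print(word, is_lyndon(word))
```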

Proposition 1.

Each Lyndon word $w$ is unbordered.

A class of conjugacy is also called a necklace and often identified with the minimal word for the lexicographic order in it. We will adopt this terminology. Then a word is a necklace if and only if it is a power of a Lyndon word. A prenecklace is a prefix of a necklace. Then clearly any nonempty prenecklace $w$ has the form $w=(uv)^{k}u$ , where $uv$ is a Lyndon word, $u\in\Sigma^{*}$ , $v\in\Sigma^{+}$ , $k\geq 1$ , that is, $w$ is a sesquipower of a Lyndon word $uv$ . The following result has been proved in [21]. It shows that the nonempty prefixes of Lyndon words are exactly the nonempty prefixes of the powers of Lyndon words with the exclusion of $c^{k}$ , where $c$ is the maximal letter and $k\geq 2$ .

Proposition 2.

A word is a nonempty prefix of a Lyndon word if and only if it is a sesquipower of a Lyndon word distinct from $c^{k}$ , where $c$ is the maximal letter and $k\geq 2$ .

In the following $L=L_{(\Sigma^{*},<)}$ will be the set of Lyndon words, totally ordered by the relation $\prec$ on $(\Sigma^{*},<)$ .

Theorem 3.1.

Any word $w\in\Sigma^{+}$ can be written in a unique way as a nonincreasing product $w=\ell_{1}\ell_{2}\cdots\ell_{h}$ of Lyndon words, i.e., in the form

w=\ell_{1}\ell_{2}\cdots\ell_{h},\mbox{ with }\ell_{j}\in L\mbox{ and }\ell_{1}\succeq\ell_{2}\succeq\ldots\succeq\ell_{h}

(3.1)

The sequence $\operatorname{CFL}(w)=(\ell_{1},\ldots,\ell_{h})$ in Eq. (3.1) is called the Lyndon decomposition (or Lyndon factorization) of $w$ . It is denoted by $\operatorname{CFL}(w)$ because Theorem 3.1 is usually credited to Chen, Fox and Lyndon [1]. The following result, proved in [21], is necessary for our aims.

Corollary 1.

Let $w\in\Sigma^{+}$ , let $\ell_{1}$ be its longest prefix which is a Lyndon word and let $w^{\prime}$ be such that $w=\ell_{1}w^{\prime}$ . If $w^{\prime}\not=1$ , then $\operatorname{CFL}(w)=(\ell_{1},\operatorname{CFL}(w^{\prime}))$ .

Sometimes we need to emphasize consecutive equal factors in $\operatorname{CFL}$ . We write $\operatorname{CFL}(w)=(\ell_{1}^{n_{1}},\ldots,\ell_{r}^{n_{r}})$ to denote a tuple of $n_{1}+\ldots+n_{r}$ Lyndon words, where $r>0$ , $n_{1},\ldots,n_{r}\geq 1$ . Precisely $\ell_{1}\succ\ldots\succ\ell_{r}$ are Lyndon words, also named Lyndon factors of $w$ . There is a linear time algorithm to compute the pair $(\ell_{1},n_{1})$ and thus, by iteration, the Lyndon factorization of $w$ [22, 17]. Linear time algorithms may also be found in [21] and in the more recent paper [23].
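For concreteness, here is a minimal Python sketch of a linear time computation of the Lyndon factorization in the style of the classical algorithm usually attributed to Duval. It is not claimed to be the exact procedure of the cited references; the names `cfl` and `cfl_with_multiplicity` are ours, and the alphabet order is assumed to be the code point order.

```python
from itertools import groupby


def cfl(w: str) -> list[str]:
    """Lyndon factorization of w as a list of Lyndon words (Duval-style scan)."""
    factors, i, n = [], 0, len(w)
    while i < n:
        j, k = i + 1, i
        # Extend the current prenecklace w[i:j] while it stays a prenecklace.
        while j < n and w[k] <= w[j]:
            k = i if w[k] < w[j] else k + 1
            j += 1
        # The period j - k is the length of the Lyndon factor; emit each full copy.
        while i <= k:
            factors.append(w[i:i + j - k])
            i += j - k
    return factors


def cfl_with_multiplicity(w: str) -> list[tuple[str, int]]:
    """Group consecutive equal factors as pairs (l_i, n_i)."""
    return [(f, len(list(g))) for f, g in groupby(cfl(w))]


if __name__ == "__main__":
    print(cfl("abaab"))
    print(cfl_with_multiplicity("abab"))
```

The second function returns the pairs $(\ell_{i},n_{i})$ of the notation $(\ell_{1}^{n_{1}},\ldots,\ell_{r}^{n_{r}})$ introduced above.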

4 Inverse Lyndon words

For the material in this section see [10, 12, 13]. Inverse Lyndon words are related to the inverse alphabetic order, whose definition is recalled below.

Definition 3.

Let $(\Sigma,<)$ be a totally ordered alphabet. The inverse $<_{in}$ of $<$ is defined by

\forall a,b\in\Sigma\quad b<_{in}a\Leftrightarrow a<b

The inverse lexicographic or inverse alphabetic order on $(\Sigma^{*},<)$ , denoted $\prec_{in}$ , is the lexicographic order on $(\Sigma^{*},<_{in})$ .

Example 2.

Let $\Sigma=\{a,b,c,d\}$ with $a<b<c<d$ . Then $dab\prec dabd$ and $dabda\prec dac$ . We have $d<_{in}c<_{in}b<_{in}a$ . Therefore $dab\prec_{in}dabd$ and $dac\prec_{in}dabda$ .

Of course for all $x,y\in\Sigma^{*}$ such that $x\Join y$ ,

y\prec_{in}x\Leftrightarrow x\prec y.

Moreover, in this case $x\ll y$ . This justifies the adopted terminology.
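The comparisons of Example 2 and the equivalence just stated can be reproduced with the following Python sketch. It assumes that $<$ is the code point order and realizes $<_{in}$ by negating code points; `inv_key`, `prec_in` and `incomparable` are illustrative names.

```python
def inv_key(x: str) -> tuple[int, ...]:
    """Map a word to a tuple whose natural order is the inverse lexicographic order.

    Negating code points reverses the letter order, while a proper prefix
    still compares as smaller, as required by the definition.
    """
    return tuple(-ord(c) for c in x)


def prec_in(x: str, y: str) -> bool:
    """x ≺_in y."""
    return inv_key(x) < inv_key(y)


def incomparable(x: str, y: str) -> bool:
    """x and y are incomparable for the prefix order."""
    return not x.startswith(y) and not y.startswith(x)


if __name__ == "__main__":
    print(prec_in("dab", "dabd"), prec_in("dac", "dabda"))
    x, y = "dabda", "dac"
    print(incomparable(x, y), prec_in(y, x) == (x < y))
```

Note that for words comparable for the prefix order the two orders agree, which is why the equivalence is stated only for incomparable pairs.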

From now on, $L_{in}=L_{(\Sigma^{*},<_{in})}$ denotes the set of the Lyndon words on $\Sigma^{*}$ with respect to the inverse lexicographic order. Following [24], a word $w\in L_{in}$ will be named an anti-Lyndon word. Correspondingly, an anti-prenecklace will be a prefix of an anti-necklace, which in turn will be a necklace with respect to the inverse lexicographic order.

In the following, we denote by $\operatorname{CFL}_{in}(w)$ the Lyndon factorization of $w$ with respect to the inverse order $<_{in}$ .

Definition 4.

A word $w\in\Sigma^{+}$ is an inverse Lyndon word if $s\prec w$ , for each nonempty proper suffix $s$ of $w$ .

Example 3.

The words $a$ , $b$ , $aaaaa$ , $bbba$ , $baaab$ , $bbaba$ and $bbababbaa$ are inverse Lyndon words on $\{a,b\}$ , with $a<b$ . On the contrary, $aaba$ is not an inverse Lyndon word since $aaba\prec ba$ . Analogously, $aabba\prec ba$ and thus $aabba$ is not an inverse Lyndon word.
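Definition 4 translates directly into a quadratic test, sketched below in Python under the assumption that the alphabet order is the code point order. The name `is_inverse_lyndon` is illustrative.

```python
def is_inverse_lyndon(w: str) -> bool:
    """True if w is nonempty and every nonempty proper suffix s satisfies s ≺ w."""
    return len(w) > 0 and all(w[i:] < w for i in range(1, len(w)))


if __name__ == "__main__":
    for word in ["a", "b", "aaaaa", "bbba", "baaab", "bbaba", "bbababbaa", "aaba", "aabba"]:
        print(word, is_inverse_lyndon(word))
```

Running it on the words of Example 3 reproduces the classification given there; note that bordered words such as $baaab$ are accepted.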

The following result, proved in [10, 13], and also in [25], summarizes some properties of the inverse Lyndon words.

Proposition 3.

Let $w\in\Sigma^{+}$ . Then we have

1. The word $w$ is an anti-Lyndon word if and only if it is an unbordered inverse Lyndon word.
2. The word $w$ is an inverse Lyndon word if and only if $w$ is a nonempty anti-prenecklace.
3. If $w$ is an inverse Lyndon word, then any nonempty prefix of $w$ is an inverse Lyndon word.

Definition 5.

An inverse Lyndon factorization of a word $w\in\Sigma^{+}$ is a sequence $(m_{1},\ldots,m_{k})$ of inverse Lyndon words such that $m_{1}\cdots m_{k}=w$ and $m_{i}\ll m_{i+1}$ , $1\leq i\leq k-1$ .

As the following example in [10] shows, a word may have different inverse Lyndon factorizations.

Example 4.

Let $\Sigma=\{a,b,c,d\}$ with $a<b<c<d$ , $z=dabdadacddbdc$ . It is easy to see that $(dab,dadacd,db,dc)$ , $(dabda,dac,ddbdc)$ , $(dab,dadac,ddbdc)$ are all inverse Lyndon factorizations of $z$ .
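The three factorizations of Example 4 can be verified against Definition 5 with a few lines of Python. This is only a checking sketch, assuming the code point order on letters; all function names are illustrative.

```python
def is_inverse_lyndon(w: str) -> bool:
    """Definition 4: every nonempty proper suffix is smaller than w."""
    return len(w) > 0 and all(w[i:] < w for i in range(1, len(w)))


def ll(x: str, y: str) -> bool:
    """x ≪ y: x ≺ y and x is not a proper prefix of y."""
    return x < y and not y.startswith(x)


def is_inverse_lyndon_factorization(w: str, factors: tuple[str, ...]) -> bool:
    """Definition 5: factors are inverse Lyndon words, multiply to w, and m_i ≪ m_{i+1}."""
    return (
        "".join(factors) == w
        and all(is_inverse_lyndon(m) for m in factors)
        and all(ll(a, b) for a, b in zip(factors, factors[1:]))
    )


if __name__ == "__main__":
    z = "dabdadacddbdc"
    for f in [("dab", "dadacd", "db", "dc"), ("dabda", "dac", "ddbdc"), ("dab", "dadac", "ddbdc")]:
        print(f, is_inverse_lyndon_factorization(z, f))
```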

5 The border property

In this section we prove the main result of this paper, namely, for any nonempty word $w$ , there exists a unique inverse Lyndon factorization of $w$ which has a special property, named the border property.

Definition 6 (Border property).

Let $w\in\Sigma^{+}$ . A factorization $(m_{1},\ldots,m_{k})$ of $w$ has the border property if each nonempty border $z$ of $m_{i}$ is not a prefix of $m_{i+1}$ , $1\leq i\leq k-1$ .
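Definition 6 can be tested directly. The Python sketch below (illustrative names, no attempt at efficiency) checks the border property on the three factorizations of $z=dabdadacddbdc$ from Example 4; only the last one passes, since $dadacd$ has the border $d$ , which is a prefix of $db$ , and $dabda$ has the border $da$ , which is a prefix of $dac$ .

```python
def nonempty_borders(w: str) -> list[str]:
    """All nonempty words that are both a proper prefix and a suffix of w."""
    return [w[:k] for k in range(1, len(w)) if w.endswith(w[:k])]


def has_border_property(factors: tuple[str, ...]) -> bool:
    """No nonempty border of m_i is a prefix of m_{i+1}."""
    return all(
        not nxt.startswith(z)
        for m, nxt in zip(factors, factors[1:])
        for z in nonempty_borders(m)
    )


if __name__ == "__main__":
    for f in [("dab", "dadacd", "db", "dc"), ("dabda", "dac", "ddbdc"), ("dab", "dadac", "ddbdc")]:
        print(f, has_border_property(f))
```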

We first prove a fundamental property of the inverse Lyndon factorizations of $w$ which have the border property.

Lemma 2.

Let $w\in\Sigma^{+}$ , let $(m_{1},\ldots,m_{k})$ be an inverse Lyndon factorization of $w$ having the border property. If $\alpha$ is a nonempty border of $m_{j}$ , $1\leq j\leq k-1$ , then there exists a nonempty prefix $\beta$ of $m_{j+1}$ such that $|\beta|\leq|\alpha|$ and $\alpha\ll\beta$ .

Proof.

Let $w\in\Sigma^{+}$ , let $(m_{1},\ldots,m_{k})$ be an inverse Lyndon factorization of $w$ having the border property, let $\alpha$ be a nonempty border of $m_{j}$ , $1\leq j\leq k-1$ . We distinguish two cases: either $|m_{j+1}|<|\alpha|$ or $|m_{j+1}|\geq|\alpha|$ .

Assume $|m_{j+1}|<|\alpha|$ . By hypothesis $(m_{1},\ldots,m_{k})$ is an inverse Lyndon factorization, hence $m_{j}\ll m_{j+1}$ , that is, there are $r,s,t\in\Sigma^{*}$ , $a,b\in\Sigma$ , such that $a<b$ and $m_{j}=ras$ , $m_{j+1}=rbt$ . Obviously $|ra|\leq|m_{j+1}|<|\alpha|$ , thus there is $s^{\prime}\in\Sigma^{*}$ such that $\alpha=ras^{\prime}$ . Consequently, $\alpha=ras^{\prime}\ll rbt=m_{j+1}$ and our claim holds with $\beta=m_{j+1}$ .

Assume $|m_{j+1}|\geq|\alpha|$ . Let $\beta$ be the nonempty prefix of $m_{j+1}$ such that $|\beta|=|\alpha|$ . Clearly $\beta\not=\alpha$ because $(m_{1},\ldots,m_{k})$ has the border property. Since $\alpha$ and $\beta$ are two different nonempty words of the same length, either $\beta\ll\alpha$ or $\alpha\ll\beta$ . The first case leads to a contradiction because if $\beta\ll\alpha$ then $m_{j+1}\ll m_{j}$ by Lemma 1 and this contradicts the fact that $(m_{1},\ldots,m_{k})$ is an inverse Lyndon factorization. Thus, $\alpha\ll\beta$ and the proof is complete. ∎

Proposition 4.

For each $w\in\Sigma^{+}$ , there exists a unique inverse Lyndon factorization of $w$ having the border property.

Proof.

Let $F_{1}(w)=(f_{1},\ldots,f_{k})$ and $F_{2}(w)=(f^{\prime}_{1},\ldots,f^{\prime}_{v})$ be two inverse Lyndon factorizations of $w$ having the border property. We prove that $F_{1}(w)=F_{2}(w)$ by induction on $|w|$ . If $|w|=1$ , then $F_{1}(w)=F_{2}(w)=(w)$ and the statement clearly holds. Thus assume $|w|>1$ . By definition,

f_{1}\cdots f_{k}=f^{\prime}_{1}\cdots f^{\prime}_{v}=w

(5.1)

If $|f_{k}|=|f^{\prime}_{v}|$ and $v=1$ or $k=1$ , clearly $f_{k}=f^{\prime}_{v}$ and $F_{1}(w)=F_{2}(w)$ . Analogously, if $|f_{k}|=|f^{\prime}_{v}|$ , $v>1$ and $k>1$ , then $f_{k}=f^{\prime}_{v}$ and $F^{\prime}_{1}(w^{\prime})=(f_{1},\ldots,f_{k-1})$ , $F^{\prime}_{2}(w^{\prime})=(f^{\prime}_{1},\ldots,f^{\prime}_{v-1})$ would be two inverse Lyndon factorizations of $w^{\prime}$ having the border property, where $w^{\prime}$ is such that $w=w^{\prime}f_{k}$ . Of course, $|w^{\prime}|<|w|$ . By induction hypothesis, $F^{\prime}_{1}(w^{\prime})=F^{\prime}_{2}(w^{\prime})$ , hence $F_{1}(w)=F_{2}(w)$ .

By contradiction, let $|f_{k}|\not=|f^{\prime}_{v}|$ . Assume $|f_{k}|<|f^{\prime}_{v}|$ (similar arguments apply if $|f_{k}|>|f^{\prime}_{v}|$ ). The word $f_{k}$ is a proper suffix of $f^{\prime}_{v}$ . Clearly $k>1$ . Let $g$ be the smallest integer such that $f_{g+1}\cdots f_{k}$ is a proper suffix of $f^{\prime}_{v}$ , $1\leq g\leq k-1$ , that is,

f^{\prime}_{v}=\alpha f_{g+1}\cdots f_{k}

(5.2)

where $\alpha\in\Sigma^{+}$ is a suffix of $f_{g}$ .

Notice that

\alpha\not\ll f_{g+1}

(5.3)

Indeed, if $\alpha\ll f_{g+1}$ , then, by Eq. (5.2), we would have $f^{\prime}_{v}=\alpha f_{g+1}\cdots f_{k}\ll f_{g+1}\cdots f_{k}$ , which is impossible because $f^{\prime}_{v}$ is an inverse Lyndon word.

The word $\alpha$ is a nonempty proper suffix of $f_{g}$ since otherwise we would have $\alpha=f_{g}\ll f_{g+1}$ , contrary to Eq. (5.3). Since $f_{g}$ is an inverse Lyndon word and $\alpha$ is a nonempty proper suffix of $f_{g}$ , either $\alpha\leq_{p}f_{g}$ or $\alpha\ll f_{g}$ .

If $\alpha\leq_{p}f_{g}$ , then $\alpha$ is a nonempty border of $f_{g}$ . Hence, by Lemma 2, there exists a nonempty prefix $\beta$ of $f_{g+1}$ such that $|\beta|\leq|\alpha|$ and $\alpha\ll\beta$ . Thus, $\alpha\ll f_{g+1}$ by Lemma 1, which contradicts Eq. (5.3). Assume $\alpha\ll f_{g}$ . Since $f_{g}\ll f_{g+1}$ , by Lemma 1 we have $\alpha\ll f_{g+1}$ , which contradicts once again Eq. (5.3). This finishes the proof. ∎
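Proposition 4 can be observed experimentally on short words. The following Python sketch enumerates all factorizations of a word by brute force (exponential time, for illustration only), keeps the inverse Lyndon factorizations, and filters those having the border property. All names are illustrative and the alphabet order is assumed to be the code point order.

```python
def is_inverse_lyndon(w: str) -> bool:
    return len(w) > 0 and all(w[i:] < w for i in range(1, len(w)))


def ll(x: str, y: str) -> bool:
    return x < y and not y.startswith(x)


def has_border_property(factors: tuple[str, ...]) -> bool:
    return all(
        not nxt.startswith(m[:k])
        for m, nxt in zip(factors, factors[1:])
        for k in range(1, len(m))
        if m.endswith(m[:k])
    )


def factorizations(w: str):
    """Yield every way of cutting the nonempty word w into nonempty factors."""
    n = len(w)
    for mask in range(1 << (n - 1)):
        cuts = [0] + [i + 1 for i in range(n - 1) if (mask >> i) & 1] + [n]
        yield tuple(w[a:b] for a, b in zip(cuts, cuts[1:]))


def inverse_lyndon_factorizations(w: str) -> list[tuple[str, ...]]:
    return [
        f for f in factorizations(w)
        if all(is_inverse_lyndon(m) for m in f)
        and all(ll(a, b) for a, b in zip(f, f[1:]))
    ]


if __name__ == "__main__":
    z = "dabdadacddbdc"
    candidates = inverse_lyndon_factorizations(z)
    print([f for f in candidates if has_border_property(f)])
```

On the word $z$ of Example 4 the filter leaves the single factorization $(dab,dadac,ddbdc)$ , in agreement with the proposition.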

6 Groupings and compact factorizations

In this section we prove a structural property of an inverse Lyndon factorization having the border property, namely it is a compact factorization. This result is crucial to characterize the relationship between $\operatorname{CFL}_{in}(w)$ and the factorization into inverse Lyndon words of $w$ . First we report the notion of grouping given in [10]. We refer to [10, 13] for a detailed and complete discussion on this topic.

Let $\operatorname{CFL}_{in}(w)=(\ell_{1},\ldots,\ell_{h})$ , where $\ell_{1}\succeq_{in}\ell_{2}\succeq_{in}\ldots\succeq_{in}\ell_{h}$ . Consider the partial order $\geq_{p}$ , where $x\geq_{p}y$ if $y$ is a prefix of $x$ . Recall that a chain is a set of pairwise comparable elements. We say that a chain is maximal if it is not strictly contained in any other chain. A non-increasing (maximal) chain in $\operatorname{CFL}_{in}(w)$ is the sequence corresponding to a (maximal) chain in the multiset $\{\ell_{1},\ldots,\ell_{h}\}$ with respect to $\geq_{p}$ . We denote by $\mathcal{PMC}$ a non-increasing maximal chain in $\operatorname{CFL}_{in}(w)$ . Looking at the definition of the (inverse) lexicographic order, it is easy to see that a $\mathcal{PMC}$ is a sequence of consecutive factors in $\operatorname{CFL}_{in}(w)$ . Moreover, $\operatorname{CFL}_{in}(w)$ is the concatenation of its $\mathcal{PMC}$ . The formal definitions are given below.

Definition 7.

Let $w\in\Sigma^{+}$ , let $\operatorname{CFL}_{in}(w)=(\ell_{1},\ldots,\ell_{h})$ and let $1\leq r\leq s\leq h$ . We say that $\ell_{r},\ell_{r+1},\ldots,\ell_{s}$ is a non-increasing maximal chain for the prefix order in $\operatorname{CFL}_{in}(w)$ , abbreviated $\mathcal{PMC}$ , if $\ell_{r}\geq_{p}\ell_{r+1}\geq_{p}\ldots\geq_{p}\ell_{s}$ and, moreover, if $r>1$ , then $\ell_{r-1}\not\geq_{p}\ell_{r}$ , and if $s<h$ , then $\ell_{s}\not\geq_{p}\ell_{s+1}$ . Two $\mathcal{PMC}$ $\mathcal{C}_{1}=\ell_{r},\ell_{r+1},\ldots,\ell_{s}$ , $\mathcal{C}_{2}=\ell_{r^{\prime}},\ell_{r^{\prime}+1},\ldots,\ell_{s^{\prime}}$ are consecutive if $r^{\prime}=s+1$ (or $r=s^{\prime}+1$ ).

Definition 8.

Let $w\in\Sigma^{+}$ , let $\operatorname{CFL}_{in}(w)=(\ell_{1},\ldots,\ell_{h})$ . We say that $(\mathcal{C}_{1},\mathcal{C}_{2},\ldots,\mathcal{C}_{s})$ is the decomposition of $\operatorname{CFL}_{in}(w)$ into its non-increasing maximal chains for the prefix order if the following holds

(1) Each $\mathcal{C}_{j}$ is a non-increasing maximal chain in $\operatorname{CFL}_{in}(w)$ .
(2) $\mathcal{C}_{j}$ and $\mathcal{C}_{j+1}$ are consecutive, $1\leq j\leq s-1$ .
(3) $\operatorname{CFL}_{in}(w)$ is the concatenation of the sequences $\mathcal{C}_{1},\mathcal{C}_{2},\ldots,\mathcal{C}_{s}$ .

Definition 9.

Let $w\in\Sigma^{+}$ . We say that $(m_{1},\ldots,m_{k})$ is a grouping of $\operatorname{CFL}_{in}(w)$ if the following holds

(1) $(m_{1},\ldots,m_{k})$ is an inverse Lyndon factorization of $w$ .
(2) Each factor $m_{j}$ is the product of consecutive factors in a $\mathcal{PMC}$ in $\operatorname{CFL}_{in}(w)$ .

Example 5.

Let $\Sigma=\{a,b,c,d\}$ , $a<b<c<d$ , and $w=dabadabdabdadac$ . We have $\operatorname{CFL}_{in}(w)=(daba,dab,dab,dadac)$ . The decomposition of $\operatorname{CFL}_{in}(w)$ into its $\mathcal{PMC}$ is $((daba,dab,dab),(dadac))$ . Moreover, $(daba,dabdab,dadac)$ is a grouping of $\operatorname{CFL}_{in}(w)$ but for the inverse Lyndon factorization $(dabadab,dabda,dac)$ this is no longer true.

Next, let $y=dabadabdabdabdadac$ . We have $\operatorname{CFL}_{in}(y)=(daba,dab,dab,dab,dadac)$ . The decomposition of $\operatorname{CFL}_{in}(y)$ into its $\mathcal{PMC}$ is $((daba,dab,dab,dab),(dadac))$ . Moreover, $(daba,(dab)^{3},dadac)$ and $(dabadab,(dab)^{2},dadac)$ are two groupings of $\operatorname{CFL}_{in}(y)$ .
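The factorizations and chain decompositions of Example 5 can be recomputed with the Python sketch below. It computes $\operatorname{CFL}_{in}$ by a Duval-style scan in which every letter comparison is reversed (our choice for this illustration, assuming the code point order on letters) and then splits the result into maximal chains for the prefix order. The names `cfl_in` and `pmc_decomposition` are illustrative.

```python
def cfl_in(w: str) -> list[str]:
    """Lyndon factorization of w with respect to the inverse letter order."""
    factors, i, n = [], 0, len(w)
    while i < n:
        j, k = i + 1, i
        # Same scan as for CFL, with '<' on letters replaced by '>'.
        while j < n and w[k] >= w[j]:
            k = i if w[k] > w[j] else k + 1
            j += 1
        while i <= k:
            factors.append(w[i:i + j - k])
            i += j - k
    return factors


def pmc_decomposition(factors: list[str]) -> list[list[str]]:
    """Split consecutive factors into maximal chains l_r >=_p l_{r+1} >=_p ... >=_p l_s."""
    chains: list[list[str]] = []
    for f in factors:
        if chains and chains[-1][-1].startswith(f):
            chains[-1].append(f)   # f is a prefix of the previous factor
        else:
            chains.append([f])     # start a new chain
    return chains


if __name__ == "__main__":
    for word in ["dabadabdabdadac", "dabadabdabdabdadac"]:
        print(pmc_decomposition(cfl_in(word)))
```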

For our aims, we need to consider the words that are concatenations of equal factors in $\operatorname{CFL}_{in}$ . This approach leads to a refinement of the partition of $\operatorname{CFL}_{in}$ into non-increasing maximal chains for the prefix order, as defined below.

Definition 10 (Compact sequences).

Let ${\cal C}=(\ell_{1},\ldots,\ell_{h})$ be a non-increasing maximal chain for the prefix order in $\operatorname{CFL}_{in}(w)$ . The decomposition of ${\cal C}$ into maximal compact sequences is the sequence $({\cal G}_{1},\ldots,{\cal G}_{n})$ such that

(1) ${\cal C}=({\cal G}_{1},\ldots,{\cal G}_{n})$ .
(2) For every $i$ , $1\leq i\leq n$ , ${\cal G}_{i}$ consists of the longest sequence of consecutive identical elements in ${\cal C}$ .

Let $(\mathcal{C}_{1},\mathcal{C}_{2},\ldots,\mathcal{C}_{s})$ be the decomposition of $\operatorname{CFL}_{in}(w)$ into its non-increasing maximal chains for the prefix order. The decomposition of $\operatorname{CFL}_{in}(w)$ into its maximal compact sequences is obtained by replacing each $\mathcal{C}_{j}$ in $(\mathcal{C}_{1},\mathcal{C}_{2},\ldots,\mathcal{C}_{s})$ with its decomposition into maximal compact sequences.

Definition 11 (Compact factor).

Let $({\cal G}_{1},\ldots,{\cal G}_{n})$ be the decomposition of $\operatorname{CFL}_{in}(w)$ into its maximal compact sequences. For every $i$ , $1\leq i\leq n$ , the concatenation $g_{i}$ of the elements in ${\cal G}_{i}$ is a compact factor in $\operatorname{CFL}_{in}(w)$ .

Definition 12 (Compact factorization).

Let $w\in\Sigma^{+}$ . We say that $(m_{1},\ldots,m_{k})$ is a compact factorization of $w$ if $(m_{1},\ldots,m_{k})$ is an inverse Lyndon factorization of $w$ and each $m_{j}$ , $1\leq j\leq k$ , is a concatenation of compact factors in $\operatorname{CFL}_{in}(w)$ .

Example 6.

Consider again $y=dabadabdabdabdadac$ over $\Sigma=\{a,b,c,d\}$ , $a<b<c<d$ , as in Example 5. The decomposition of $\operatorname{CFL}_{in}(y)=(daba,dab,dab,dab,dadac)$ into its maximal compact sequences is $((daba),(dab,dab,dab),(dadac))$ . The compact factors in $\operatorname{CFL}_{in}(y)$ are $daba,(dab)^{3},dadac$ . Moreover, $(daba,(dab)^{3},dadac)$ is a compact factorization whereas $(dabadab,(dab)^{2},dadac)$ is a grouping of $\operatorname{CFL}_{in}(y)$ which is not a compact factorization.
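Example 6 can be replayed with the following Python sketch. It takes $\operatorname{CFL}_{in}(y)$ as given in the example, merges runs of identical consecutive factors into compact factors (equal consecutive factors always lie in the same $\mathcal{PMC}$ , so grouping the whole sequence suffices), and tests whether the cut points of a factorization are cut points between compact factors. The helper names are illustrative, and the second function checks only this cut condition, not that the input is an inverse Lyndon factorization.

```python
from itertools import accumulate, groupby


def compact_factors(cfl_in_factors: list[str]) -> list[str]:
    """Concatenate each maximal run of identical consecutive factors."""
    return ["".join(run) for _, run in groupby(cfl_in_factors)]


def respects_compact_factors(factors: list[str], compact: list[str]) -> bool:
    """True if every factor is a concatenation of compact factors.

    Equivalent formulation: every cut position of `factors` is also
    a cut position of `compact`.
    """
    cuts = set(accumulate(len(m) for m in factors))
    compact_cuts = set(accumulate(len(g) for g in compact))
    return cuts <= compact_cuts


if __name__ == "__main__":
    g = compact_factors(["daba", "dab", "dab", "dab", "dadac"])
    print(g)
    print(respects_compact_factors(["daba", "dabdabdab", "dadac"], g))
    print(respects_compact_factors(["dabadab", "dabdab", "dadac"], g))
```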

Proposition 5.

Let $w\in\Sigma^{+}$ . If $(m_{1},\ldots,m_{k})$ is an inverse Lyndon factorization of $w$ having the border property, then $(m_{1},\ldots,m_{k})$ is a compact factorization of $w$ .

Proof.

Let $w\in\Sigma^{+}$ , let $(m_{1},\ldots,m_{k})$ be an inverse Lyndon factorization of $w$ having the border property. Let $\operatorname{CFL}_{in}(w)=(\ell_{1},\ldots,\ell_{h})$ , where $\ell_{1}\succeq_{in}\ell_{2}\succeq_{in}\ldots\succeq_{in}\ell_{h}$ and $\ell_{1},\ldots,\ell_{h}$ are anti-Lyndon words. First we prove that $(m_{1},\ldots,m_{k})$ is a grouping of $\operatorname{CFL}_{in}(w)$ by induction on $|w|$ . If $|w|=1$ the statement clearly holds, thus assume $|w|>1$ .

The words $m_{1}$ and $\ell_{1}$ are comparable for the prefix order, hence either $m_{1}$ is a proper prefix of $\ell_{1}$ or $\ell_{1}$ is a prefix of $m_{1}$ . Suppose that $m_{1}$ is a proper prefix of $\ell_{1}$ . Thus, there are $j$ , $1<j\leq k$ , and $x,y\in\Sigma^{*}$ , $x\not=1$ , such that $m_{j}=xy$ and $\ell_{1}=m_{1}\cdots m_{j-1}x$ . Necessarily $j=2$ , because otherwise $m_{1}\ll m_{j-1}$ , hence, by Lemma 1, $\ell_{1}\ll m_{j-1}x$ and this contradicts the fact that $\ell_{1}$ is an anti-Lyndon word. In conclusion $\ell_{1}=m_{1}x$ and $m_{2}=xy$ . We know that $m_{1}\ll m_{2}$ , that is, there are $r,s,t\in\Sigma^{*}$ , $a,b\in\Sigma$ , such that $a<b$ and $m_{1}=ras$ , $m_{2}=rbt=xy$ . If $|x|\leq|r|$ , then $x$ is a prefix of $r$ , hence $x$ is a nonempty border of $\ell_{1}$ , and if $|x|>|r|$ , then there is a word $t^{\prime}$ such that $x=rbt^{\prime}$ which implies $\ell_{1}\ll x$ . Both cases again contradict the fact that $\ell_{1}$ is an anti-Lyndon word.

Therefore, $\ell_{1}$ is a prefix of $m_{1}$ . Let $i$ be the largest integer such that $m_{1}=\ell_{1}\cdots\ell_{i-1}x$ , $x,y\in\Sigma^{*}$ , $\ell_{i}=xy$ , $1<i\leq h$ , $y\not=1$ . Let $(\mathcal{C}_{1},\mathcal{C}_{2},\ldots,\mathcal{C}_{s})$ be the decomposition of $\operatorname{CFL}_{in}(w)$ into its non-increasing maximal chains for the prefix order. We claim that $\ell_{1}\cdots\ell_{i-1}$ is a prefix of the concatenation of the elements of $\mathcal{C}_{1}$ , thus $(\ell_{1},\ldots,\ell_{i-1})$ is a chain for the prefix order. If $i=1$ we are done. Let $i>1$ . By contradiction, assume that there is $j$ , $1<j<i$ , such that $\ell_{j}\not\in\mathcal{C}_{1}$ . Therefore, $\ell_{1}\ll\ell_{j}$ which implies $m_{1}\ll\ell_{j}\cdots\ell_{i-1}x$ and this contradicts the fact that $m_{1}$ is an inverse Lyndon word.

We now prove that $x=1$ . Assume $x\not=1$ . As a preliminary step, we prove that there is no nonempty prefix $\beta$ of $m_{2}$ such that $|\beta|\leq|x|$ and $x\ll\beta$ . In fact, if such a prefix existed, there would be $r,s,t\in\Sigma^{*}$ , $a,b\in\Sigma$ , such that $a<b$ and $x=ras$ , $\beta=rbt$ . If $|\ell_{i}|\leq|xr|$ then $\ell_{i}=xr^{\prime}=rasr^{\prime}$ , where $r^{\prime}$ would be a nonempty prefix of $r$ , thus a nonempty border of $\ell_{i}$ (recall that $\ell_{i}=xy$ with $y\not=1$ ). If $|\ell_{i}|>|xr|$ , then there would be a word $t^{\prime}$ such that $\ell_{i}=rasrbt^{\prime}$ which would imply $\ell_{i}\ll rbt^{\prime}$ . Both cases contradict the fact that $\ell_{i}$ is an anti-Lyndon word.

If $x\not=1$ , then either $\ell_{i}$ is a prefix of $\ell_{1}$ or $\ell_{1}\ll\ell_{i}$ . If $\ell_{i}$ were a prefix of $\ell_{1}$ , then $x$ would be a nonempty border of $m_{1}$ . By Lemma 2 there would exist a nonempty prefix $\beta$ of $m_{2}$ such that $|\beta|\leq|x|$ and $x\ll\beta$ which contradicts our preliminary step.

If it were true that $\ell_{1}\ll\ell_{i}$ then there would be $r,s,t\in\Sigma^{*}$ , $a,b\in\Sigma$ , such that $a<b$ and $\ell_{1}=ras$ , $\ell_{i}=rbt=xy$ . If $|x|>|r|$ , then there would be a word $t^{\prime}$ such that $x=rbt^{\prime}$ which would imply $m_{1}\ll x$ and this contradicts the fact that $m_{1}$ is an inverse Lyndon word. If $|x|\leq|r|$ , then $x$ is a prefix of $r$ and is a nonempty border of $m_{1}$ . By Lemma 2 again, there would exist a nonempty prefix $\beta$ of $m_{2}$ such that $|\beta|\leq|x|$ and $x\ll\beta$ which contradicts again our preliminary step.

Let $w^{\prime}\in\Sigma^{*}$ be such that $w=m_{1}w^{\prime}$ . If $w^{\prime}=1$ we are done. Assume $w^{\prime}\not=1$ . Clearly $|w^{\prime}|<|w|$ . Of course $(m_{2},\ldots,m_{k})$ is an inverse Lyndon factorization of $w^{\prime}$ having the border property. Moreover, by Corollary 1, $\operatorname{CFL}_{in}(w^{\prime})=(\ell_{i},\ldots,\ell_{h})$ and $(\mathcal{C}^{\prime}_{1},\mathcal{C}_{2},\ldots,\mathcal{C}_{s})$ is the decomposition of $\operatorname{CFL}_{in}(w^{\prime})$ into its non-increasing maximal chains for the prefix order, where $\mathcal{C}^{\prime}_{1}$ is defined by $\mathcal{C}_{1}=(\ell_{1},\ldots,\ell_{i-1},\mathcal{C}^{\prime}_{1})$ . By the induction hypothesis, $(m_{2},\ldots,m_{k})$ is a grouping of $\operatorname{CFL}_{in}(w^{\prime})$ and consequently $(m_{1},\ldots,m_{k})$ is a grouping of $\operatorname{CFL}_{in}(w)$ .

Finally, to obtain a contradiction, suppose that $(m_{1},\ldots,m_{k})$ is a grouping of $\operatorname{CFL}_{in}(w)$ having the border property such that $(m_{1},\ldots,m_{k})$ is not a compact factorization of $w$ . To adapt the notation to the proof, set $\operatorname{CFL}_{in}(w)=(\ell_{1}^{n_{1}},\ldots,\ell_{r}^{n_{r}})$ , where $r>0$ , $n_{1},\ldots,n_{r}\geq 1$ and $\ell_{1},\ldots,\ell_{r}$ are anti-Lyndon words. By Definitions 9 and 12, there exist integers $j,h,p_{h},q_{h}$ , $1\leq j\leq k-1$ , $1\leq h\leq r$ , $p_{h}\geq 1$ , $q_{h}\geq 1$ , $p_{h}+q_{h}\leq n_{h}$ , such that $m_{j}$ ends with $\ell_{h}^{p_{h}}$ and $m_{j+1}$ starts with $\ell_{h}^{q_{h}}$ . Thus, by Definition 9, $\ell_{h}$ is a prefix of $m_{j}$ . Moreover, $\ell_{h}$ is a proper prefix of $m_{j}$ . Indeed, otherwise $\ell_{h}=m_{j}\leq_{p}m_{j+1}$ , which is impossible because $m_{j}\ll m_{j+1}$ (recall that $(m_{1},\ldots,m_{k})$ is an inverse Lyndon factorization). Thus $\ell_{h}$ is a nonempty border of $m_{j}$ . The word $\ell_{h}$ is also a prefix of $m_{j+1}$ and this contradicts the fact that $(m_{1},\ldots,m_{k})$ has the border property. ∎

7 The canonical inverse Lyndon factorization: the algorithm

In this section we state another relevant result of the paper related to the main one stated in Section 5. We have shown that a nonempty word $w$ can have more than one inverse Lyndon factorization but $w$ has a unique inverse Lyndon factorization with the border property (Example 4, Proposition 4). Below we highlight that this unique factorization is the canonical one defined in [10, 13].

This special inverse Lyndon factorization is denoted by $\operatorname{ICFL}$ because it is the counterpart of the Lyndon factorization $\operatorname{CFL}$ of $w$ , when we use (I)nverse Lyndon words as factors. Indeed, in [10] it has been proved that $\operatorname{ICFL}(w)$ can be computed in linear time and that it is uniquely determined for a word $w$ . See Section A for the definitions of $\operatorname{ICFL}$ and all related notions. Since $\operatorname{ICFL}(w)$ is the unique inverse Lyndon factorization with the border property, from now on these two notions will be synonymous.

Below we show another interesting property of $\operatorname{ICFL}$ : the last factor of the factorization is the longest suffix that is an inverse Lyndon word. Based on this result we provide a new, simpler linear-time algorithm for computing $\operatorname{ICFL}$ .

We begin by recalling previously proved results on $\operatorname{ICFL}$ , namely Proposition 7.7 in [10] and Proposition 9.5 in [13]. They are merged into Proposition 6.

Proposition 6.

For any $w\in\Sigma^{+}$ , $\operatorname{ICFL}(w)$ is a grouping of $\operatorname{CFL}_{in}(w)$ . Moreover, $\operatorname{ICFL}(w)$ has the border property.

Corollary 2 is a direct consequence of Propositions 4, 5 and 6.

Corollary 2.

For each $w\in\Sigma^{+}$ , $\operatorname{ICFL}(w)$ is a compact factorization and it is the unique inverse Lyndon factorization of $w$ having the border property.
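The two conditions in Corollary 2 can be checked mechanically on small inputs. The following is a minimal brute-force sketch, not part of the paper: the function names are hypothetical, words are assumed to be nonempty Python strings whose character order coincides with the alphabet order, and an inverse Lyndon factorization is taken to be a sequence of inverse Lyndon words with $m_{i}\ll m_{i+1}$ .

```python
def is_inverse_lyndon(w):
    """True if every nonempty proper suffix of w is strictly smaller than w."""
    return all(w[i:] < w for i in range(1, len(w)))


def strictly_less(x, y):
    """x << y: x = r a s and y = r b t with a < b (decided by a mismatch, never by a prefix)."""
    for a, b in zip(x, y):
        if a != b:
            return a < b
    return False


def borders(w):
    """All nonempty borders of w (proper prefixes that are also suffixes), brute force."""
    return [w[:n] for n in range(1, len(w)) if w.endswith(w[:n])]


def is_inverse_lyndon_factorization(factors):
    """Each factor is inverse Lyndon and consecutive factors satisfy m_i << m_{i+1}."""
    return (all(is_inverse_lyndon(m) for m in factors)
            and all(strictly_less(x, y) for x, y in zip(factors, factors[1:])))


def has_border_property(factors):
    """No nonempty border of a factor is a prefix of the next factor."""
    return all(not nxt.startswith(z)
               for cur, nxt in zip(factors, factors[1:])
               for z in borders(cur))
```

On the factorization $(daba,dabdab,dadac)$ of Example 7 both checks succeed, whereas the arbitrary sequence $(dabda,dac)$ fails the border check because the border $da$ of $dabda$ is a prefix of $dac$ .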

We end the section with a result which has been proved in [25] and which will be used in the next section.

Proposition 7.

Let $w\in\Sigma^{+}$ , let $\operatorname{CFL}_{in}(w)=(\ell_{1},\ldots,\ell_{h})$ and let $(\mathcal{C}_{1},\mathcal{C}_{2},\ldots,\mathcal{C}_{s})$ be the decomposition of $\operatorname{CFL}_{in}(w)$ into its non-increasing maximal chains for the prefix order. Let $w_{1},\ldots,w_{s}$ be words such that $\operatorname{CFL}_{in}(w_{j})=\mathcal{C}_{j}$ , $1\leq j\leq s$ . Then $\operatorname{ICFL}(w)$ is the concatenation of the sequences $\operatorname{ICFL}(w_{1}),\ldots,\operatorname{ICFL}(w_{s})$ , that is,

$$\operatorname{ICFL}(w)=(\operatorname{ICFL}(w_{1}),\ldots,\operatorname{ICFL}(w_{s}))\tag{7.1}$$
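As an illustration of the decomposition used in Proposition 7, the sketch below splits a sequence of factors into maximal runs in which each factor is a prefix of the previous one. It is a hedged example: `prefix_chains` is a hypothetical name, and the input is assumed to be $\operatorname{CFL}_{in}(w)$ given as a list of Python strings.

```python
def prefix_chains(factors):
    """Split factors into maximal non-increasing chains for the prefix order.

    A factor extends the current chain when it is a prefix of the chain's
    last element; otherwise it opens a new chain.
    """
    chains = []
    for f in factors:
        if chains and chains[-1][-1].startswith(f):
            chains[-1].append(f)
        else:
            chains.append([f])
    return chains


# CFL_in(w) for w = dabadabdabdadac, as given in Example 7.
chains = prefix_chains(["daba", "dab", "dab", "dadac"])
words = ["".join(chain) for chain in chains]  # the words w_1, ..., w_s
```

Here the chains are $(daba,dab,dab)$ and $(dadac)$ , so $w_{1}=dabadabdab$ and $w_{2}=dadac$ .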

We can now state some results useful to prove the correctness of our algorithm. First we observe that, thanks to Corollary 2 and Proposition 7, to compute $\operatorname{ICFL}$ we can limit ourselves to the case in which $\operatorname{CFL}_{in}$ is a chain with respect to the prefix order.

Lemma 3.

Let $\ell_{1},\ldots,\ell_{h}$ be anti-Lyndon words over $\Sigma$ that form a non-increasing chain for the prefix order, that is, $\ell_{1}\geq_{p}\ell_{2}\geq_{p}\ldots\geq_{p}\ell_{h}$ . If $\ell_{1}\not=\ell_{2}$ , then $\ell_{1}\not<_{p}\ell_{2}\cdots\ell_{h}$ .

Proof.

By contradiction, assume that $\ell_{1}$ is a prefix of $\ell_{2}\cdots\ell_{h}$ . Then, $\ell_{1}=\ell_{2}\cdots\ell_{t}z$ where either $z=1$ and $2<t\leq h$ or $z$ is a nonempty prefix of $\ell_{t+1}$ , $2\leq t<h$ . Thus either $\ell_{t}$ or $z$ is a nonempty border of $\ell_{1}$ , a contradiction in both cases. ∎

Remark 1.

[13] Let $x,y$ be two different borders of the same word $w\in\Sigma^{+}$ . If $x$ is shorter than $y$ , then $x$ is a border of $y$ .
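Remark 1 is the fact that makes the classical prefix function (the failure function of Knuth-Morris-Pratt) enumerate all borders of a word: the next shorter border of $w$ is the longest border of the current one. The following sketch is an illustration only, with hypothetical function names, and lists the nonempty borders from longest to shortest.

```python
def prefix_function(w):
    """pi[i] = length of the longest border of w[:i+1]."""
    pi = [0] * len(w)
    k = 0
    for i in range(1, len(w)):
        while k and w[i] != w[k]:
            k = pi[k - 1]
        if w[i] == w[k]:
            k += 1
        pi[i] = k
    return pi


def borders(w):
    """Nonempty borders of w, longest first, obtained by following the prefix function."""
    pi = prefix_function(w)
    out = []
    k = pi[-1] if w else 0
    while k:
        out.append(w[:k])
        k = pi[k - 1]
    return out


def borders_are_nested(w):
    """Check Remark 1 on w: every shorter border is a border of every longer one."""
    bs = borders(w)
    return all(bs[i].startswith(bs[j]) and bs[i].endswith(bs[j])
               for i in range(len(bs)) for j in range(i + 1, len(bs)))
```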

Proposition 8.

Let $w\in\Sigma^{+}$ and assume that $\operatorname{CFL}_{in}(w)$ forms a non-increasing chain for the prefix order. If $(m_{1},\ldots,m_{k})$ is a factorization of $w$ such that each $m_{j}$ , $1\leq j\leq k$ , is a concatenation of compact factors in $\operatorname{CFL}_{in}(w)$ , then $(m_{1},\ldots,m_{k})$ has the border property.

Proof.

Let $w\in\Sigma^{+}$ and assume that $\operatorname{CFL}_{in}(w)$ forms a non-increasing chain for the prefix order. Let $(m_{1},\ldots,m_{k})$ be a factorization of $w$ such that each $m_{j}$ , $1\leq j\leq k$ , is a concatenation of compact factors in $\operatorname{CFL}_{in}(w)$ . The proof is by induction on $k$ . If $k=1$ , then the conclusion follows immediately. Assume $k>1$ .

Let $w^{\prime}\in\Sigma^{+}$ be such that $w=m_{1}w^{\prime}$ . It is clear that $(m_{2},\ldots,m_{k})$ is a factorization of $w^{\prime}$ such that each $m_{j}$ , $2\leq j\leq k$ , is a concatenation of compact factors in $\operatorname{CFL}_{in}(w^{\prime})$ . Thus, by induction hypothesis, $(m_{2},\ldots,m_{k})$ has the border property. It remains to prove that each nonempty border of $m_{1}$ is not a prefix of $m_{2}$ . The proof is straightforward if $m_{1}$ is unbordered, thus assume that $m_{1}$ is bordered.

Let $\operatorname{CFL}_{in}(w)=(\ell_{1}^{n_{1}},\ldots,\ell_{r}^{n_{r}})$ , where $\ell_{1}^{n_{1}},\ldots,\ell_{r}^{n_{r}}$ are the compact factors in $\operatorname{CFL}_{in}(w)$ , that is, $\ell_{1},\ldots,\ell_{r}$ are anti-Lyndon words such that $\ell_{1}\geq_{p}\ldots\geq_{p}\ell_{r}$ . Since $m_{1}$ is a concatenation of compact factors in $\operatorname{CFL}_{in}(w)$ , there is $h$ , $1\leq h<r$ , such that

$$m_{1}=\ell_{1}^{n_{1}}\cdots\ell_{h}^{n_{h}}$$

Notice that $\ell_{h}$ is a nonempty border of $m_{1}$ . Furthermore, since $\ell_{h}$ is unbordered, $\ell_{h}$ is the shortest nonempty border of $m_{1}$ .

If there were a word $z$ which is a nonempty border of $m_{1}$ and also a prefix of $m_{2}$ , by Remark 1, $\ell_{h}$ would be a prefix of $m_{2}$ . Therefore, $\ell_{h}$ would be a prefix of the word $\ell_{h+1}^{n_{h+1}}\cdots\ell_{r}^{n_{r}}$ which contradicts Lemma 3. ∎

Proposition 9.

Let $w\in\Sigma^{+}$ and let $\operatorname{ICFL}(w)=(m_{1},\ldots,m_{k})$ be the unique inverse Lyndon factorization of $w$ having the border property. Then $m_{k}$ is the longest suffix of $w$ which is an inverse Lyndon word.

Proof.

Let $w\in\Sigma^{+}$ and let $(m_{1},\dots,m_{k})$ be the unique inverse Lyndon factorization of $w$ having the border property. If $k=1$ we are done. Thus suppose $k>1$ . By contradiction, suppose that $m_{k}$ is not the longest suffix of $w$ that is an inverse Lyndon word. Let $s$ be this longest suffix. Thus, there exists a nonempty suffix $x$ of $m_{j}$ , $1\leq j<k$ , such that $s=xm_{j+1}\cdots m_{k}$ . Furthermore, $x$ must be a proper suffix of $m_{j}$ , or we would have $s=m_{j}\cdots m_{k}\ll m_{j+1}\cdots m_{k}$ , contradicting the hypothesis that $s$ is inverse Lyndon.

We claim that $x\ll m_{j+1}$ . Indeed, since $m_{j}$ is an inverse Lyndon word, it holds $x\preceq m_{j}$ . Thus, if $x\ll m_{j}$ or $x=m_{j}$ , it immediately follows that $x\ll m_{j+1}$ . Otherwise, $x\leq_{p}m_{j}$ and $x$ is a nonempty border of $m_{j}$ . By Lemma 2 applied to $(m_{1},\dots,m_{k})$ , with $x=\alpha$ , there must exist a prefix $\beta$ of $m_{j+1}$ such that $x\ll\beta$ , hence $x\ll m_{j+1}$ .

Since $x\ll m_{j+1}$ , we have $s=xm_{j+1}\cdots m_{k}\ll m_{j+1}\cdots m_{k}$ , contradicting the hypothesis that $s$ is an inverse Lyndon word. ∎
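Proposition 9 can be observed on the words of Example 7 with a brute-force search. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, words are nonempty Python strings ordered as the alphabet, and the running time is far from the linear bound discussed later.

```python
def is_inverse_lyndon(w):
    """True if every nonempty proper suffix of w is strictly smaller than w."""
    return all(w[i:] < w for i in range(1, len(w)))


def longest_inverse_lyndon_suffix(w):
    """Scan suffixes from longest to shortest and return the first inverse Lyndon one.

    A single letter is always inverse Lyndon, so the search succeeds for nonempty w.
    """
    for i in range(len(w)):
        if is_inverse_lyndon(w[i:]):
            return w[i:]
    raise ValueError("w must be nonempty")
```

For $w=dabadabdabdadac$ the search returns $dadac$ , and for $z=dabdadacddbdc$ it returns $ddbdc$ , which are the last factors of $\operatorname{ICFL}(w)$ and $\operatorname{ICFL}(z)$ reported in Example 7.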

Proposition 10.

Let $w\in\Sigma^{+}$ be an inverse Lyndon word, and let $\ell\in\Sigma^{+}$ be an anti-Lyndon word. Then:

1. If $\ell\ll w$ , then for every $k\geq 1$ , $\ell^{k}w$ is not an inverse Lyndon word.
2. If $\ell w$ is not an inverse Lyndon word, then $\ell\ll w$ . Furthermore, for every $k\geq 1$ , $w$ is the longest suffix of $\ell^{k}w$ that is an inverse Lyndon word.

Proof.

By Lemma 1, the proof of item 1 is immediate. Suppose $\ell w$ is not inverse Lyndon. Then, there exists a proper suffix $s$ of $\ell w$ such that $\ell w\preceq s$ , hence $\ell w\ll s$ . Since $\ell$ is anti-Lyndon, for every proper suffix $x$ of $\ell$ it follows $x\ll\ell$ and consequently $xw\ll\ell w$ . Thus, $s$ must be a suffix of $w$ . Since $w$ is an inverse Lyndon word, one of the following three cases holds: (1) $w=s$ ; (2) $s<_{p}w$ ; (3) $s\ll w$ . By $\ell w\ll s$ , in each of the three cases it is evident that $\ell w\ll w$ . Thus there are $r,t,t^{\prime}\in\Sigma^{*}$ and $a,b\in\Sigma$ with $a<b$ such that $\ell w=rat$ , $w=rbt^{\prime}$ . If $|\ell|\geq|ra|$ , then clearly $\ell\ll w$ . Otherwise, $|\ell|\leq|r|$ and there is $r^{\prime}\in\Sigma^{*}$ such that $r=\ell r^{\prime}$ . Consequently, $w$ starts with $r^{\prime}a$ . On the other hand, $r^{\prime}$ is a border of $r$ , hence $w=\ell r^{\prime}bt^{\prime}$ and $r^{\prime}bt^{\prime}$ is a suffix of $w$ . This contradicts the fact that $w$ is an inverse Lyndon word.

For every $k\geq 1$ , $w$ is a suffix of $\ell^{k}w$ that is an inverse Lyndon word. Let $x$ be a proper nonempty suffix of $\ell$ . Of course $x\ll\ell$ . The word $xw$ is not an inverse Lyndon word, otherwise we would have $\ell\ll w\preceq xw\ll\ell w$ , a contradiction. Moreover, by Lemma 1, for any $j$ , $1\leq j<k$ , we have $x\ell^{j}w\ll\ell^{j}w$ and $x\ell^{j}w$ is not an inverse Lyndon word. Finally, by item 1, $\ell^{k}w$ is not an inverse Lyndon word. ∎
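As a sanity check, and not as a substitute for the proof, both items of Proposition 10 can be tested exhaustively on short words. The sketch below assumes a two-letter alphabet $a<b$ and small bounds on the word length and on $k$ ; all names are hypothetical, and anti-Lyndon words are recognized through the property used in the proof, namely that every nonempty proper suffix $x$ of $\ell$ satisfies $x\ll\ell$ .

```python
from itertools import product


def is_inverse_lyndon(w):
    return all(w[i:] < w for i in range(1, len(w)))


def strictly_less(x, y):
    """x << y: the first mismatch exists and has the smaller letter in x."""
    for a, b in zip(x, y):
        if a != b:
            return a < b
    return False


def is_anti_lyndon(w):
    """Every nonempty proper suffix x of w satisfies x << w."""
    return all(strictly_less(w[i:], w) for i in range(1, len(w)))


def longest_inverse_lyndon_suffix(w):
    return next(w[i:] for i in range(len(w)) if is_inverse_lyndon(w[i:]))


def words(alphabet, max_len):
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def check_proposition_10(alphabet="ab", max_len=4, max_k=2):
    """Return False on the first counterexample within the bounds, True otherwise."""
    ks = range(1, max_k + 1)
    for l in filter(is_anti_lyndon, words(alphabet, max_len)):
        for w in filter(is_inverse_lyndon, words(alphabet, max_len)):
            # Item 1: l << w implies that l^k w is not inverse Lyndon.
            if strictly_less(l, w) and any(is_inverse_lyndon(l * k + w) for k in ks):
                return False
            # Item 2: l w not inverse Lyndon implies l << w, and w is the
            # longest inverse Lyndon suffix of l^k w.
            if not is_inverse_lyndon(l + w):
                if not strictly_less(l, w):
                    return False
                if any(longest_inverse_lyndon_suffix(l * k + w) != w for k in ks):
                    return False
    return True
```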

Algorithm 1 Compute $\operatorname{ICFL}(w)$ , the unique compact factorization of $w$ having the border property.

1: function Factorize( $w$ )
2:     $(\ell_{1}^{e_{1}},\dots,\ell_{n}^{e_{n}})\leftarrow\textsc{CompactFactors}(w)$    $\triangleright$ Compute compact factors of $w$
3:     $\mathcal{F}\leftarrow\varnothing$
4:     $m^{\prime}\leftarrow\ell_{n}^{e_{n}}$
5:     for $t=n-1\textbf{ downto }1$ do    $\triangleright$ Work one compact factor at a time
6:         if $\ell_{t}\ll m^{\prime}$ then    $\triangleright$ Proposition 10
7:             $\mathcal{F}\leftarrow(m^{\prime},\mathcal{F})$
8:             $m^{\prime}\leftarrow\ell_{t}^{e_{t}}$
9:         else
10:             $m^{\prime}\leftarrow\ell_{t}^{e_{t}}\cdot m^{\prime}$
11:     $\mathcal{F}\leftarrow(m^{\prime},\mathcal{F})$
12:     return $\mathcal{F}$

We now describe Algorithm 1. Function $\textsc{Factorize}(w)$ computes the unique compact factorization of $w$ having the border property. First, at line 2, the decomposition of $w$ into its compact factors is computed. Then, the factorization of $w$ is carried out from right to left. Specifically, in accordance with Proposition 9, the for-loop at lines 5–10 searches for the longest suffix $m^{\prime}$ of $w$ that is an inverse Lyndon word. The update of $m^{\prime}$ is managed by iteratively applying Proposition 10 at line 6. Once such a longest suffix is found (that is, when the condition at line 6 is true), it is added to the growing factorization $\mathcal{F}$ and a new search is started for the longest suffix of the remaining portion of the string. Otherwise, at line 10, the suffix is extended. In the end, the complete factorization is returned.
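The description above can be turned into a short program. The following Python sketch is one possible rendering, not the authors' implementation: it assumes that words are nonempty strings whose character order is the alphabet order, it realizes $\textsc{CompactFactors}$ by running Duval's algorithm with all letter comparisons reversed (the names `compact_factors`, `strictly_less` and `factorize` are hypothetical), and it uses string slices for readability instead of the index pairs needed for the linear bound.

```python
def compact_factors(w):
    """Compact factors (l, e) of CFL_in(w): Duval's algorithm under the inverse letter order."""
    out, i, n = [], 0, len(w)
    while i < n:
        j, k = i + 1, i
        while j < n and w[k] >= w[j]:
            k = i if w[k] > w[j] else k + 1
            j += 1
        period = j - k                    # length of the anti-Lyndon factor
        e = (k - i) // period + 1         # how many times it repeats
        out.append((w[i:i + period], e))
        i += e * period
    return out


def strictly_less(x, y):
    """x << y: the first mismatch exists and has the smaller letter in x."""
    for a, b in zip(x, y):
        if a != b:
            return a < b
    return False


def factorize(w):
    """Right-to-left scan of the compact factors, following lines 1-12 of Algorithm 1."""
    factors = compact_factors(w)
    l, e = factors[-1]
    m = l * e                             # current candidate for the last open factor
    result = []                           # closed factors, collected right to left
    for l, e in reversed(factors[:-1]):
        if strictly_less(l, m):           # line 6: close m and start a new factor
            result.append(m)
            m = l * e
        else:                             # line 10: extend m to the left
            m = l * e + m
    result.append(m)
    result.reverse()
    return result
```

On the words of Example 7, `factorize("dabadabdabdadac")` yields `["daba", "dabdab", "dadac"]` and `factorize("dabdadacddbdc")` yields `["dab", "dadac", "ddbdc"]`. A case where line 10 is taken is `factorize("dabda")`, which returns the single factor `["dabda"]`.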

7.1 Correctness and complexity

We now prove that Algorithm 1 is correct, that is that it will compute the unique inverse Lyndon factorization of $w$ having the border property, namely $\operatorname{ICFL}(w)$ . Formally:

Lemma 4.

Let $w\in\Sigma^{+}$ , and let $\mathcal{F}$ be the result of $\textsc{Factorize}(w)$ . Then, $\mathcal{F}=\operatorname{ICFL}(w)$ .

Proof.

Let $(\ell_{1}^{e_{1}},\dots,\ell_{n}^{e_{n}})$ be the decomposition of $w$ into its compact factors, and let $L_{t}=\ell_{t}^{e_{t}}\cdots\ell_{n}^{e_{n}}$ . We will denote by $m^{\prime}_{t}$ (resp. $\mathcal{F}_{t}$ ) the value of $m^{\prime}$ (resp. $\mathcal{F}$ ) at the end of iteration $t$ . We will prove the following loop invariant: at the end of iteration $t$ , sequence $(m^{\prime}_{t},\mathcal{F}_{t})$ is a compact factorization of $L_{t}$ having the border property. The claimed result will follow by Corollary 2.

Initialization: Prior to entering the loop, $(m^{\prime}_{n},\mathcal{F}_{n})=(\ell_{n}^{e_{n}})$ , where the last equality follows from Proposition 9.
Maintenance: Let $t\leq n-1$ . By the induction hypothesis, $\operatorname{ICFL}(L_{t+1})=(m^{\prime}_{t+1},\mathcal{F}_{t+1})$ .

Suppose $\ell_{t}\ll m^{\prime}_{t+1}$ . Then, by item 1 of Proposition 10, $\ell_{t}\cdot m^{\prime}_{t+1}$ is not inverse Lyndon and, by item 2, $m^{\prime}_{t+1}$ is the longest suffix of $\ell_{t}^{e_{t}}\cdot m^{\prime}_{t+1}$ that is an inverse Lyndon word. Thus, by Proposition 9, $m^{\prime}_{t+1}$ is the last factor of any compact factorization of $\ell_{t}^{e_{t}}\cdot m^{\prime}_{t+1}$ . Hence, $(m^{\prime}_{t},\mathcal{F}_{t})=(\ell_{t}^{e_{t}},m^{\prime}_{t+1},\mathcal{F}_{t+1})$ is a compact factorization of $L_{t}$ having the border property.

Now, consider the case where $\ell_{t}\not\ll m^{\prime}_{t+1}$ . Then, by the contrapositive of item 2 of Proposition 10, $\ell_{t}\cdot m^{\prime}_{t+1}$ is inverse Lyndon and thus, again by item 2 of Proposition 10, $\ell_{t}^{e_{t}}\cdot m^{\prime}_{t+1}$ is inverse Lyndon. Therefore, $(m^{\prime}_{t},\mathcal{F}_{t})=(\ell_{t}^{e_{t}}\cdot m^{\prime}_{t+1},\mathcal{F}_{t+1})$ is a compact factorization of $L_{t}$ having the border property.
Termination: After iteration $t=1$ , sequence $(m^{\prime}_{1},\mathcal{F}_{1})=\operatorname{ICFL}(L_{1})=\operatorname{ICFL}(w)$ .

Finally, line 11 sets $\mathcal{F}=(m^{\prime}_{1},\mathcal{F}_{1})=\operatorname{ICFL}(w)$ . ∎

Function $\textsc{Factorize}(w)$ has time complexity that is linear in the length of $w$ . Indeed, the sequence of compact factors obtained at line 2 can be computed in linear time in the length of $w$ by a simple modification of Duval’s algorithm (see [17]). After that, each iteration $t$ of loop 5–10 can be implemented to run in time $\mathcal{O}(|\ell_{t}|)$ . Indeed, condition $\ell_{t}\ll m^{\prime}$ can be checked by naively comparing $\ell_{t}$ against $m^{\prime}$ . Furthermore, the update of $m^{\prime}$ and $\mathcal{F}$ can be done in constant time: in fact, $\ell_{t}$ , $\ell_{t}^{e_{t}}$ , $m^{\prime}$ and $\mathcal{F}$ can all be implemented as pairs of indexes (in case of the former three) or as a list of indexes (in case of the latter) of $w$ .

8 Conclusions

We have uncovered a special connection between the Lyndon factorization under the inverse lexicographic ordering, named $\operatorname{CFL}_{in}$ , and the canonical inverse Lyndon factorization, named $\operatorname{ICFL}$ : there exists a unique inverse Lyndon factorization having the border property and this unique factorization is $\operatorname{ICFL}$ . Moreover, each inverse factor of $\operatorname{ICFL}$ is obtained by concatenating compact factors of $\operatorname{CFL}_{in}$ . These properties give a constrained structure to $\operatorname{ICFL}$ that deserves to be further explored to characterize properties of words. In particular, we believe the characterization of $\operatorname{ICFL}$ as a compact factorization, proved in the paper, could highlight novel properties related to the compression of a word, as investigated in [26]. Moreover, the number of compact factors seems to be a measure of the repetitiveness of a word that could also be used to speed up suffix sorting.

Finally, we believe that the characterization of $\operatorname{ICFL}$ in terms of $\operatorname{CFL}_{in}$ may be used to extend to $\operatorname{ICFL}$ the conservation property proved in [13] for $\operatorname{CFL}$ . This property shows that the Lyndon factorization of a word $w$ preserves common factors with the factorization of a superstring of $w$ . This extends the conservation of Lyndon factors explored for the product $u\cdot v$ of two words $u$ and $v$ [26, 27].

Acknowledgments

This research was supported by the European Union’s Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie grant agreement PANGAIA No. 872539, by MUR 2022YRB97K, PINC, Pangenome Informatics: from Theory to Applications, and by INdAM-GNCS Project 2023

References

[1] Kuo-Tsai Chen, Ralph H. Fox, and Roger C. Lyndon. Free Differential calculus, IV. The Quotient Groups of the Lower Central Series. Ann. Math., 68:81–95, 1958.
[2] Roger Lyndon. On Burnside’s problem. Trans. Amer. Math. Soc., 77:202–215, 1954.
[3] Hideo Bannai, Juha Kärkkäinen, Dominik Köppl, and Marcin Piatkowski. Constructing and indexing the bijective and extended Burrows-Wheeler transform. Information and Computation, 297:105153, 2024.
[4] Elena Biagi, Davide Cenzato, Zsuzsanna Lipták, and Giuseppe Romana. On the number of equal-letter runs of the Bijective Burrows-Wheeler Transform. In CEUR Workshop Proceedings, volume 3587, pages 129–142. R. Piskac c/o Redaktion Sun SITE, Informatik V, RWTH Aachen, 2023.
[5] Dominik Köppl, Daiki Hashimoto, Diptarama Hendrian, and Ayumi Shinohara. In-place bijective Burrows-Wheeler Transforms. In Inge Li Gørtz and Oren Weimann, editors, 31st Annual Symposium on Combinatorial Pattern Matching, CPM 2020, June 17-19, 2020, Copenhagen, Denmark, volume 161 of LIPIcs, pages 21:1–21:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
[6] Nico Bertram, Jonas Ellert, and Johannes Fischer. Lyndon words accelerate suffix sorting. In Petra Mutzel, Rasmus Pagh, and Grzegorz Herman, editors, 29th Annual European Symposium on Algorithms, ESA 2021, September 6-8, 2021, Lisbon, Portugal (Virtual Conference), volume 204 of LIPIcs, pages 15:1–15:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
[7] Olivier Delgrange and Eric Rivals. Star: an algorithm to search for tandem approximate repeats. Bioinformatics, 20(16):2812–2820, 2004.
[8] Igor Martayan, Bastien Cazaux, Antoine Limasset, and Camille Marchet. Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets. bioRxiv, 2024.
[9] Paola Bonizzoni, Matteo Costantini, Clelia De Felice, Alessia Petescia, Yuri Pirola, Marco Previtali, Raffaella Rizzi, Jens Stoye, Rocco Zaccagnino, and Rosalba Zizza. Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches. Inf. Sci., 607:458–476, 2022.
[10] Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, and Rosalba Zizza. Inverse Lyndon words and inverse Lyndon factorizations of words. Adv. Appl. Math., 101:281–319, 2018.
[11] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. Suffix array and Lyndon factorization of a text. J. Discrete Algorithms, 28:2–8, 2014.
[12] Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, and Rosalba Zizza. Lyndon words versus inverse Lyndon words: Queries on suffixes and bordered words. In Alberto Leporati, Carlos Martín-Vide, Dana Shapira, and Claudio Zandron, editors, Language and Automata Theory and Applications - 14th International Conference, LATA 2020, Milan, Italy, March 4-6, 2020, Proceedings, volume 12038 of Lecture Notes in Computer Science, pages 385–396. Springer, 2020.
[13] Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, and Rosalba Zizza. On the longest common prefix of suffixes in an inverse Lyndon factorization and other properties. Theor. Comput. Sci., 862:24–41, 2021.
[14] Jean Berstel, Dominique Perrin, and Christophe Reutenauer. Codes and Automata. Encyclopedia of Mathematics and its Applications 129, Cambridge University Press, 2009.
[15] Christian Choffrut and Juhani Karhumäki. Combinatorics of Words. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, Vol. 1, pages 329–438. Springer-Verlag, Berlin, Heidelberg, 1997.
[16] M. Lothaire. Algebraic Combinatorics on Words, Encyclopedia Math. Appl., volume 90. Cambridge University Press, 1997.
[17] M. Lothaire. Applied Combinatorics on Words. Cambridge University Press, 2005.
[18] Christophe Reutenauer. Free Lie algebras. In Handbook of Algebra, London Mathematical Society Monographs. Oxford Science Publications, 1993.
[19] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
[20] Hideo Bannai, I Tomohiro, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. A new characterization of maximal repetitions by Lyndon trees. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pages 562–571, 2015.
[21] Jean-Pierre Duval. Factorizing Words over an Ordered Alphabet. J. Algorithms, 4(4):363–381, 1983.
[22] Harold Fredricksen and James Maiorana. Necklaces of beads in $k$ colors and $k$ -ary de Bruijn sequences. Discrete Math., 23(3):207–210, 1978.
[23] Sukhpal Singh Ghuman, Emanuele Giaquinta, and Jorma Tarhio. Alternative Algorithms for Lyndon Factorization. In Proceedings of the Prague Stringology Conference 2014, Prague, Czech Republic, September 1-3, 2014, pages 169–178, 2014.
[24] Daniele A. Gewurz and Francesca Merola. Numeration and enumeration. Eur. J. Comb., 33(7):1547–1556, 2012.
[25] Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, and Rosalba Zizza. From the Lyndon factorization to the Canonical Inverse Lyndon factorization: back and forth. under submission, ArXiv, 2024.
[26] Tomohiro I, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Faster Lyndon factorization algorithms for SLP and LZ78 compressed text. Theoretical Computer Science, 656:215–224, 2016.
[27] Alberto Apostolico and Maxime Crochemore. Fast parallel Lyndon factorization with applications. Mathematical systems theory, 28(2):89–108, 1995.

Appendix A The canonical inverse Lyndon factorization

In this section we summarize the relevant material on the canonical inverse Lyndon factorization and we refer to [10, 13] for a thorough discussion on this topic.

If $w$ is an inverse Lyndon word, then $\operatorname{ICFL}(w)=(w)$ . Otherwise, $\operatorname{ICFL}(w)$ is recursively defined. The first factor of $\operatorname{ICFL}(w)$ is obtained from a special pair $(p,\overline{p})$ of words, named the canonical pair associated with $w$ , which in turn is obtained from the shortest nonempty prefix $z$ of $w$ such that $z$ is not an inverse Lyndon word. Proposition 6.2 in [13] provides the following characterization of the pair $(p,\overline{p})$ .

Proposition 11.

Let $w\in\Sigma^{+}$ be a word which is not an inverse Lyndon word. A pair of words $(p,\overline{p})$ is the canonical pair associated with $w$ if and only if the following conditions are satisfied.

(1) $z=p\overline{p}$ is the shortest nonempty prefix of $w$ which is not an inverse Lyndon word.
(2) $p=ras$ and $\overline{p}=rb$ , where $r,s\in\Sigma^{*}$ , $a,b\in\Sigma$ and $r$ is the shortest prefix of $p\overline{p}$ such that $p\overline{p}=rasrb$ , with $a<b$ .
(3) $\overline{p}$ is an inverse Lyndon word.

Given a word $w$ which is not an inverse Lyndon word, Proposition 11 suggests a method to identify the canonical pair $(p,\overline{p})$ associated with $w$ : just find the shortest nonempty prefix $z$ of $w$ which is not an inverse Lyndon word and then a factorization $z=p\overline{p}$ such that conditions (2) and (3) in Proposition 11 are satisfied.

The canonical inverse Lyndon factorization is then defined recursively as follows.

Definition 13.

Let $w\in\Sigma^{+}$ .
(Basis Step) If $w$ is an inverse Lyndon word, then $\operatorname{ICFL}(w)=(w)$ .
(Recursive Step) If $w$ is not an inverse Lyndon word, let $(p,\overline{p})$ be the canonical pair associated with $w$ and let $v\in\Sigma^{*}$ be such that $w=pv$ . Let $\operatorname{ICFL}(v)=(m^{\prime}_{1},\ldots,m^{\prime}_{k})$ and let $r,s\in\Sigma^{*}$ , $a,b\in\Sigma$ be such that $p=ras$ , $\overline{p}=rb$ with $a<b$ . Then

$$\operatorname{ICFL}(w)=\begin{cases}(p,\operatorname{ICFL}(v))&\mbox{ if }\overline{p}=rb\leq_{p}m^{\prime}_{1}\\ (pm^{\prime}_{1},m^{\prime}_{2},\ldots,m^{\prime}_{k})&\mbox{ if }m^{\prime}_{1}\leq_{p}r\end{cases}$$
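The recursive definition can be followed literally on short words. The sketch below is a brute-force illustration, not the linear-time procedure of [10]: it finds the canonical pair by searching for the shortest $r$ satisfying conditions (2) and (3) of Proposition 11, and then applies the two cases of Definition 13. The function names are hypothetical, and words are assumed to be nonempty Python strings ordered as the alphabet.

```python
def is_inverse_lyndon(w):
    """True if every nonempty proper suffix of w is strictly smaller than w."""
    return all(w[i:] < w for i in range(1, len(w)))


def canonical_pair(w):
    """Canonical pair (p, pbar) of a word w that is not inverse Lyndon (Proposition 11)."""
    # Condition (1): z is the shortest prefix of w that is not inverse Lyndon.
    z = next(w[:n] for n in range(1, len(w) + 1) if not is_inverse_lyndon(w[:n]))
    # Conditions (2) and (3): try |r| = 0, 1, ... so that the shortest r wins.
    for n in range(len(z) // 2):
        p, pbar = z[:-(n + 1)], z[-(n + 1):]
        r, b = pbar[:n], pbar[n]
        if p.startswith(r) and p[n] < b and is_inverse_lyndon(pbar):
            return p, pbar
    raise ValueError("no canonical pair found; is w an inverse Lyndon word?")


def icfl(w):
    """ICFL(w) computed by following Definition 13 step by step."""
    if is_inverse_lyndon(w):              # basis step
        return [w]
    p, pbar = canonical_pair(w)           # recursive step
    rest = icfl(w[len(p):])
    if rest[0].startswith(pbar):          # case pbar = rb <=_p m'_1
        return [p] + rest
    return [p + rest[0]] + rest[1:]       # remaining case of Definition 13: m'_1 <=_p r
```

For $w=dabadabdabdadac$ the canonical pair found is $(daba,dabd)$ and `icfl` returns `["daba", "dabdab", "dadac"]`, in agreement with Example 7. The word `"baababb"` exercises the second case and yields `["baaba", "bb"]`.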

The following example is in [10].

Example 7.

Let $\Sigma=\{a,b,c,d\}$ with $a<b<c<d$ , $w=dabadabdabdadac$ . We have $\operatorname{CFL}_{in}(w)=(daba,dab,dab,dadac)$ and $\operatorname{ICFL}(w)=(daba,dabdab,dadac)$ . Consider $z=dabdadacddbdc$ . We have $\operatorname{ICFL}(z)=\operatorname{CFL}_{in}(z)=(dab,dadac,ddbdc)$ .