Unveiling the connection between the Lyndon factorization and the canonical inverse Lyndon factorization via a border property
Abstract
The notions of Lyndon word and Lyndon factorization have been shown to have unexpected applications in theory as well as in developing novel algorithms on words. A counterpart to these notions are those of inverse Lyndon word and inverse Lyndon factorization. Differently from the Lyndon words, the inverse Lyndon words may be bordered. The relationship between the two factorizations is related to the inverse lexicographic ordering, and has only been recently explored. More precisely, a main open question is how to get an inverse Lyndon factorization from a classical Lyndon factorization under the inverse lexicographic ordering, named $\mathrm{CFL}_{in}$. In this paper we reveal a strong connection between these two factorizations where the border plays a relevant role. More precisely, we show two main results. We say that a factorization has the border property if a nonempty border of a factor cannot be a prefix of the next factor. First we show that there exists a unique inverse Lyndon factorization having the border property. Then we show that this unique factorization with the border property is the so-called canonical inverse Lyndon factorization, named $\mathrm{ICFL}$. By showing that $\mathrm{ICFL}$ is obtained by compacting factors of the Lyndon factorization over the inverse lexicographic ordering, we provide a linear time algorithm for computing $\mathrm{ICFL}$ from $\mathrm{CFL}_{in}$.
Keywords: Lyndon words · Lyndon factorization · Combinatorial algorithms on words
1 Introduction
The theoretical investigation of combinatorial properties of well-known word factorizations is a research topic that has recently attracted special interest, especially for improving the efficiency of algorithms. Among these, the Lyndon Factorization introduced by Chen, Fox, Lyndon in [1], named $\mathrm{CFL}$, undoubtedly stands out. Any word $w$ admits a unique factorization $\mathrm{CFL}(w)$, that is a lexicographically non-increasing sequence of factors which are Lyndon words. A Lyndon word is strictly lexicographically smaller than each of its proper cyclic shifts, or, equivalently, than each of its nonempty proper suffixes [2]. Interesting applications of the Lyndon factorization and Lyndon words are the development of the bijective Burrows-Wheeler Transforms [3, 4, 5] and a novel algorithm for sorting suffixes [6]. In particular, the notion of a Lyndon word has been re-discovered various times as a theoretical tool to locate short motifs [7] and relevant k-mers in bioinformatics applications [8]. In this line of research, Lyndon-based word factorizations have been explored to define a novel feature representation for biological sequences based on theoretical combinatorial properties proved to capture sequence similarities [9].
The notion of a Lyndon word has a counterpart that is the notion of an inverse Lyndon word, i.e., a word lexicographically greater than each of its nonempty proper suffixes. Inverting the relation between a word and its suffixes, as between Lyndon words and inverse Lyndon words, leads to different properties. Indeed, a word could admit more than one inverse Lyndon factorization, that is, a factorization into a nonincreasing product of inverse Lyndon words. For this reason, the Canonical Inverse Lyndon Factorization, named $\mathrm{ICFL}$, was introduced in [10]. $\mathrm{ICFL}$ maintains the main properties of $\mathrm{CFL}$: it is unique and can be computed in linear time. In addition, it maintains a similar Compatibility Property, used for obtaining the sorting of the suffixes of $w$ (“global suffixes”) by using the sorting of the suffixes of each factor of the factorization (“local suffixes”) [11]. Most notably, $\mathrm{ICFL}(w)$ has another interesting property [10, 12, 13]: we can provide an upper bound on the length of the longest common prefix of two substrings of a word $w$ starting from different positions.
A relationship between $\mathrm{ICFL}$ and $\mathrm{CFL}_{in}$ has been proved by using the notion of grouping [10]. Let $\mathrm{CFL}_{in}(w)$ be the Lyndon factorization of $w$ with respect to the inverse lexicographic order. It is proved that $\mathrm{ICFL}(w)$ is obtained by concatenating the factors of a non-increasing maximal chain with respect to the prefix order, denoted by $\mathcal{PMC}$, in $\mathrm{CFL}_{in}(w)$ (see Section 6). Despite this result, the connection between $\mathrm{CFL}_{in}$ and the inverse Lyndon factorizations still remained obscure, mainly due to the fact that a word may have multiple inverse Lyndon factorizations.
In this paper, we explore this connection between $\mathrm{CFL}_{in}$ and the inverse Lyndon factorizations. Our first main contribution consists in showing that there is a unique inverse Lyndon factorization of a word that has the border property. The border property states that any nonempty border of a factor cannot be a prefix of the next factor. We further highlight the aforementioned connection by proving that the inverse Lyndon factorization with the border property is a compact factorization (Definition 12), i.e., each inverse Lyndon factor is the concatenation of compact factors, each obtained by concatenating the longest sequence of identical words in a $\mathcal{PMC}$. We then show the second contribution of this paper: this unique factorization is $\mathrm{ICFL}$ itself, and we provide a simpler linear time algorithm for computing $\mathrm{ICFL}$. Our algorithm is based on a new property that characterizes $\mathrm{ICFL}(w)$: the last factor in an inverse Lyndon factorization of $w$ with the border property is the longest suffix of $w$ that is an inverse Lyndon word. Recall that the Lyndon factorization of $w$ has a similar property: the last factor is the longest suffix of $w$ that is a Lyndon word.
2 Words
Throughout this paper we follow [14, 15, 16, 17, 18] for the notations. We denote by $A^*$ the free monoid generated by a finite alphabet $A$ and we set $A^+ = A^* \setminus \{1\}$, where $1$ is the empty word. For a word $w \in A^*$, we denote by $|w|$ its length. A word $x \in A^*$ is a factor of $w \in A^*$ if there are $u_1, u_2 \in A^*$ such that $w = u_1 x u_2$. If $u_1 = 1$ (resp. $u_2 = 1$), then $x$ is a prefix (resp. suffix) of $w$. A factor (resp. prefix, suffix) $x$ of $w$ is proper if $x \neq w$. Two words $x, y$ are incomparable for the prefix order, denoted as $x \Join y$, if neither $x$ is a prefix of $y$ nor $y$ is a prefix of $x$. Otherwise, $x, y$ are comparable for the prefix order. We write $x \leq_p y$ if $x$ is a prefix of $y$ and $x \geq_p y$ if $y$ is a prefix of $x$. The notion of a pair of words comparable (or incomparable) for the suffix order is defined symmetrically.
We recall that, given a nonempty word $w$, a border of $w$ is a word which is both a proper prefix and a suffix of $w$ [19]. The longest proper prefix of $w$ which is a suffix of $w$ is also called the border of $w$ [19, 17]. A word $w \in A^+$ is bordered if it has a nonempty border. Otherwise, $w$ is unbordered. A nonempty word $w$ is primitive if $w = x^k$ implies $k = 1$. An unbordered word is primitive. A sesquipower of a word $x$ is a word $w = x^n p$ where $p$ is a proper prefix of $x$ and $n \geq 1$. Two words $x, y$ are called conjugate if there exist words $u, v$ such that $x = uv$, $y = vu$. The conjugacy relation is an equivalence relation. A conjugacy class is a class of this equivalence relation.
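The notions of border and primitivity recalled above can be made concrete with a few lines of Python. This is only an illustrative sketch: the function names `borders`, `is_bordered` and `is_primitive` are ours, words are modelled as Python strings, and the empty word is the empty string.

```python
def borders(w: str) -> list[str]:
    """Return every border of w, shortest first (the empty word included)."""
    # A border is a proper prefix of w that is also a suffix of w.
    return [w[:k] for k in range(len(w)) if w.endswith(w[:k])]


def is_bordered(w: str) -> bool:
    """True when w has at least one nonempty border."""
    return len(borders(w)) > 1


def is_primitive(w: str) -> bool:
    """True when w is nonempty and is not a proper power x^k with k >= 2."""
    # w is a proper power exactly when it reoccurs in w + w before offset len(w).
    return len(w) > 0 and (w + w).find(w, 1) == len(w)
```

For instance, `borders("abaab")` lists the empty word and `ab`, so `abaab` is bordered, while `aab` is unbordered and hence primitive.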
Definition 1.
Let $(A, <)$ be a totally ordered alphabet. The lexicographic (or alphabetic) order $\prec$ on $(A^*, <)$ is defined by setting $x \prec y$ if
• $x$ is a proper prefix of $y$, or
• $x = ras$, $y = rbt$, $a < b$, for $a, b \in A$ and $r, s, t \in A^*$.
In the next part of the paper we will implicitly refer to totally ordered alphabets. For two nonempty words $x, y$, we write $x \ll y$ if $x \prec y$ and $x$ is not a proper prefix of $y$ [20]. We also write $y \succ x$ if $x \prec y$. Basic properties of the lexicographic order are recalled below.
Lemma 1.
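For $x, y \in A^+$, the following properties hold.
(1) $x \prec y$ if and only if $zx \prec zy$, for every word $z$.
(2) If $x \ll y$, then $xu \ll yv$ for all words $u, v$.
(3) If $x \prec y \prec xz$ for a word $z$, then $y = xy'$ for some word $y'$ such that $y' \prec z$.
(4) If $x \ll y$ and $y \ll z$, then $x \ll z$.
Let $\mathcal{S}_1, \ldots, \mathcal{S}_t$ be sequences, with $\mathcal{S}_j = (s_{j,1}, \ldots, s_{j,r_j})$. For abbreviation, we let $(\mathcal{S}_1, \ldots, \mathcal{S}_t)$ stand for the sequence $(s_{1,1}, \ldots, s_{1,r_1}, \ldots, s_{t,1}, \ldots, s_{t,r_t})$.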
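The relations $\prec$ and $\ll$ and the four properties of Lemma 1 can be checked exhaustively on short words. The sketch below is an illustration, not a proof. It assumes that Python's built-in string comparison plays the role of $\prec$ (it is lexicographic and ranks a proper prefix below the longer word), and the helper names `prec`, `ll`, `words` and `lemma1_holds` are ours.

```python
from itertools import product


def prec(x: str, y: str) -> bool:
    """x ≺ y: strict lexicographic order (a proper prefix is smaller)."""
    return x < y


def ll(x: str, y: str) -> bool:
    """x ≪ y: x ≺ y and x is not a proper prefix of y."""
    return x < y and not y.startswith(x)


def words(alphabet: str, max_len: int):
    """Yield all nonempty words over `alphabet` of length at most max_len."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def lemma1_holds(alphabet: str = "ab", max_len: int = 3) -> bool:
    """Brute-force check of items (1)-(4) of Lemma 1 on short words."""
    nonempty = list(words(alphabet, max_len))
    any_word = [""] + nonempty
    for x in nonempty:
        for y in nonempty:
            for z in any_word:
                if prec(x, y) != prec(z + x, z + y):                  # item (1)
                    return False
                if prec(x, y) and prec(y, x + z):                     # item (3)
                    if not (y.startswith(x) and prec(y[len(x):], z)):
                        return False
                if ll(x, y) and ll(y, z) and not ll(x, z):            # item (4)
                    return False
            if ll(x, y):                                              # item (2)
                if not all(ll(x + u, y + v) for u in any_word for v in any_word):
                    return False
    return True
```

Note that `ll("ab", "abb")` is false although `prec("ab", "abb")` is true, which is exactly the distinction between $\ll$ and $\prec$.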
3 Lyndon words
Definition 2.
A Lyndon word $w \in A^+$ is a word which is primitive and the smallest one in its conjugacy class for the lexicographic order.
Example 1.
Let with . The words , , , , and are Lyndon words. On the contrary, , and are not Lyndon words.
Proposition 1.
Each Lyndon word is unbordered.
A class of conjugacy is also called a necklace and often identified with the minimal word for the lexicographic order in it. We will adopt this terminology. Then a word is a necklace if and only if it is a power of a Lyndon word. A prenecklace is a prefix of a necklace. Then clearly any nonempty prenecklace $w$ has the form $w = (uv)^k u$, where $uv$ is a Lyndon word, $u \in A^*$, $v \in A^+$, $k \geq 1$, that is, $w$ is a sesquipower of a Lyndon word $uv$. The following result has been proved in [21]. It shows that the nonempty prefixes of Lyndon words are exactly the nonempty prefixes of the powers of Lyndon words with the exclusion of $c^k$, where $c$ is the maximal letter and $k \geq 2$.
Proposition 2.
A word is a nonempty prefix of a Lyndon word if and only if it is a sesquipower of a Lyndon word distinct from $c^k$, where $c$ is the maximal letter and $k \geq 2$.
In the following $L = L_{(A^*, <)}$ will be the set of Lyndon words, totally ordered by the relation $\prec$ on $(A^*, <)$.
Theorem 3.1.
Any word $w \in A^+$ can be written in a unique way as a nonincreasing product of Lyndon words, i.e., in the form
$$w = \ell_1 \ell_2 \cdots \ell_h, \quad \text{with } \ell_j \in L \text{ and } \ell_1 \succeq \ell_2 \succeq \ldots \succeq \ell_h. \tag{3.1}$$
The sequence $\mathrm{CFL}(w) = (\ell_1, \ldots, \ell_h)$ in Eq. (3.1) is called the Lyndon decomposition (or Lyndon factorization) of $w$. It is denoted by $\mathrm{CFL}(w)$ because Theorem 3.1 is usually credited to Chen, Fox and Lyndon [1]. The following result, proved in [21], is necessary for our aims.
Corollary 1.
Let $w \in A^+$, let $\ell_1$ be its longest prefix which is a Lyndon word and let $w'$ be such that $w = \ell_1 w'$. If $w' \neq 1$, then $\mathrm{CFL}(w) = (\ell_1, \mathrm{CFL}(w'))$.
Sometimes we need to emphasize consecutive equal factors in $\mathrm{CFL}(w)$. We write $\mathrm{CFL}(w) = (\ell_1^{n_1}, \ldots, \ell_r^{n_r})$ to denote a tuple of $n_1 + \ldots + n_r$ Lyndon words, where $r > 0$, $n_1, \ldots, n_r \geq 1$. Precisely $\ell_1 \succ \ldots \succ \ell_r$ are Lyndon words, also named Lyndon factors of $w$. There is a linear time algorithm to compute the pair $(\ell_1, n_1)$ and thus, by iteration, the Lyndon factorization of $w$ [22, 17]. Linear time algorithms may also be found in [21] and in the more recent paper [23].
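As an illustration of the linear time computation mentioned above, the following sketch implements the standard formulation of Duval's algorithm, which emits each Lyndon factor together with its consecutive repetitions. The function names `lyndon_factorization` and `with_multiplicities` are ours, and letters are compared by Python's default character order.

```python
from itertools import groupby


def lyndon_factorization(w: str) -> list[str]:
    """Lyndon factorization of w (standard Duval algorithm, linear time)."""
    factors = []
    i, n = 0, len(w)
    while i < n:
        # Extend the longest prenecklace starting at position i.
        j, k = i + 1, i
        while j < n and w[k] <= w[j]:
            k = i if w[k] < w[j] else k + 1
            j += 1
        period = j - k  # length of the Lyndon word found at position i
        # Emit every complete repetition of that Lyndon word.
        while i <= k:
            factors.append(w[i:i + period])
            i += period
    return factors


def with_multiplicities(factors: list[str]) -> list[tuple[str, int]]:
    """Collapse consecutive equal factors into (factor, exponent) pairs."""
    return [(f, len(list(run))) for f, run in groupby(factors)]
```

For example, `with_multiplicities(lyndon_factorization("banana"))` yields the pairs `("b", 1)`, `("an", 2)`, `("a", 1)`, which is the exponent form of the factorization described above.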
4 Inverse Lyndon words
For the material in this section see [10, 12, 13]. Inverse Lyndon words are related to the inverse alphabetic order. Its definition is recalled below.
Definition 3.
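Let $(A, <)$ be a totally ordered alphabet. The inverse $<_{in}$ of $<$ is defined by
$$\forall a, b \in A \quad b <_{in} a \Leftrightarrow a < b.$$
The inverse lexicographic or inverse alphabetic order on $(A^*, <)$, denoted $\prec_{in}$, is the lexicographic order on $(A^*, <_{in})$.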
Example 2.
Let with . Then and . We have . Therefore and .
Of course for all $x, y \in A^*$ such that $x \Join y$,
$$y \prec_{in} x \Leftrightarrow x \prec y.$$
Moreover, in this case $x \ll y$. This justifies the adopted terminology.
From now on, $L_{in} = L_{(A^*, <_{in})}$ denotes the set of the Lyndon words on $A^*$ with respect to the inverse lexicographic order. Following [24], a word $w \in L_{in}$ will be named an anti-Lyndon word. Correspondingly, an anti-prenecklace will be a prefix of an anti-necklace, which in turn will be a necklace with respect to the inverse lexicographic order.
In the following, we denote by $\mathrm{CFL}_{in}(w)$ the Lyndon factorization of $w$ with respect to the inverse order $<_{in}$.
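A minimal sketch of how the factorization with respect to the inverse order can be computed: it is the standard Duval algorithm with every letter comparison reversed. The names `cfl_in` and `inv_key` are ours, and the word used in the usage note is only an illustrative input over the alphabet $a < b < c < d$.

```python
def inv_key(w: str) -> tuple[int, ...]:
    """Sort key realizing the inverse lexicographic order on words."""
    # Negating code points reverses the letter order; tuple comparison
    # still ranks a proper prefix below the longer word.
    return tuple(-ord(c) for c in w)


def cfl_in(w: str) -> list[str]:
    """Lyndon factorization of w with respect to the inverse letter order."""
    factors = []
    i, n = 0, len(w)
    while i < n:
        j, k = i + 1, i
        # Same scan as Duval's algorithm, with letter comparisons reversed.
        while j < n and w[k] >= w[j]:
            k = i if w[k] > w[j] else k + 1
            j += 1
        period = j - k
        while i <= k:
            factors.append(w[i:i + period])
            i += period
    return factors
```

On the input `dabadabdabdadac` this returns `daba`, `dab`, `dab`, `dadac`, a sequence that is non-increasing for `inv_key`.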
Definition 4.
A word $w \in A^+$ is an inverse Lyndon word if $s \prec w$, for each nonempty proper suffix $s$ of $w$.
Example 3.
The words , , , , , and are inverse Lyndon words on , with . On the contrary, is not an inverse Lyndon word since . Analogously, and thus is not an inverse Lyndon word.
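Definition 4 translates directly into a quadratic-time test. The sketch below is a naive check meant for experimentation only. The name `is_inverse_lyndon` is ours, and Python's string comparison is assumed to realize $\prec$.

```python
def is_inverse_lyndon(w: str) -> bool:
    """True when w is nonempty and strictly greater than each of its
    nonempty proper suffixes for the lexicographic order."""
    return len(w) > 0 and all(w[i:] < w for i in range(1, len(w)))
```

For instance, over $a < b < c < d$ the words `dabda` and `bab` pass the test (both are bordered), whereas `aab` fails because its suffix `ab` is greater than `aab`.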
The following result, proved in [10, 13], and also in [25], summarizes some properties of the inverse Lyndon words.
Proposition 3.
Let $w \in A^+$. Then we have
1. The word $w$ is an anti-Lyndon word if and only if it is an unbordered inverse Lyndon word.
2. The word $w$ is an inverse Lyndon word if and only if $w$ is a nonempty anti-prenecklace.
3. If $w$ is an inverse Lyndon word, then any nonempty prefix of $w$ is an inverse Lyndon word.
Definition 5.
An inverse Lyndon factorization of a word $w \in A^+$ is a sequence $(m_1, \ldots, m_k)$ of inverse Lyndon words such that $m_1 \cdots m_k = w$ and $m_i \ll m_{i+1}$, $1 \leq i \leq k-1$.
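Items 1 and 3 of Proposition 3 can be sanity-checked by brute force on short words. The sketch below is an experiment, not a proof, and all function names are ours. An anti-Lyndon word is tested as a word strictly smaller than each of its nonempty proper suffixes for the inverse lexicographic order.

```python
from itertools import product


def inv_key(w: str) -> tuple[int, ...]:
    """Sort key realizing the inverse lexicographic order."""
    return tuple(-ord(c) for c in w)


def is_anti_lyndon(w: str) -> bool:
    """Lyndon word for the inverse order: smaller than all proper suffixes."""
    return len(w) > 0 and all(inv_key(w) < inv_key(w[i:]) for i in range(1, len(w)))


def is_inverse_lyndon(w: str) -> bool:
    """Greater than all nonempty proper suffixes for the usual order."""
    return len(w) > 0 and all(w[i:] < w for i in range(1, len(w)))


def is_unbordered(w: str) -> bool:
    """True when no nonempty proper prefix of w is also a suffix of w."""
    return not any(w.endswith(w[:k]) for k in range(1, len(w)))


def check_proposition3(alphabet: str = "abc", max_len: int = 6) -> bool:
    """Check items 1 and 3 of Proposition 3 on all words up to max_len."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            # Item 1: anti-Lyndon <=> unbordered inverse Lyndon.
            if is_anti_lyndon(w) != (is_unbordered(w) and is_inverse_lyndon(w)):
                return False
            # Item 3: nonempty prefixes of inverse Lyndon words are inverse Lyndon.
            if is_inverse_lyndon(w) and not all(is_inverse_lyndon(w[:k]) for k in range(1, n)):
                return False
    return True
```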
As the following example in [10] shows, a word may have different inverse Lyndon factorizations.
Example 4.
Let with , . It is easy to see that , , are all inverse Lyndon factorizations of .
5 The border property
In this section we prove the main result of this paper, namely, for any nonempty word $w$, there exists a unique inverse Lyndon factorization of $w$ which has a special property, named the border property.
Definition 6 (Border property).
Let $w \in A^+$. A factorization $(m_1, \ldots, m_k)$ of $w$ has the border property if each nonempty border $z$ of $m_i$ is not a prefix of $m_{i+1}$, $1 \leq i \leq k-1$.
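The two conditions at play here, being an inverse Lyndon factorization (Definition 5) and having the border property (Definition 6), can be tested on a candidate list of factors as follows. This is a naive sketch with names of our choosing. The factor lists in the usage note are illustrative inputs over $a < b < c < d$.

```python
def is_inverse_lyndon(w: str) -> bool:
    """Greater than all nonempty proper suffixes for the usual order."""
    return len(w) > 0 and all(w[i:] < w for i in range(1, len(w)))


def ll(x: str, y: str) -> bool:
    """x ≪ y: x is smaller than y and is not a proper prefix of y."""
    return x < y and not y.startswith(x)


def nonempty_borders(w: str) -> list[str]:
    """Nonempty proper prefixes of w that are also suffixes of w."""
    return [w[:k] for k in range(1, len(w)) if w.endswith(w[:k])]


def is_inverse_lyndon_factorization(factors: list[str]) -> bool:
    """Definition 5: inverse Lyndon factors with m_i ≪ m_{i+1}."""
    return (all(is_inverse_lyndon(m) for m in factors)
            and all(ll(a, b) for a, b in zip(factors, factors[1:])))


def has_border_property(factors: list[str]) -> bool:
    """Definition 6: no nonempty border of m_i is a prefix of m_{i+1}."""
    return all(not nxt.startswith(z)
               for cur, nxt in zip(factors, factors[1:])
               for z in nonempty_borders(cur))
```

The list `["daba", "dabdab", "dadac"]` satisfies both predicates, while `["aba", "ab"]` violates the border property because the border `a` of `aba` is a prefix of `ab`.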
We first prove a fundamental property of the inverse Lyndon factorizations of which have the border property.
Lemma 2.
Let , let be an inverse Lyndon factorization of having the border property. If is a nonempty border of , , then there exists a nonempty prefix of such that and .
Proof.
Let , let be an inverse Lyndon factorization of having the border property, let be a nonempty border of , . We distinguish two cases: either or .
Assume . By hypothesis is an inverse Lyndon factorization, hence , that is, there are , , such that and , . Obviously , thus there is such that . Consequently, and our claim holds with .
Assume . Let be the nonempty prefix of such that . Clearly because has the border property. Since and are two different nonempty words of the same length, either or . The first case leads to a contradiction because if then by Lemma 1 and this contradicts the fact that is an inverse Lyndon factorization. Thus, and the proof is complete. ∎
Proposition 4.
For each , there exists a unique inverse Lyndon factorization of having the border property.
Proof.
The proof is by induction on . If , then and statement clearly holds. Thus assume . Let and two inverse Lyndon factorizations of having the border property. Thus
(5.1) |
If and or , clearly and . Analogously, if , and , then and , would be two inverse Lyndon factorizations of having the border property, where is such that . Of course, . By induction hypothesis, , hence .
By contradiction, let . Assume (similar arguments apply if ). The word is a proper suffix of . Clearly . Let be the smallest integer such that is a proper suffix of , , that is,
(5.2) |
where is a suffix of .
Notice that
(5.3) |
Indeed, if , then, by Eq. (5.2), we would have , which is impossible because is an inverse Lyndon word.
The word is a nonempty proper suffix of since otherwise we would have , contrary to Eq. (5.3). Since is an inverse Lyndon word and is a nonempty proper suffix of , either or .
6 Groupings and compact factorizations
In this section we prove a structural property of an inverse Lyndon factorization having the border property, namely it is a compact factorization. This result is crucial to characterize the relationship between and the factorization into inverse Lyndon words of . First we report the notion of grouping given in [10]. We refer to [10, 13] for a detailed and complete discussion on this topic.
Let $\mathrm{CFL}_{in}(w) = (\ell_1, \ldots, \ell_h)$, where $\ell_1 \succeq_{in} \ell_2 \succeq_{in} \ldots \succeq_{in} \ell_h$. Consider the partial order $\geq_p$, where $x \geq_p y$ if $y$ is a prefix of $x$. Recall that a chain is a set of pairwise comparable elements. We say that a chain is maximal if it is not strictly contained in any other chain. A non-increasing (maximal) chain in $\mathrm{CFL}_{in}(w)$ is the sequence corresponding to a (maximal) chain in the multiset $\{\ell_1, \ldots, \ell_h\}$ with respect to $\geq_p$. We denote by $\mathcal{PMC}$ a non-increasing maximal chain in $\mathrm{CFL}_{in}(w)$. Looking at the definition of the (inverse) lexicographic order, it is easy to see that a $\mathcal{PMC}$ is a sequence of consecutive factors in $\mathrm{CFL}_{in}(w)$. Moreover $\mathrm{CFL}_{in}(w)$ is the concatenation of its $\mathcal{PMC}$. The formal definitions are given below.
Definition 7.
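Let $w \in A^+$, let $\mathrm{CFL}_{in}(w) = (\ell_1, \ldots, \ell_h)$ and let $1 \leq r \leq s \leq h$. We say that $\ell_r, \ell_{r+1}, \ldots, \ell_s$ is a non-increasing maximal chain for the prefix order in $\mathrm{CFL}_{in}(w)$, abbreviated $\mathcal{PMC}$, if $\ell_r \geq_p \ell_{r+1} \geq_p \ldots \geq_p \ell_s$. Moreover, if $r > 1$, then $\ell_{r-1} \ll \ell_r$, if $s < h$, then $\ell_s \ll \ell_{s+1}$. Two $\mathcal{PMC}$ $\mathcal{C}_1 = \ell_r, \ldots, \ell_s$, $\mathcal{C}_2 = \ell_{r'}, \ldots, \ell_{s'}$ are consecutive if $r' = s + 1$ (or $r = s' + 1$).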
Definition 8.
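Let $w \in A^+$, let $\mathrm{CFL}_{in}(w) = (\ell_1, \ldots, \ell_h)$. We say that $(\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_s)$ is the decomposition of $\mathrm{CFL}_{in}(w)$ into its non-increasing maximal chains for the prefix order if the following holds
(1) Each $\mathcal{C}_j$ is a non-increasing maximal chain in $\mathrm{CFL}_{in}(w)$.
(2) $\mathcal{C}_j$ and $\mathcal{C}_{j+1}$ are consecutive, $1 \leq j \leq s-1$.
(3) $\mathrm{CFL}_{in}(w)$ is the concatenation of the sequences $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_s$.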
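The decomposition into non-increasing maximal chains can be sketched as a single left-to-right scan: a chain continues while the next factor is a prefix of the previous one and breaks otherwise. The function name `pmc_decomposition` is ours, and its input is assumed to be the list of factors of the Lyndon factorization under the inverse order, computed elsewhere.

```python
def pmc_decomposition(factors: list[str]) -> list[list[str]]:
    """Split a list of factors into maximal runs in which every factor
    is a prefix of the one preceding it."""
    chains: list[list[str]] = []
    for f in factors:
        if chains and chains[-1][-1].startswith(f):
            chains[-1].append(f)   # f extends the current chain
        else:
            chains.append([f])     # prefix relation fails: start a new chain
    return chains
```

For the illustrative factor list `daba`, `dab`, `dab`, `dadac`, the scan produces the two chains (`daba`, `dab`, `dab`) and (`dadac`).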
Definition 9.
Let $w \in A^+$. We say that $(m_1, \ldots, m_k)$ is a grouping of $\mathrm{CFL}_{in}(w)$ if the following holds
(1) $(m_1, \ldots, m_k)$ is an inverse Lyndon factorization of $w$.
(2) Each factor $m_j$ is the product of consecutive factors in a $\mathcal{PMC}$ in $\mathrm{CFL}_{in}(w)$.
Example 5.
Let , , and . We have . The decomposition of into its is . Moreover, is a grouping of but for the inverse Lyndon factorization this is no longer true.
Next, let . We have . The decomposition of into its is . Moreover, and are two groupings of .
For our aims, we need to consider the words that are concatenations of equal factors in . This approach leads to a refinement of the partition of into non-increasing maximal chains for the prefix order, as defined below.
Definition 10 (Compact sequences).
Let be a non-increasing maximal chain for the prefix order in . The decomposition of into maximal compact sequences is the sequence such that
-
(1)
-
(2)
For every , , consists of the longest sequence of consecutive identical elements in
Let be the decomposition of into its non-increasing maximal chains for the prefix order. The decomposition of into its maximal compact sequences is obtained by replacing each in with its decomposition into maximal compact sequences.
Definition 11 (Compact factor).
Let be the decomposition of into its maximal compact sequences. For every , , the concatenation of the elements in is a compact factor in .
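Since equal words are always comparable for the prefix order, maximal runs of identical consecutive factors never straddle two chains, so the compact factors can be obtained by a plain run-length grouping of the factor list. The sketch below assumes that reading. The name `compact_factors` is ours, and the input is the list of factors of the Lyndon factorization under the inverse order.

```python
from itertools import groupby


def compact_factors(factors: list[str]) -> list[str]:
    """Concatenate each maximal run of identical consecutive factors."""
    return ["".join(run) for _, run in groupby(factors)]
```

For the illustrative factor list `daba`, `dab`, `dab`, `dadac`, the compact factors are `daba`, `dabdab` and `dadac`.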
Definition 12 (Compact factorization).
Let . We say that is a compact factorization of if is an inverse Lyndon factorization of and each , , is a concatenation of compact factors in .
Example 6.
Consider again over , , as in Example 5. The decomposition of into its maximal compact sequences is . The compact factors in are . Moreover, is a compact factorization whereas is a grouping of which is not a compact factorization.
Proposition 5.
Let . If is an inverse Lyndon factorization of having the border property, then is a compact factorization of .
Proof.
Let , let be an inverse Lyndon factorization of having the border property. Let , where and are anti-Lyndon words. First we prove that is a grouping of by induction on . If the statement clearly holds, thus assume .
The words and are comparable for the prefix order, hence either is a proper prefix of or is a prefix of . Suppose that is a proper prefix of . Thus, there are , , and , , such that and . Necessarily it turns out because otherwise , hence, by Lemma 1, and this contradicts the fact that is an anti-Lyndon word. In conclusion and . We know that , that is, there are , , such that and , . If , then is a nonempty border of and if , then there is a word such that which implies . Both cases again contradict the fact that is an anti-Lyndon word.
Therefore, is a prefix of . Let be the largest integer such that , , , , . Let be the decomposition of into its non-increasing maximal chains for the prefix order. We claim that is a prefix of the concatenation of the elements of , thus is a chain for the prefix order. If we are done. Let . By contradiction, assume that there is , , such that . Therefore, which implies and this contradicts the fact that is an inverse Lyndon word.
We now prove that . Assume . As a preliminary step, we prove that there is no nonempty prefix of such that and . In fact, if such a prefix existed, there would be , , such that and , . If then , where would be a nonempty prefix of , thus a nonempty border of (recall that with ). If , then there would be a word such that which would imply . Both cases contradict the fact that is an anti-Lyndon word.
If , then either is a prefix of or . If were a prefix of , then would be a nonempty border of . By Lemma 2 there would exist a nonempty prefix of such that and which contradicts our preliminary step.
If it were true that then there would be , , such that and , . If , then there would be a word such that which would imply and this contradicts the fact that is an inverse Lyndon word. If , then is a prefix of and is a nonempty border of . By Lemma 2 again, there would exist a nonempty prefix of such that and which contradicts again our preliminary step.
Let be such that . If we are done. Assume . Clearly . Of course is an inverse Lyndon factorization of having the border property. Moreover, by Corollary 1, and is the decomposition of into its non-increasing maximal chains for the prefix order, where is defined by . By induction hypothesis, is a grouping of and consequently is a grouping of .
Finally, to obtain a contradiction, suppose that is a grouping of having the border property such that is not a compact factorization of . To adapt the notation to the proof, set , where , and are anti-Lyndon words. By Definitions 9 and 12, there exist integers , , , , , , such that ends with and starts with . Thus, by Definition 9, is a prefix of . Moreover, is a proper prefix of . Indeed otherwise which is impossible because ( is an inverse Lyndon factorization). Thus is a nonempty border of . The word is also a prefix of and this contradicts the fact that has the border property. ∎
7 The canonical inverse Lyndon factorization: the algorithm
In this section we state another relevant result of the paper related to the main one stated in Section 5. We have shown that a nonempty word $w$ can have more than one inverse Lyndon factorization but has a unique inverse Lyndon factorization with the border property (Example 4, Proposition 4). Below we highlight that this unique factorization is the canonical one defined in [10, 13].
This special inverse Lyndon factorization is denoted by $\mathrm{ICFL}$ because it is the counterpart of the Lyndon factorization $\mathrm{CFL}$ of $w$, when we use (I)nverse Lyndon words as factors. Indeed, in [10] it has been proved that $\mathrm{ICFL}(w)$ can be computed in linear time and it is uniquely determined for a word $w$. See Section A for definitions of $\mathrm{ICFL}(w)$ and all related notions. Since $\mathrm{ICFL}(w)$ is the unique inverse Lyndon factorization with the border property, from now on these two notions will be synonymous.
Below we show another interesting property of $\mathrm{ICFL}(w)$: the last factor of the factorization is the longest suffix that is an inverse Lyndon word. Based on this result we provide a new simpler linear algorithm for computing $\mathrm{ICFL}(w)$.
We begin by recalling previously proved results on $\mathrm{ICFL}(w)$, namely Proposition 7.7 in [10] and Proposition 9.5 in [13]. They are merged into Proposition 6.
Proposition 6.
For any $w \in A^+$, $\mathrm{ICFL}(w)$ is a grouping of $\mathrm{CFL}_{in}(w)$. Moreover, $\mathrm{ICFL}(w)$ has the border property.
Corollary 2.
For each $w \in A^+$, $\mathrm{ICFL}(w)$ is a compact factorization and it is the unique inverse Lyndon factorization of $w$ having the border property.
We end the section with a result which has been proved in [25] and which will be used in the next section.
Proposition 7.
Let , let and let be the decomposition of into its non-increasing maximal chains for the prefix order. Let be words such that , . Then is the concatenation of the sequences , that is,
(7.1) |
We can now state some results useful to prove the correctness of our algorithm. First we observe that, thanks to Corollary 2 and Proposition 7, to compute $\mathrm{ICFL}(w)$ we can limit ourselves to the case in which $\mathrm{CFL}_{in}(w)$ is a chain with respect to the prefix order.
Lemma 3.
Let be anti-Lyndon words over that form a non-increasing chain for the prefix order, that is, . If , then .
Proof.
By contradiction, assume that is a prefix of . Then, where either and or is a nonempty prefix of , . Thus either or is a nonempty border of , a contradiction in both cases. ∎
Remark 1.
[13] Let $x, y$ be two different borders of the same word $w$. If $x$ is shorter than $y$, then $x$ is a border of $y$.
Proposition 8.
Let and assume that form a non-increasing chain for the prefix order. If is a factorization of such that each , , is a concatenation of compact factors in , then has the border property.
Proof.
Let and assume that form a non-increasing chain for the prefix order. Let be a factorization of such that each , , is a concatenation of compact factors in . The proof is by induction on . If , then the conclusion follows immediately. Assume .
Let be such that . It is clear that is a factorization of such that each , , is a concatenation of compact factors in . Thus, by induction hypothesis, has the border property. It remains to prove that each nonempty border of is not a prefix of . The proof is straightforward if is unbordered, thus assume that is bordered.
Let , where are the compact factors in , that is are anti-Lyndon words such that . Since is a concatenation of compact factors in , there is , such that
Notice that is a nonempty border of . Furthermore, since is unbordered, is the shortest nonempty border of .
Proposition 9.
Let $w \in A^+$ and let $(m_1, \ldots, m_k)$ be the unique inverse Lyndon factorization of $w$ having the border property. Then $m_k$ is the longest suffix of $w$ which is an inverse Lyndon word.
Proof.
Let and let be the unique inverse Lyndon factorization of having the border property. If we are done. Thus suppose . By contradiction, suppose that is not the longest suffix of that is an inverse Lyndon word. Let be such longest suffix. Thus, there exist a nonempty suffix of , such that . Furthermore must be a proper suffix of or we would have contradicting the hypothesis that is inverse Lyndon.
We claim that . Indeed, since is an inverse Lyndon word, it holds . Thus, if or , it immediately follows that . Otherwise, and is a nonempty border of . By Lemma 2 applied to , with , there must exist a prefix of such that , hence .
Since , we have , contradicting the hypothesis that is an inverse Lyndon word. ∎
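The quantity characterized by Proposition 9 can be computed naively by testing suffixes from the longest to the shortest. The sketch below is a cubic-time reference implementation for experiments only, not the linear time method of this paper. The function names are ours, and the test words are illustrative inputs over $a < b < c < d$.

```python
def is_inverse_lyndon(w: str) -> bool:
    """Greater than all nonempty proper suffixes for the usual order."""
    return len(w) > 0 and all(w[i:] < w for i in range(1, len(w)))


def longest_inverse_lyndon_suffix(w: str) -> str:
    """Return the longest suffix of w that is an inverse Lyndon word
    (the empty string when w is empty)."""
    for start in range(len(w)):
        suffix = w[start:]
        if is_inverse_lyndon(suffix):
            return suffix
    return ""
```

On `dabadabdabdadac` the function returns `dadac`, and on `abab` it returns `bab`.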
Proposition 10.
Let $m$ be an inverse Lyndon word, and let $\ell$ be an anti-Lyndon word. Then:
1. If $\ell \ll m$, then for every $k \geq 1$, $\ell^k m$ is not an inverse Lyndon word.
2. If $\ell m$ is not an inverse Lyndon word, then $\ell \ll m$. Furthermore, for every $k \geq 1$, $m$ is the longest suffix of $\ell^k m$ that is an inverse Lyndon word.
Proof.
By Lemma 1, the proof of item 1 is immediate. Suppose is not inverse Lyndon. Then, there exists a proper suffix of such that , hence . Since is anti-Lyndon, for every proper suffix of it follows and consequently . Thus, must be a suffix of . Since is an inverse Lyndon word, one of the following three cases holds: (1) ; (2) ; (3) . By , in each of the three cases it is evident that . Thus there are and with such that , . If , then clearly . Otherwise, and there is such that . Consequently, starts with . On the other hand, is a border of , hence and is a suffix of . This contradicts the fact that is an inverse Lyndon word.
For every , is a suffix of that is an inverse Lyndon word. Let be a proper nonempty suffix of . Of course . The word is not an inverse Lyndon word, otherwise we would have , a contradiction. Moreover, by Lemma 1, for any , , we have and is not an inverse Lyndon word. Finally, by item 1, is not an inverse Lyndon word. ∎
We now describe Algorithm 1. Its function computes the unique compact factorization of $w$ having the border property. First, at line 2, the decomposition of $\mathrm{CFL}_{in}(w)$ into its compact factors is computed. Then, the factorization of $w$ is carried out from right to left. Specifically, in accordance with Proposition 9, the for-loop at lines 5–10 searches for the longest suffix of the portion of $w$ still to be factorized that is an inverse Lyndon word. The update of the current suffix is managed by iteratively applying Proposition 10 at line 6. Once such a longest suffix is found (that is, when the condition at line 6 is true), it is added to the growing factorization and a new search for the longest suffix is started on the remaining portion of the string. Otherwise, at line 10, the suffix is extended. In the end, the complete factorization is returned.
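The right-to-left procedure just described can be sketched as follows. This is a reconstruction from the prose and from Propositions 9 and 10, not a transcription of the listing of Algorithm 1. The names `cfl_in`, `ll` and `icfl` are ours, and the test applied to each compact factor (whether it is $\ll$ the current suffix) is our assumption about the condition at line 6.

```python
from itertools import groupby


def cfl_in(w: str) -> list[str]:
    """Lyndon factorization for the inverse letter order (reversed Duval)."""
    factors, i, n = [], 0, len(w)
    while i < n:
        j, k = i + 1, i
        while j < n and w[k] >= w[j]:
            k = i if w[k] > w[j] else k + 1
            j += 1
        period = j - k
        while i <= k:
            factors.append(w[i:i + period])
            i += period
    return factors


def ll(x: str, y: str) -> bool:
    """x ≪ y: x is smaller than y and is not a proper prefix of y."""
    return x < y and not y.startswith(x)


def icfl(w: str) -> list[str]:
    """Sketch: inverse Lyndon factorization of w with the border property."""
    if not w:
        return []
    # Compact factors: maximal runs of equal consecutive factors, concatenated.
    compact = ["".join(run) for _, run in groupby(cfl_in(w))]
    result = []
    m = compact[-1]                      # current candidate for the last factor
    for block in reversed(compact[:-1]):
        if ll(block, m):
            result.append(m)             # m cannot be extended: emit it
            m = block
        else:
            m = block + m                # block followed by m is still inverse Lyndon
    result.append(m)
    result.reverse()
    return result
```

On `dabadabdabdadac` the sketch returns `daba`, `dabdab`, `dadac`, and on `dabda`, which is itself an inverse Lyndon word, it returns the single factor `dabda`. For readability the sketch builds new strings at each step; an implementation aiming at linear time would represent factors as index pairs instead.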
7.1 Correctness and complexity
We now prove that Algorithm 1 is correct, that is, it computes the unique inverse Lyndon factorization of $w$ having the border property, namely $\mathrm{ICFL}(w)$. Formally:
Lemma 4.
Let , and let be the result of . Then, .
Proof.
Let be the decomposition of into its compact factors, and let . We will denote by (resp. ) the value of (resp. ) at the end of iteration . We will prove the following loop invariant: at the end of iteration , sequence is a compact factorization of having the border property. The claimed result will follow by Corollary 2.
- Initialization.
-
Prior to entering the loop, , where the last equality follows from Proposition 9.
- Maintenance.
-
Let . By induction hypothesis, .
- Termination.
-
After iteration , sequence .
Finally, line 11 sets . ∎
Function has time complexity that is linear in the length of . Indeed, the sequence of compact factors obtained at line 2 can be computed in linear time in the length of by a simple modification of Duval’s algorithm (see [17]). After that, each iteration of loop 5–10 can be implemented to run in time . Indeed, condition can be checked by naively comparing against . Furthermore, the update of and can be done in constant time: in fact, , , and can all be implemented as pairs of indexes (in case of the former three) or as a list of indexes (in case of the latter) of .
8 Conclusions
We uncover a special connection between the Lyndon factorization under the inverse lexicographic ordering, named $\mathrm{CFL}_{in}$, and the canonical inverse Lyndon factorization, named $\mathrm{ICFL}$: there exists a unique inverse Lyndon factorization having the border property and this unique factorization is $\mathrm{ICFL}$. Moreover each inverse Lyndon factor of $\mathrm{ICFL}$ is obtained by concatenating compact factors of $\mathrm{CFL}_{in}$. These properties give a constrained structure to $\mathrm{ICFL}$ that deserves to be further explored to characterize properties of words. In particular, we believe the characterization of $\mathrm{ICFL}$ as a compact factorization, proved in the paper, could highlight novel properties related to the compression of a word, as investigated in [26]. Moreover, the number of compact factors seems to be a measure of repetitiveness of the word that could also be used in speeding up suffix sorting of a word.
Finally, we believe that the characterization of $\mathrm{ICFL}$ in terms of $\mathrm{CFL}_{in}$ may be used to extend to $\mathrm{ICFL}$ the conservation property proved in [13] for $\mathrm{CFL}$. This property shows that the Lyndon factorization of a word $w$ preserves common factors with the factorization of a superstring of $w$. This extends the conservation of Lyndon factors explored for the product of two words [26, 27].
Acknowledgments
This research was supported by the European Union’s Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie grant agreement PANGAIA No. 872539, by MUR 2022YRB97K, PINC, Pangenome Informatics: from Theory to Applications, and by INdAM-GNCS Project 2023.
References
- [1] Kuo-Tsai Chen, Ralph H. Fox, and Roger C. Lyndon. Free Differential calculus, IV. The Quotient Groups of the Lower Central Series. Ann. Math., 68:81–95, 1958.
- [2] Roger Lyndon. On Burnside’s problem. Trans. Amer. Math. Soc., 77:202–215, 1954.
- [3] Hideo Bannai, Juha Kärkkäinen, Dominik Köppl, and Marcin Piatkowski. Constructing and indexing the bijective and extended Burrows-Wheeler transform. Information and Computation, 297:105153, 2024.
- [4] Elena Biagi, Davide Cenzato, Zsuzsanna Lipták, and Giuseppe Romana. On the number of equal-letter runs of the Bijective Burrows-Wheeler Transform. In CEUR Workshop Proceedings, volume 3587, pages 129–142. R. Piskac c/o Redaktion Sun SITE, Informatik V, RWTH Aachen, 2023.
- [5] Dominik Köppl, Daiki Hashimoto, Diptarama Hendrian, and Ayumi Shinohara. In-place bijective Burrows-Wheeler Transforms. In Inge Li Gørtz and Oren Weimann, editors, 31st Annual Symposium on Combinatorial Pattern Matching, CPM 2020, June 17-19, 2020, Copenhagen, Denmark, volume 161 of LIPIcs, pages 21:1–21:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
- [6] Nico Bertram, Jonas Ellert, and Johannes Fischer. Lyndon words accelerate suffix sorting. In Petra Mutzel, Rasmus Pagh, and Grzegorz Herman, editors, 29th Annual European Symposium on Algorithms, ESA 2021, September 6-8, 2021, Lisbon, Portugal (Virtual Conference), volume 204 of LIPIcs, pages 15:1–15:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
- [7] Olivier Delgrange and Eric Rivals. Star: an algorithm to search for tandem approximate repeats. Bioinformatics, 20(16):2812–2820, 2004.
- [8] Igor Martayan, Bastien Cazaux, Antoine Limasset, and Camille Marchet. Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets. bioRxiv, 2024.
- [9] Paola Bonizzoni, Matteo Costantini, Clelia De Felice, Alessia Petescia, Yuri Pirola, Marco Previtali, Raffaella Rizzi, Jens Stoye, Rocco Zaccagnino, and Rosalba Zizza. Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches. Inf. Sci., 607:458–476, 2022.
- [10] Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, and Rosalba Zizza. Inverse Lyndon words and inverse Lyndon factorizations of words. Adv. Appl. Math., 101:281–319, 2018.
- [11] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. Suffix array and Lyndon factorization of a text. J. Discrete Algorithms, 28:2–8, 2014.
- [12] Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, and Rosalba Zizza. Lyndon words versus inverse Lyndon words: Queries on suffixes and bordered words. In Alberto Leporati, Carlos Martín-Vide, Dana Shapira, and Claudio Zandron, editors, Language and Automata Theory and Applications - 14th International Conference, LATA 2020, Milan, Italy, March 4-6, 2020, Proceedings, volume 12038 of Lecture Notes in Computer Science, pages 385–396. Springer, 2020.
- [13] Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, and Rosalba Zizza. On the longest common prefix of suffixes in an inverse Lyndon factorization and other properties. Theor. Comput. Sci., 862:24–41, 2021.
- [14] Jean Berstel, Dominique Perrin, and Christophe Reutenauer. Codes and Automata. Encyclopedia of Mathematics and its Applications 129, Cambridge University Press, 2009.
- [15] Christian Choffrut and Juhani Karhumäki. Combinatorics of Words. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, Vol. 1, pages 329–438. Springer-Verlag, Berlin, Heidelberg, 1997.
- [16] M. Lothaire. Algebraic Combinatorics on Words, Encyclopedia Math. Appl., volume 90. Cambridge University Press, 2002.
- [17] M. Lothaire. Applied Combinatorics on Words. Cambridge University Press, 2005.
- [18] Christophe Reutenauer. Free Lie algebras. In Handbook of Algebra, London Mathematical Society Monographs. Oxford Science Publications, 1993.
- [19] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
- [20] Hideo Bannai, I Tomohiro, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. A new characterization of maximal repetitions by Lyndon trees. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pages 562–571, 2015.
- [21] Jean-Pierre Duval. Factorizing Words over an Ordered Alphabet. J. Algorithms, 4(4):363–381, 1983.
- [22] Harold Fredricksen and James Maiorana. Necklaces of beads in k colors and k-ary de Bruijn sequences. Discrete Math., 23(3):207–210, 1978.
- [23] Sukhpal Singh Ghuman, Emanuele Giaquinta, and Jorma Tarhio. Alternative Algorithms for Lyndon Factorization. In Proceedings of the Prague Stringology Conference 2014, Prague, Czech Republic, September 1-3, 2014, pages 169–178, 2014.
- [24] Daniele A. Gewurz and Francesca Merola. Numeration and enumeration. Eur. J. Comb., 33(7):1547–1556, 2012.
- [25] Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, and Rosalba Zizza. From the Lyndon factorization to the Canonical Inverse Lyndon factorization: back and forth. Under submission, arXiv, 2024.
- [26] Tomohiro I, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Faster Lyndon factorization algorithms for SLP and LZ78 compressed text. Theor. Comput. Sci., 656:215–224, 2016.
- [27] Alberto Apostolico and Maxime Crochemore. Fast parallel Lyndon factorization with applications. Mathematical systems theory, 28(2):89–108, 1995.
Appendix A The canonical inverse Lyndon factorization
In this section we summarize the relevant material on the canonical inverse Lyndon factorization and we refer to [10, 13] for a thorough discussion on this topic.
If $w$ is an inverse Lyndon word, then $\mathrm{ICFL}(w) = (w)$. Otherwise, $\mathrm{ICFL}(w)$ is recursively defined. The first factor of $\mathrm{ICFL}(w)$ is obtained from a special pair of words, named the canonical pair associated with $w$, which in turn is obtained from the shortest nonempty prefix $z$ of $w$ such that $z$ is not an inverse Lyndon word. Proposition 6.2 in [13] provides the following characterization of the pair $(p, \bar{p})$.
Proposition 11.
Let $w \in \Sigma^{+}$ be a word which is not an inverse Lyndon word. A pair of words $(p, \bar{p})$ is the canonical pair associated with $w$ if and only if the following conditions are satisfied.
(1) $z = p\bar{p}$ is the shortest nonempty prefix of $w$ which is not an inverse Lyndon word.
(2) $p = ras$ and $\bar{p} = rb$, where $r, s \in \Sigma^{*}$, $a, b \in \Sigma$, and $r$ is the shortest prefix of $p\bar{p}$ such that $p\bar{p} = rasrb$, with $a < b$.
(3) $\bar{p}$ is an inverse Lyndon word.
Given a word $w$ which is not an inverse Lyndon word, Proposition 11 suggests a method to identify the canonical pair $(p, \bar{p})$ associated with $w$: find the shortest nonempty prefix $z$ of $w$ which is not an inverse Lyndon word, and then a factorization $z = p\bar{p}$ such that conditions (2) and (3) in Proposition 11 are satisfied.
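The method suggested by Proposition 11 can be sketched in a few lines of Python. This is only an illustrative, quadratic-time sketch, not the algorithm of [10, 13]: the function names are hypothetical, words are plain Python strings, and the alphabet order is assumed to be the character order used by Python string comparison. It also assumes the usual notation in which the shortest non-inverse-Lyndon prefix is written as r a s r b with a < b, so that the first component is "ras" and the second is "rb".

```python
def is_inverse_lyndon(w: str) -> bool:
    """True if w is strictly greater than each of its proper nonempty suffixes."""
    return all(w > w[i:] for i in range(1, len(w)))


def shortest_bad_prefix(w: str):
    """Shortest nonempty prefix of w that is not inverse Lyndon, or None."""
    for n in range(1, len(w) + 1):
        if not is_inverse_lyndon(w[:n]):
            return w[:n]
    return None


def canonical_pair(w: str):
    """Hypothetical helper: split the shortest bad prefix z as (r a s, r b)."""
    z = shortest_bad_prefix(w)
    if z is None:
        raise ValueError("w is an inverse Lyndon word: no canonical pair")
    # Try candidate lengths k = |r| in increasing order (condition (2)):
    # r must be both a prefix of z and the word preceding the last letter b,
    # and the letter a following the first r must be smaller than b.
    for k in range(len(z) // 2):
        if z[:k] == z[-k - 1:-1] and z[k] < z[-1]:
            p, p_bar = z[:-k - 1], z[-k - 1:]
            # Condition (3) is checked rather than assumed.
            assert is_inverse_lyndon(p_bar)
            return p, p_bar
    raise ValueError("no split satisfying condition (2) was found")


# Illustrative input (an assumption of this sketch), with a < b < c < d.
print(canonical_pair("dabadabdabdadac"))  # ('daba', 'dabd')
```

For the illustrative input, the shortest prefix that is not an inverse Lyndon word is "dabadabd", which splits with r = "dab", a = "a", s empty and b = "d".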
The canonical inverse Lyndon factorization has also been defined recursively.
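Definition 13.
Let $w \in \Sigma^{+}$.
(Basis Step) If $w$ is an inverse Lyndon word, then $\mathrm{ICFL}(w) = (w)$.
(Recursive Step) If $w$ is not an inverse Lyndon word, let $(p, \bar{p})$ be the canonical pair associated with $w$ and let $v \in \Sigma^{*}$ be such that $w = pv$. Let $\mathrm{ICFL}(v) = (m'_1, \ldots, m'_k)$ and let $r, s \in \Sigma^{*}$, $a, b \in \Sigma$ be such that $p = ras$, $\bar{p} = rb$, with $a < b$. Then
\[
\mathrm{ICFL}(w) =
\begin{cases}
(p, \mathrm{ICFL}(v)) & \text{if } \bar{p} = rb \leq_p m'_1, \\
(p m'_1, m'_2, \ldots, m'_k) & \text{if } m'_1 \leq_p r,
\end{cases}
\]
where $\leq_p$ denotes the prefix order.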
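A direct, non-optimized transcription of this recursion into Python may help in following the definition. The sketch rests on two assumptions that should be checked against [10]: the canonical pair has the form (ras, rb) with a < b, and the recursive step keeps the first component as a factor on its own when "rb" is a prefix of the first factor of the factorization of the remaining suffix, and merges it with that first factor when the latter is a prefix of r. All function names are hypothetical, and the code runs in polynomial rather than linear time.

```python
def is_inverse_lyndon(w: str) -> bool:
    """True if w is strictly greater than each of its proper nonempty suffixes."""
    return all(w > w[i:] for i in range(1, len(w)))


def canonical_pair(w: str):
    """Split the shortest non-inverse-Lyndon prefix z of w as (r a s, r b)."""
    z = next(w[:n] for n in range(1, len(w) + 1)
             if not is_inverse_lyndon(w[:n]))
    for k in range(len(z) // 2):  # k = |r|, shortest candidate first
        if z[:k] == z[-k - 1:-1] and z[k] < z[-1]:
            return z[:-k - 1], z[-k - 1:]
    raise ValueError("no canonical pair found")


def icfl(w: str):
    """Recursive sketch of the canonical inverse Lyndon factorization."""
    # Basis step: an inverse Lyndon word is its own factorization.
    if is_inverse_lyndon(w):
        return [w]
    # Recursive step: factorize the suffix v that follows p.
    p, p_bar = canonical_pair(w)
    rest = icfl(w[len(p):])
    first, r = rest[0], p_bar[:-1]
    if first.startswith(p_bar):
        # Assumed case 1: rb is a prefix of the first factor, keep p alone.
        return [p] + rest
    if r.startswith(first):
        # Assumed case 2: the first factor is a prefix of r, merge it with p.
        return [p + first] + rest[1:]
    raise AssertionError("neither case of the assumed rule applies")


# Illustrative inputs (assumptions of this sketch), with a < b < c < d.
print(icfl("dabadabdabdadac"))  # ['daba', 'dabdab', 'dadac']
print(icfl("daadadb"))          # ['daada', 'db']
```

The first input only exercises the case where the first component is kept as a separate factor; the second one exercises the merging case, since the suffix "dadb" factorizes as ("da", "db") and "da" is a prefix of r = "da".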
The following example is in [10].
Example 7.
Let with , . We have and . Consider . We have .