-
Bit catastrophes for the Burrows-Wheeler Transform
Authors:
Sara Giuliani,
Shunsuke Inenaga,
Zsuzsanna Lipták,
Giuseppe Romana,
Marinella Sciortino,
Cristian Urbina
Abstract:
A bit catastrophe, loosely defined, is when a change in just one character of a string causes a significant change in the size of the compressed string. We study this phenomenon for the Burrows-Wheeler Transform (BWT), a string transform at the heart of several of the most popular compressors and aligners today. The parameter determining the size of the compressed data is the number of equal-lette…
▽ More
A bit catastrophe, loosely defined, is when a change in just one character of a string causes a significant change in the size of the compressed string. We study this phenomenon for the Burrows-Wheeler Transform (BWT), a string transform at the heart of several of the most popular compressors and aligners today. The parameter determining the size of the compressed data is the number of equal-letter runs of the BWT, commonly denoted $r$.
We exhibit infinite families of strings in which insertion, deletion, resp. substitution of one character increases $r$ from constant to $Θ(\log n)$, where $n$ is the length of the string. These strings can be interpreted both as examples for an increase by a multiplicative or an additive $Θ(\log n)$-factor. As regards multiplicative factor, they attain the upper bound given by Akagi, Funakoshi, and Inenaga [Inf & Comput. 2023] of $O(\log n \log r)$, since here $r=O(1)$.
We then give examples of strings in which insertion, deletion, resp. substitution of a character increases $r$ by a $Θ(\sqrt{n})$ additive factor. These strings significantly improve the best known lower bound for an additive factor of $Ω(\log n)$ [Giuliani et al., SOFSEM 2021].
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
Exploring Repetitiveness Measures for Two-Dimensional Strings
Authors:
Giuseppe Romana,
Marinella Sciortino,
Cristian Urbina
Abstract:
Detecting and measuring repetitiveness of strings is a problem that has been extensively studied in data compression and text indexing. However, when the data are structured in a non-linear way, like in the context of two-dimensional strings, inherent redundancy offers a rich source for compression, yet systematic studies on repetitiveness measures are still lacking. In the paper we introduce exte…
▽ More
Detecting and measuring repetitiveness of strings is a problem that has been extensively studied in data compression and text indexing. However, when the data are structured in a non-linear way, like in the context of two-dimensional strings, inherent redundancy offers a rich source for compression, yet systematic studies on repetitiveness measures are still lacking. In the paper we introduce extensions of repetitiveness measures to general two-dimensional strings. In particular, we propose a new extension of the measures $δ$ and $γ$, diverging from previous square based definitions proposed in [Carfagna and Manzini, SPIRE 2023]. We further consider generalizations of macro schemes and straight line programs for the 2D setting and show that, in contrast to what happens on strings, 2D macro schemes and 2D SLPs can be both asymptotically smaller than $δ$ and $γ$. The results of the paper can be easily extended to $d$-dimensional strings with $d > 2$.
△ Less
Submitted 10 April, 2024;
originally announced April 2024.
-
String attractors of some simple-Parry automatic sequences
Authors:
France Gheeraert,
Giuseppe Romana,
Manon Stipulanti
Abstract:
Firstly studied by Kempa and Prezza in 2018 as the cement of text compression algorithms, string attractors have become a compelling object of theoretical research within the community of combinatorics on words. In this context, they have been studied for several families of finite and infinite words. In this paper, we obtain string attractors of prefixes of particular infinite words generalizing…
▽ More
Firstly studied by Kempa and Prezza in 2018 as the cement of text compression algorithms, string attractors have become a compelling object of theoretical research within the community of combinatorics on words. In this context, they have been studied for several families of finite and infinite words. In this paper, we obtain string attractors of prefixes of particular infinite words generalizing k-bonacci words (including the famous Fibonacci word) and related to simple Parry numbers. In fact, our description involves the numeration systems classically derived from the considered morphisms. This extends our previous work published in the international conference WORDS 2023.
△ Less
Submitted 22 March, 2024; v1 submitted 27 February, 2023;
originally announced February 2023.
-
String Attractors and Infinite Words
Authors:
Antonio Restivo,
Giuseppe Romana,
Marinella Sciortino
Abstract:
The notion of string attractor has been introduced in [Kempa and Prezza, 2018] in the context of Data Compression and it represents a set of positions of a finite word in which all of its factors can be "attracted". The smallest size $γ^*$ of a string attractor for a finite word is a lower bound for several repetitiveness measures associated with the most common compression schemes, including BWT-…
▽ More
The notion of string attractor has been introduced in [Kempa and Prezza, 2018] in the context of Data Compression and it represents a set of positions of a finite word in which all of its factors can be "attracted". The smallest size $γ^*$ of a string attractor for a finite word is a lower bound for several repetitiveness measures associated with the most common compression schemes, including BWT-based and LZ-based compressors. The combinatorial properties of the measure $γ^*$ have been studied in [Mantaci et al., 2021]. Very recently, a complexity measure, called string attractor profile function, has been introduced for infinite words, by evaluating $γ^*$ on each prefix. Such a measure has been studied for automatic sequences and linearly recurrent infinite words [Schaeffer and Shallit, 2021]. In this paper, we study the relationship between such a complexity measure and other well-known combinatorial notions related to repetitiveness in the context of infinite words, such as the factor complexity and the recurrence. Furthermore, we introduce new string attractor-based complexity measures, in which the structure and the distribution of positions in a string attractor of the prefixes of infinite words are considered. We show that such measures provide a finer classification of some infinite families of words.
△ Less
Submitted 1 June, 2022;
originally announced June 2022.
-
Computing Maximal Unique Matches with the r-index
Authors:
Sara Giuliani,
Giuseppe Romana,
Massimiliano Rossi
Abstract:
In recent years, pangenomes received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven themselves to be useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, s…
▽ More
In recent years, pangenomes received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven themselves to be useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the $r$-index that is a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the $r$-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion. In this paper, we show how to augment Boucher et al.'s approach to enable the computation of MUMs on the $r$-index, while preserving the space and time bounds. We add additional $O(r)$ samples of the longest common prefix (LCP) array, where $r$ is the number of equal-letter runs of the BWT, that permits the computation of the second longest match of the pattern suffix with respect to the input text, which in turn allows the computation of candidate MUMs. We implemented a proof-of-concept of our approach, that we call mum-phinder, and tested on real-world datasets. We compared our approach with competing methods that are able to compute MUMs. We observe that our method is up to 8 times smaller, while up to 19 times slower when the dataset is not highly repetitive, while on highly repetitive data, our method is up to 6.5 times slower and uses up to 25 times less memory.
△ Less
Submitted 3 May, 2022;
originally announced May 2022.
-
Logarithmic equal-letter runs for BWT of purely morphic words
Authors:
Andrea Frosini,
Ilaria Mancini,
Simone Rinaldi,
Giuseppe Romana,
Marinella Sciortino
Abstract:
In this paper we study the number $r_{bwt}$ of equal-letter runs produced by the Burrows-Wheeler transform ($BWT$) when it is applied to purely morphic finite words, which are words generated by iterating prolongable morphisms. Such a parameter $r_{bwt}$ is very significant since it provides a measure of the performances of the $BWT$, in terms of both compressibility and indexing. In particular, w…
▽ More
In this paper we study the number $r_{bwt}$ of equal-letter runs produced by the Burrows-Wheeler transform ($BWT$) when it is applied to purely morphic finite words, which are words generated by iterating prolongable morphisms. Such a parameter $r_{bwt}$ is very significant since it provides a measure of the performances of the $BWT$, in terms of both compressibility and indexing. In particular, we prove that, when $BWT$ is applied to any purely morphic finite word on a binary alphabet, $r_{bwt}$ is $\mathcal{O}(\log n)$, where $n$ is the length of the word. Moreover, we prove that $r_{bwt}$ is $Θ(\log n)$ for the binary words generated by a large class of prolongable binary morphisms. These bounds are proved by providing some new structural properties of the \emph{bispecial circular factors} of such words.
△ Less
Submitted 5 February, 2022;
originally announced February 2022.
-
String Attractors and Combinatorics on Words
Authors:
Sabrina Mantaci,
Antonio Restivo,
Giuseppe Romana,
Giovanna Rosone,
Marinella Sciortino
Abstract:
The notion of \emph{string attractor} has recently been introduced in [Prezza, 2017] and studied in [Kempa and Prezza, 2018] to provide a unifying framework for known dictionary-based compressors. A string attractor for a word $w=w[1]w[2]\cdots w[n]$ is a subset $Γ$ of the positions $\{1,\ldots,n\}$, such that all distinct factors of $w$ have an occurrence crossing at least one of the elements of…
▽ More
The notion of \emph{string attractor} has recently been introduced in [Prezza, 2017] and studied in [Kempa and Prezza, 2018] to provide a unifying framework for known dictionary-based compressors. A string attractor for a word $w=w[1]w[2]\cdots w[n]$ is a subset $Γ$ of the positions $\{1,\ldots,n\}$, such that all distinct factors of $w$ have an occurrence crossing at least one of the elements of $Γ$. While finding the smallest string attractor for a word is a NP-complete problem, it has been proved in [Kempa and Prezza, 2018] that dictionary compressors can be interpreted as algorithms approximating the smallest string attractor for a given word.
In this paper we explore the notion of string attractor from a combinatorial point of view, by focusing on several families of finite words. The results presented in the paper suggest that the notion of string attractor can be used to define new tools to investigate combinatorial properties of the words.
△ Less
Submitted 10 July, 2019;
originally announced July 2019.