Search | arXiv e-print repository

arXiv:2309.00946 [pdf, other]

From Specific to Generic Learned Sorted Set Dictionaries: A Theoretically Sound Paradigm Yelding Competitive Data Structural Boosters in Practice

Authors: Domenico Amato, Giosué Lo Bosco, Raffaele Giancarlo

Abstract: This research concerns Learned Data Structures, a recent area that has emerged at the crossroad of Machine Learning and Classic Data Structures. It is methodologically important and has a high practical impact. We focus on Learned Indexes, i.e., Learned Sorted Set Dictionaries. The proposals available so far are specific in the sense that they can boost, indeed impressively, the time performance of Table Search Procedures with a sorted layout only, e.g., Binary Search. We propose a novel paradigm that, complementing known specialized ones, can produce Learned versions of any Sorted Set Dictionary, for instance, Balanced Binary Search Trees or Binary Search on layouts other than sorted, i.e., Eytzinger. Theoretically, based on it, we obtain several results of interest, such as (a) the first Learned Optimum Binary Search Forest, with mean access time bounded by the Entropy of the probability distribution of the accesses to the Dictionary; (b) the first Learned Sorted Set Dictionary that, in the Dynamic Case and in an amortized analysis setting, matches the same time bounds known for Classic Dictionaries. The latter holds under widely accepted assumptions regarding the size of the Universe. The experimental part, somewhat complex in terms of software development, clearly indicates the nonobvious finding that the generalization we propose can yield effective and competitive Learned Data Structural Boosters, even with respect to specific benchmark models.

Submitted 2 September, 2023; originally announced September 2023.

ACM Class: E.1; I.2; H.2

arXiv:2305.05551 [pdf, other]

Digital Transformation in the Public Administrations: a Guided Tour For Computer Scientists

Authors: Paolo Ciancarini, Raffaele Giancarlo, Gennaro Grimaudo

Abstract: Digital Transformation (DT) is the process of integrating digital technologies and solutions into the activities of an organization, whether public or private. This paper focuses on the DT of public sector organizations, where the targets of innovative digital solutions are either the citizens or the administrative bodies or both. This paper is a guided tour for Computer Scientists, as the digital transformation of the public sector involves more than just the use of technology. While technological innovation is a crucial component of any digital transformation, it is not sufficient on its own. Instead, DT requires a cultural, organizational, and technological shift in the way public sector organizations operate and relate to their users, creating the capabilities within the organization to take full advantage of any opportunity in the fastest, best, and most innovative manner in the ways they operate and relate to the citizens. Our tutorial is based on the results of a survey that we performed as an analysis of scientific literature available in some digital libraries well known to Computer Scientists. This tutorial led us to identify four key pillars that sustain a successful DT: (open) data, ICT technologies, digital skills of citizens and public administrators, and agile processes for developing new digital services and products. The tutorial discusses the interaction of these pillars and highlights the importance of data as the first and foremost pillar of any DT.
We have developed a conceptual map in the form of a graph model to show some basic relationships among these pillars. We discuss the relationships among the four pillars aiming at avoiding the potential negative bias that may arise from a rendering of DT restricted to technology only. We also provide illustrative examples and highlight relevant trends emerging from the current state of the art.

Submitted 10 May, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: 30 pages, 3 figures

arXiv:2212.03067 [pdf, other]

Pareto Optimal Compression of Genomic Dictionaries, with or without Random Access in Main Memory

Authors: Raffaele Giancarlo, Gennaro Grimaudo

Abstract: Motivation: A Genomic Dictionary, i.e., the set of the k-mers appearing in a genome, is a fundamental source of genomic information: its collection is the first step in strategic computational methods ranging from assembly to sequence comparison and phylogeny. Unfortunately, it is costly to store. This motivates some recent studies regarding the compression of those k-mer sets. However, such an area does not have the maturity of genomic compression, lacking a homogeneous and methodologically sound experimental foundation that allows a fair comparison of the relative merits of the available solutions, and that takes into account also the rich choices of compression methods that can be used. Results: We provide such a foundation here, supporting it with an extensive set of experiments that use reference datasets and a carefully selected set of representative data compressors. Our results highlight the spectrum of compressor choices one has in terms of Pareto Optimality of compression vs. post-processing, this latter being important when the Dictionary needs to be decompressed many times. In addition to the useful indications, not available elsewhere, that this study offers to the researchers interested in storing k-mer dictionaries in compressed form, a software system that can be readily used to explore the Pareto Optimal solutions available for a given Dictionary is also provided. Availability: The software system is available at https://github.com/GenGrim76/Pareto-Optimal-GDC, together with user manuals and installation instructions.
Contact: [email protected] Supplementary information: Additional data are available in the Supplementary Material.

Submitted 6 December, 2022; originally announced December 2022.

Comments: Main: 13 pages, 3 tables, 3 figures; Supplementary Material: 17 pages, 20 tables, 10 figures
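
Illustration of the definition used in the abstract above (a Genomic Dictionary is the set of the k-mers appearing in a genome). The Python sketch below collects the k-mers of a toy string and compresses their sorted listing with zlib. The function names, the newline-separated text layout, the toy input, and the choice of zlib are assumptions made for this example only; this is not the authors' software, zlib is not claimed to be one of the compressors evaluated in the paper, and canonical (reverse-complement) k-mers are deliberately ignored.

```python
import zlib


def kmer_dictionary(genome: str, k: int) -> set[str]:
    """Return the set of distinct length-k substrings (k-mers) of genome."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A string shorter than k yields an empty range, hence an empty set.
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def compressed_size(kmers: set[str]) -> tuple[int, int]:
    """Return (raw_bytes, compressed_bytes) for a sorted, newline-separated listing."""
    raw = "\n".join(sorted(kmers)).encode("ascii")
    return len(raw), len(zlib.compress(raw, 9))


if __name__ == "__main__":
    toy_genome = "ACGTACGTTGCA" * 50  # hypothetical toy input, not real data
    dictionary = kmer_dictionary(toy_genome, 5)
    raw, packed = compressed_size(dictionary)
    print(len(dictionary), raw, packed)
```

Sorting before compression is one simple design choice among many; the trade-off the abstract discusses (compression ratio versus post-processing time) would be explored by swapping the compressor and timing decompression as well.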

arXiv:2211.15565 [pdf, other]

A Critical Analysis of Classifier Selection in Learned Bloom Filters

Authors: Dario Malchiodi, Davide Raimondi, Giacomo Fumagalli, Raffaele Giancarlo, Marco Frasca

Abstract: Learned Bloom Filters, i.e., models induced from data via machine learning techniques and solving the approximate set membership problem, have recently been introduced with the aim of enhancing the performance of standard Bloom Filters, with special focus on space occupancy. Unlike in the classical case, the "complexity" of the data used to build the filter might heavily impact its performance. Therefore, here we propose the first in-depth analysis, to the best of our knowledge, for the performance assessment of a given Learned Bloom Filter, in conjunction with a given classifier, on a dataset of a given classification complexity. Indeed, we propose a novel methodology, supported by software, for designing, analyzing and implementing Learned Bloom Filters as a function of specific constraints on their multi-criteria nature (that is, constraints involving space efficiency, false positive rate, and reject time). Our experiments show that the proposed methodology and the supporting software are valid and useful: we find that only two classifiers have desirable properties in relation to problems with different data complexity, and, interestingly, none of them has been considered so far in the literature. We also experimentally show that the Sandwiched variant of Learned Bloom filters is the most robust to data complexity and classifier performance variability, as well as the one usually having smaller reject times. The software can be readily used to test new Learned Bloom Filter proposals, which can be compared with the best ones identified here.

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2205.05643 [pdf, other]

A New Class of String Transformations for Compressed Text Indexing

Authors: Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, Marinella Sciortino

Abstract: Introduced about thirty years ago in the field of Data Compression, the Burrows-Wheeler Transform (BWT) is a string transformation that, besides being a booster of the performance of memoryless compressors, plays a fundamental role in the design of efficient self-indexing compressed data structures. Finding other string transformations with the same remarkable properties of BWT has been a challenge for many researchers for a long time. Among the known BWT variants, the only one that has been recently shown to be a valid alternative to BWT is the Alternating BWT (ABWT), another invertible string transformation introduced about ten years ago in connection with a generalization of Lyndon words. In this paper, we introduce a whole class of new string transformations, called local orderings-based transformations, which have all the myriad virtues of BWT. We show that this new family is a special case of a much larger class of transformations, based on context adaptive alphabet orderings, that includes BWT and ABWT. Although all transformations support pattern search, we show that, in the general case, the transformations within our larger class may take quadratic time for inversion and pattern search. As a further result, we show that the local orderings-based transformations can be used for the construction of the recently introduced r-index, which makes them suitable also for highly repetitive collections.
In this context, we consider the problem of finding, for a given string, the BWT variant that minimizes the number of runs in the transformed string, and we provide an algorithm solving this problem in linear time.

Submitted 8 May, 2023; v1 submitted 11 May, 2022; originally announced May 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:1902.01280

arXiv:2203.14777 [pdf, other]

On the Suitability of Neural Networks as Building Blocks for The Design of Efficient Learned Indexes

Authors: Domenico Amato, Giosue' Lo Bosco, Raffaele Giancarlo

Abstract: With the aim of obtaining time/space improvements in classic Data Structures, an emerging trend is to combine Machine Learning techniques with the ones proper of Data Structures. This new area goes under the name of Learned Data Structures. The motivation for its study is a perceived change of paradigm in Computer Architectures that would favour the use of Graphics Processing Units and Tensor Processing Units over conventional Central Processing Units. In turn, that would favour the use of Neural Networks as building blocks of Classic Data Structures. Indeed, Learned Bloom Filters, which are one of the main pillars of Learned Data Structures, make extensive use of Neural Networks to improve the performance of classic Filters. However, no use of Neural Networks is reported in the realm of Learned Indexes, which is another main pillar of that new area. In this contribution, we provide the first, and much needed, comparative experimental analysis regarding the use of Neural Networks as building blocks of Learned Indexes. The results reported here highlight the need for the design of very specialized Neural Networks tailored to Learned Indexes and establish a solid ground for those developments. Our findings, methodologically important, are of interest to both Scientists and Engineers working in Neural Networks Design and Implementation, in view also of the importance of the application areas involved, e.g., Computer Networks and Data Bases.

Submitted 21 February, 2022; originally announced March 2022.

ACM Class: E.1; I.2; H.2

arXiv:2201.01554 [pdf, other]

doi 10.1002/spe.3150

Standard Vs Uniform Binary Search and Their Variants in Learned Static Indexing: The Case of the Searching on Sorted Data Benchmarking Software Platform

Authors: Domenico Amato, Giosuè Lo Bosco, Raffaele Giancarlo

Abstract: Learned Indexes are a novel approach to search in a sorted table. A model is used to predict an interval in which to search, and a Binary Search routine is used to finalize the search. They are quite effective. For the final stage, usually, the lower_bound routine of the Standard C++ library is used, although this is more of a natural choice rather than a requirement. However, recent studies that do not use Machine Learning predictions indicate that other implementations of Binary Search or variants, namely k-ary Search, are better suited to take advantage of the features offered by modern computer architectures. With the use of the Searching on Sorted Data (SOSD) Learned Indexing benchmarking software, we investigate how to choose a Search routine for the final stage of searching in a Learned Index. Our results provide indications that better choices than the lower_bound routine can be made. We also highlight how such a choice may be dependent on the computer architecture that is to be used. Overall, our findings provide new and much-needed guidelines for the selection of the Search routine within the Learned Indexing framework.

Submitted 8 July, 2022; v1 submitted 5 January, 2022; originally announced January 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2107.09480

ACM Class: E.1; I.2; H.2
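
The two-stage search described in the abstract above (a model predicts an interval, then a Binary Search routine finalizes the search inside it) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the model is a single least-squares line fitted on (key, position) pairs, the interval half-width is the maximum prediction error observed on the table, and Python's `bisect.bisect_left` stands in for the C++ `lower_bound` routine mentioned in the abstract. The function names are hypothetical, and none of this reproduces the SOSD models or the search routines benchmarked in the paper.

```python
import bisect


def build_model(keys):
    """Fit position ~ slope * key + intercept on a non-empty sorted table.

    Returns (slope, intercept, err), where err is the largest absolute
    difference between a key's true position and its rounded prediction.
    """
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2
    var = sum((x - mean_x) ** 2 for x in keys)
    if var == 0:  # all keys equal: fall back to a constant model
        slope, intercept = 0.0, mean_y
    else:
        cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(keys))
        slope = cov / var
        intercept = mean_y - slope * mean_x
    err = max(abs(i - round(slope * x + intercept)) for i, x in enumerate(keys))
    return slope, intercept, err


def learned_lower_bound(keys, model, query):
    """Return the first index i with keys[i] >= query (len(keys) if none)."""
    slope, intercept, err = model
    n = len(keys)
    pred = round(slope * query + intercept)
    # Stage 1: the model narrows the search to [pred - err, pred + err + 1].
    lo = max(0, min(n, pred - err))
    hi = max(0, min(n, pred + err + 1))
    # Stage 2: ordinary binary search restricted to that interval.
    return bisect.bisect_left(keys, query, lo, hi)


if __name__ == "__main__":
    table = [3, 8, 15, 16, 23, 42, 57, 91, 120, 133]  # toy data
    model = build_model(table)
    print(learned_lower_bound(table, model, 42))
```

Because the fitted slope is non-negative on sorted keys, predictions are monotone in the query, which is why the error bound measured on the table keys also covers queries that are absent from it. Swapping `bisect.bisect_left` for another routine (for instance a k-ary search) changes only stage 2, which is the design choice the paper investigates.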

arXiv:2112.06563 [pdf, other]

On the Choice of General Purpose Classifiers in Learned Bloom Filters: An Initial Analysis Within Basic Filters

Authors: Giacomo Fumagalli, Davide Raimondi, Raffaele Giancarlo, Dario Malchiodi, Marco Frasca

Abstract: Bloom Filters are a fundamental and pervasive data structure. Within the growing area of Learned Data Structures, several Learned versions of Bloom Filters have been considered, yielding advantages over classic Filters. Each of them uses a classifier, which is the Learned part of the data structure. Although it has a central role in those new filters, and its space footprint as well as classification time may affect the performance of the Learned Filter, no systematic study of which specific classifier to use in which circumstances is available. We report progress in this area here, providing also initial guidelines on which classifier to choose among five classic classification paradigms.

Submitted 13 December, 2021; originally announced December 2021.

Comments: ICPRAM 2022

MSC Class: 68T07 ACM Class: I.2.6

arXiv:2107.09480 [pdf, other]

Learned Sorted Table Search and Static Indexes in Small Model Space

Authors: Domenico Amato, Giosuè Lo Bosco, Raffaele Giancarlo

Abstract: Machine Learning Techniques, properly combined with Data Structures, have resulted in Learned Static Indexes, innovative and powerful tools that speed up Binary Search, with the use of additional space with respect to the table being searched. Such space is devoted to the Machine Learning Model. Although in their infancy, they are methodologically and practically important, due to the pervasiveness of Sorted Table Search procedures. In modern applications, model space is a key factor and, in fact, a major open question concerning this area is to assess to what extent one can enjoy the speed-up of Binary Search achieved by Learned Indexes while using constant or nearly constant space models. In this paper, we investigate the mentioned question by (a) introducing two new models, i.e., the Learned k-ary Search Model and the Synoptic Recursive Model Index, respectively; (b) systematically exploring the time-space trade-offs of a hierarchy of existing models, i.e., the ones in the reference software platform Searching on Sorted Data, together with the new ones proposed here. By adhering to and extending the current benchmarking methodology, we experimentally show that the Learned k-ary Search Model can speed up Binary Search in constant additional space. Our second model, together with the bi-criteria Piece-wise Geometric Model index, can achieve a speed-up of Binary Search with a model space of 0.05% more than the one taken by the table, being competitive in terms of time-space trade-off with existing proposals.
The Synoptic Recursive Model Index and the bi-criteria Piece-wise Geometric Model complement each other quite well across the various levels of the internal memory hierarchy. Finally, our findings stimulate research in this area, since they highlight the need for further studies regarding the time-space relation in Learned Indexes.

Submitted 17 September, 2022; v1 submitted 19 July, 2021; originally announced July 2021.

ACM Class: E.1; I.2; H.2

arXiv:2107.03341 [pdf, ps, other]

Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark

Authors: Ylenia Galluzzo, Raffaele Giancarlo, Mario Randazzo, Simona E. Rombo

Abstract: With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of "omics" data are collected daily and need to be processed. Indexing and compressing large sequence datasets are some of the most important tasks in this context. Here we propose algorithms for the computation of the Burrows Wheeler transform relying on Big Data technologies, i.e., Apache Spark and Hadoop. Our algorithms are the first ones that distribute the index computation and not only the input dataset, making it possible to benefit fully from the available cloud resources.

Submitted 7 July, 2021; originally announced July 2021.

Comments: 11 pages, 2 figures, 2 tables. arXiv admin note: substantial text overlap with arXiv:2007.10095

arXiv:2106.15531 [pdf, other]

The Power of Word-Frequency Based Alignment-Free Functions: a Comprehensive Large-scale Experimental Analysis -- Version 3

Authors: Giuseppe Cattaneo, Umberto Ferraro Petrillo, Raffaele Giancarlo, Francesco Palini, Chiara Romualdi

Abstract: Motivation: Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of false positive rate (Type I error). However, assessment of their power, i.e., their ability to identify true similarity, has been limited to some members of the D2 family by experimental studies on short sequences, not adequate for current applications, where sequence lengths may vary considerably. Such a State of the Art is methodologically problematic, since information regarding a key feature such as power is either missing or limited. Results: By concentrating on a representative set of word-frequency based AF functions, we perform the first coherent and uniform evaluation of the power, involving also Type I error for completeness. Two alternative models of important genomic features (CIS Regulatory Modules and Horizontal Gene Transfer), a wide range of sequence lengths from a few thousand to millions, and different values of k have been used. As a result, we provide a characterization of those AF functions that is novel and informative. Indeed, we identify weak and strong points of each function considered, which may be used as a guide to choose one for analysis tasks. Remarkably, of the fifteen functions that we have considered, only four stand out, with small differences between small and short sequence length scenarios.
Finally, in order to encourage the use of our methodology for validation of future AF functions, the Big Data platform supporting it is public.

Submitted 19 October, 2021; v1 submitted 27 June, 2021; originally announced June 2021.

arXiv:2007.13673 [pdf, other]

FASTA/Q Data Compressors for MapReduce-Hadoop Genomics:Space and Time Savings Made Easy -- Version 1

Authors: Umberto Ferraro Petrillo, Francesco Palini, Giuseppe Cattaneo, Raffaele Giancarlo

Abstract: Motivation: Storage of genomic data is a major cost for the Life Sciences, effectively addressed mostly via specialized data compression methods. For the same reasons of abundance in data production, the use of Big Data technologies is seen as the future for genomic data storage and processing, with MapReduce-Hadoop as leaders. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop. Indeed, their deployment there is not exactly immediate. Such a State of the Art is problematic. Results: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in major cost savings, i.e., on large plant genomes, 30% fewer HDFS data blocks (one block=128MB), speed-up of at least x1.5 in I/O time and comparable or reduced network communication time with respect to the use of generic compressors available in Hadoop. Finally, we observe that these results hold also for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System.

Submitted 27 July, 2020; originally announced July 2020.

arXiv:2007.10237 [pdf, other]

Learning from Data to Speed-up Sorted Table Search Procedures: Methodology and Practical Guidelines

Authors: Domenico Amato, Giosué Lo Bosco, Raffaele Giancarlo

Abstract: Sorted Table Search Procedures are the quintessential query-answering tool, with widespread usage that now includes also Web Applications, e.g., Search Engines (Google Chrome) and ad Bidding Systems (AppNexus). Speeding them up, at very little cost in space, is still a quite significant achievement. Here we study to what extent Machine Learning Techniques can contribute to obtaining such a speed-up via a systematic experimental comparison of known efficient implementations of Sorted Table Search procedures, with different Data Layouts, and their Learned counterparts developed here. We characterize the scenarios in which those latter can be profitably used with respect to the former, accounting for both CPU and GPU computing. Our approach contributes also to the study of Learned Data Structures, a recent proposal to improve the time/space performance of fundamental Data Structures, e.g., B-trees, Hash Tables, Bloom Filters. Indeed, we also formalize an Algorithmic Paradigm of Learned Dichotomic Sorted Table Search procedures that naturally complements the Learned one proposed here and that characterizes most of the known Sorted Table Search Procedures as having a "learning phase" that approximates Simple Linear Regression.

Submitted 30 July, 2020; v1 submitted 20 July, 2020; originally announced July 2020.

MSC Class: 68T07; 68P05; 62J05; 68P10 ACM Class: E.1; I.2.0

arXiv:2005.00942 [pdf, other]

doi 10.1093/bioinformatics/btab014

Alignment-free Genomic Analysis via a Big Data Spark Platform

Authors: Umberto Ferraro Petrillo, Francesco Palini, Giuseppe Cattaneo, Raffaele Giancarlo

Abstract: Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well established alternative to two and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent Literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in Computational Biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for Alignment-free genomic analysis. It supports natively eighteen of the best performing AF functions coming out of a recent hallmark benchmarking study. FADE development and potential impact comprise novel aspects of interest. Namely, (a) a considerable effort of distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. In this regard, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output.
Our findings naturally extend the ones of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.

Submitted 23 October, 2021; v1 submitted 2 May, 2020; originally announced May 2020.

Journal ref: Bioinformatics, Volume 37, Issue 12, 15 June 2021, Pages 1658-1665

arXiv:1907.02308 [pdf, ps, other]

The Alternating BWT: an algorithmic perspective

Authors: Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, Marinella Sciortino

Abstract: The Burrows-Wheeler Transform (BWT) is a word transformation introduced in 1994 for Data Compression. It has become a fundamental tool for designing self-indexing data structures, with important applications in several areas of science and engineering. The Alternating Burrows-Wheeler Transform (ABWT) is another transformation recently introduced in [Gessel et al. 2012] and studied in the field of Combinatorics on Words. It is analogous to the BWT, except that it uses an alternating lexicographical order instead of the usual one. Building on results in [Giancarlo et al. 2018], where we have shown that BWT and ABWT are part of a larger class of reversible transformations, here we provide a combinatorial and algorithmic study of the novel transform ABWT. We establish a deep analogy between BWT and ABWT by proving they are the only ones in the above mentioned class to be rank-invertible, a novel notion guaranteeing efficient invertibility. In addition, we show that the backward-search procedure can be efficiently generalized to the ABWT; this result implies that also the ABWT can be used as a basis for efficient compressed full text indices. Finally, we prove that the ABWT can be efficiently computed by using a combination of the Difference Cover suffix sorting algorithm [Kärkkäinen et al., 2006] with a linear time algorithm for finding the minimal cyclic rotation of a word with respect to the alternating lexicographical order.

Submitted 4 July, 2019; originally announced July 2019.
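
To make the difference between the two transforms in the abstract above concrete, here is a minimal Python sketch that computes both by naively sorting all cyclic rotations of a word. It is an illustration only, not the authors' algorithm: it takes quadratic space and makes no use of Difference Cover suffix sorting. The function names are hypothetical, and the convention that the first position is compared in the usual order and the second in reverse order is an assumption about the definition of alternating lexicographical order.

```python
def rotations(word):
    """Return all cyclic rotations of word."""
    return [word[i:] + word[:i] for i in range(len(word))]


def alternating_key(word):
    """Sort key for alternating lexicographic order on equal-length words.

    Assumption: characters at even 0-based positions compare in the usual
    order, characters at odd 0-based positions compare in reverse order.
    """
    return tuple(ord(c) if i % 2 == 0 else -ord(c) for i, c in enumerate(word))


def bwt(word):
    """Last column of the rotations sorted in standard lexicographic order."""
    return "".join(r[-1] for r in sorted(rotations(word)))


def abwt(word):
    """Last column of the rotations sorted in alternating lexicographic order."""
    return "".join(r[-1] for r in sorted(rotations(word), key=alternating_key))


if __name__ == "__main__":
    print(bwt("banana"), abwt("banana"))
```

Both functions return a permutation of the input's characters; only the order used to sort the rotations differs.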

arXiv:1902.01280 [pdf, other]

A New Class of Searchable and Provably Highly Compressible String Transformations

Authors: Raffaele Giancarlo, Giovanni Manzini, Giovanna Rosone, Marinella Sciortino

Abstract: The Burrows-Wheeler Transform is a string transformation that plays a fundamental role in the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation outside the domain of strings. However, efforts to find non-trivial alternatives to the original, now 25 years old, Burrows-Wheeler string transformation have met limited success. In this paper we bring new life to this area by introducing a whole new family of transformations that have all the myriad virtues of the BWT: they can be computed and inverted in linear time, they produce provably highly compressible strings, and they support linear time pattern search directly on the transformed string. This new family is a special case of a more general class of transformations based on context adaptive alphabet orderings, a concept introduced here. This more general class also includes the Alternating BWT, another invertible string transform recently introduced in connection with a generalization of Lyndon words.

Submitted 4 February, 2019; originally announced February 2019.

arXiv:1807.01566 [pdf, other]

Analyzing Big Datasets of Genomic Sequences: Fast and Scalable Collection of k-mer Statistics

Authors: Umberto Ferraro Petrillo, Mara Sorella, Giuseppe Cattaneo, Raffaele Giancarlo, Simona Rombo

Abstract: Distributed approaches based on the map-reduce programming paradigm have started to be proposed in the bioinformatics domain, due to the large amount of data produced by next-generation sequencing techniques. However, the use of map-reduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial in order to achieve good performance, especially on very large amounts of data. We choose k-mer counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collections of biological sequences, with arbitrary values of k. One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among the ones based on Big Data technologies, while exhibiting very good scalability.
We provide evidence that the usage of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account for the algorithm design and implementation.

Submitted 4 July, 2018; originally announced July 2018.
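
For readers unfamiliar with the case study named in the abstract above, the following is a minimal single-machine sketch of k-mer counting in Python. It is not FastKmer and does not use Spark or any workload balancing; the function name `count_kmers` and its parameters are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every length-k substring (k-mer) across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        # A sequence of length n contains n - k + 1 overlapping k-mers;
        # sequences shorter than k contribute none.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    for kmer, n in sorted(count_kmers(["ACGTAC", "GTAC"], 2).items()):
        print(kmer, n)
```

In a distributed setting the counting step is the easy part; the abstract's point is that aggregating these counts across cluster nodes without data skew is where careful engineering matters.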

arXiv:1205.6010 [pdf, ps, other]

The Chromatin Organization of an Eukaryotic Genome: Sequence Specific + Statistical = Combinatorial (Extended Abstract)

Authors: Davide Corona, Valeria Di Benedetto, Raffaele Giancarlo, Filippo Utro

Abstract: Nucleosome organization in eukaryotic genomes has a deep impact on gene function. Although progress has been recently made in the identification of various concurring factors influencing nucleosome positioning, it is still unclear whether nucleosome positions are sequence dictated or determined by a random process. It has been postulated for a long time that, in the proximity of TSS, a barrier determines the position of the +1 nucleosome and then geometric constraints alter the random positioning process determining nucleosomal phasing. Such a pattern fades out as one moves away from the barrier to become again a random positioning process. Although this statistical model is widely accepted, the molecular nature of the barrier is still unknown. Moreover, we are far from the identification of a set of sequence rules able: to account for the genome-wide nucleosome organization; to explain the nature of the barriers on which the statistical mechanism hinges; to allow for a smooth transition from sequence-dictated to statistical positioning and back.
We show that sequence complexity, quantified via various methods, can be the rule able to at least partially account for all the above. In particular, we have conducted our analyses on 4 high resolution nucleosomal maps of the model eukaryotes and found that nucleosome depleted regions can be well distinguished from nucleosome enriched regions by sequence complexity measures. Specifically, (a) the depleted regions are less complex than the enriched ones, (b) around TSS complexity measures alone are in striking agreement with in vivo nucleosome occupancy, precisely indicating the positions of the +1 and -1 nucleosomes. Those findings indicate that the intrinsic richness of subsequences within sequences plays a role in nucleosomal formation in genomes, and that sequence complexity constitutes the molecular nature of the nucleosome barrier.

Submitted 27 May, 2012; originally announced May 2012.

Comments: Work presented at the 8th SIBBM Seminar (Annual Conference Meeting of the Italian Biophysics and Molecular Biology Society), May 24-26, 2012, Palermo, Italy

arXiv:cs/0203018 [pdf, ps, other]

Improving Table Compression with Combinatorial Optimization

Authors: Adam L. Buchsbaum, Glenn S. Fowler, Raffaele Giancarlo

Abstract: We study the problem of compressing massive tables within the partition-training paradigm introduced by Buchsbaum et al. [SODA'00], in which a table is partitioned by an off-line training procedure into disjoint intervals of columns, each of which is compressed separately by a standard, on-line compressor like gzip. We provide a new theory that unifies previous experimental observations on partitioning and heuristic observations on column permutation, all of which are used to improve compression rates. Based on the theory, we devise the first on-line training algorithms for table compression, which can be applied to individual files, not just continuously operating sources; and also a new, off-line training algorithm, based on a link to the asymmetric traveling salesman problem, which improves on prior work by rearranging columns prior to partitioning. We demonstrate these results experimentally. On various test files, the on-line algorithms provide 35-55% improvement over gzip with negligible slowdown; the off-line reordering provides up to 20% further improvement over partitioning alone. We also show that a variation of the table compression problem is MAX-SNP hard.

Submitted 13 March, 2002; originally announced March 2002.

Comments: 22 pages, 2 figures, 5 tables, 23 references. Extended abstract appears in Proc. 13th ACM-SIAM SODA, pp. 213-222, 2002

ACM Class: E.4; F.1.3; F.2.2; G.2.1; H.1.1; H.1.8; H.2.7

Journal ref: JACM 50(6):825-851, 2003

Showing 1–19 of 19 results for author: Giancarlo, R