-
The Power of Word-Frequency Based Alignment-Free Functions: a Comprehensive Large-scale Experimental Analysis -- Version 3
Authors:
Giuseppe Cattaneo,
Umberto Ferraro Petrillo,
Raffaele Giancarlo,
Francesco Palini,
Chiara Romualdi
Abstract:
Motivation: Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of the false positive rate (Type I error). However, assessment of their power, i.e., their ability to identify true similarity, has been limited to some members of the D2 family by experimental studies on short sequences, not adequate for current applications, where sequence lengths may vary considerably. Such a state of the art is methodologically problematic, since information regarding a key feature such as power is either missing or limited. Results: By concentrating on a representative set of word-frequency based AF functions, we perform the first coherent and uniform evaluation of the power, also involving Type I error for completeness. Two alternative models of important genomic features (CIS Regulatory Modules and Horizontal Gene Transfer), a wide range of sequence lengths from a few thousand to millions, and different values of k have been used. As a result, we provide a characterization of those AF functions that is novel and informative. Indeed, we identify weak and strong points of each function considered, which may be used as a guide to choose one for analysis tasks. Remarkably, of the fifteen functions that we have considered, only four stand out, with small differences between short and long sequence length scenarios. Finally, in order to encourage the use of our methodology for validation of future AF functions, the Big Data platform supporting it is public.
Submitted 19 October, 2021; v1 submitted 27 June, 2021;
originally announced June 2021.
-
Parallel Sandpiles or Spurious Bidirectional Icepiles?
Authors:
Gianpiero Cattaneo,
Luca Manzoni
Abstract:
In a recent paper E. Formenti and K. Perrot (FP) introduce a global rule assumed to describe the discrete time dynamics associated with a sandpile model under the parallel application of a suitable local rule acting on d-dimensional lattices of cells equipped with a uniform neighborhood. In this paper we submit this approach to a critical analysis, divided into two parts, in the simplest particular case of a one-dimensional lattice. In the first part we prove that the FP global rule does not describe the dynamics of standard sandpiles, but rather furnishes a description of the quite different situation of height difference between consecutive piles. This is a semantically incorrect difference of interpretation. In the second part we investigate the consequences of the incorrect FP assumption, proving that their global rule describes a bidirectional spurious dynamics of icepiles (rather than sandpiles), in the sense that the latter is the consequence of the application of three local rules: a bidirectional vertical rule, a bidirectional horizontal rule (typical of icepiles), and a granule jump from the bottom to the top (a spurious rule of the dynamics).
Submitted 16 May, 2021; v1 submitted 10 May, 2021;
originally announced May 2021.
-
On the Reliability of the PNU for Source Camera Identification Tasks
Authors:
Andrea Bruno,
Giuseppe Cattaneo,
Paola Capasso
Abstract:
The PNU is an essential and reliable tool to perform SCI and, over the years, has become a de facto standard for this task in the forensic field. In this paper, we show that, although strategies exist that aim to cancel, modify, or replace the PNU traces in a digital camera image, it is still possible, through our experimental method, to find residual traces of the noise produced by the sensor used to shoot the photo. Furthermore, we show that it is possible to inject the PNU of a different camera in a target image and trace it back to the source camera, but only under the condition that the new camera is of the same model as the original one used to take the target image. Both cameras must be available to us.
For completeness, we carried out two experiments and, rather than using the popular public reference dataset, CASIA TIDE, we preferred to introduce a dataset that does not present any kind of statistical artifact.
A preliminary experiment on a small dataset of smartphones showed that the injection of PNU from a different device makes it impossible to identify the source camera correctly.
For a second experiment, we built a large dataset of images taken with DSLRs of the same model. We extracted a denoised version of each image, injected each one with the RN of all the cameras in the dataset and compared all of them with an RP from each camera. The results of the experiments clearly show that, in both the denoised images and the injected ones, it is possible to find residual traces of the original camera PNU.
The combined results of the experiments show that, even if in theory it is possible to remove or replace the PNU from an image, this process can be easily detected and is possible only under some hard conditions, confirming the robustness of the PNU under this type of attack.
Submitted 28 August, 2020;
originally announced August 2020.
-
FASTA/Q Data Compressors for MapReduce-Hadoop Genomics: Space and Time Savings Made Easy -- Version 1
Authors:
Umberto Ferraro Petrillo,
Francesco Palini,
Giuseppe Cattaneo,
Raffaele Giancarlo
Abstract:
Motivation: Storage of genomic data is a major cost for the Life Sciences, effectively addressed mostly via specialized data compression methods. For the same reasons of abundance in data production, the use of Big Data technologies is seen as the future for genomic data storage and processing, with MapReduce-Hadoop as leaders. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop. Indeed, their deployment there is not exactly immediate. Such a state of the art is problematic. Results: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in major cost savings, i.e., on large plant genomes, 30% fewer HDFS data blocks (one block = 128 MB), a speed-up of at least 1.5x in I/O time and comparable or reduced network communication time with respect to the use of generic compressors available in Hadoop. Finally, we observe that these results also hold for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System.
Submitted 27 July, 2020;
originally announced July 2020.
-
Alignment-free Genomic Analysis via a Big Data Spark Platform
Authors:
Umberto Ferraro Petrillo,
Francesco Palini,
Giuseppe Cattaneo,
Raffaele Giancarlo
Abstract:
Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in Computational Biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best performing AF functions coming out of a recent hallmark benchmarking study. FADE development and potential impact comprise novel aspects of interest. Namely, (a) a considerable effort in distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. In this regard, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend the ones of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.
Submitted 23 October, 2021; v1 submitted 2 May, 2020;
originally announced May 2020.
-
Analyzing Big Datasets of Genomic Sequences: Fast and Scalable Collection of k-mer Statistics
Authors:
Umberto Ferraro Petrillo,
Mara Sorella,
Giuseppe Cattaneo,
Raffaele Giancarlo,
Simona Rombo
Abstract:
Distributed approaches based on the map-reduce programming paradigm have started to be proposed in the bioinformatics domain, due to the large amount of data produced by next-generation sequencing techniques. However, the use of map-reduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial in order to achieve good performance, especially on very large amounts of data. We choose k-mer counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collections of biological sequences, with arbitrary values of k. One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for a full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among the ones based on Big Data technologies, while exhibiting very good scalability. We provide evidence that the usage of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account for the algorithm design and implementation.
Submitted 4 July, 2018;
originally announced July 2018.
-
A Distance Between Populations for n-Points Crossover in Genetic Algorithms
Authors:
Mauro Castelli,
Gianpiero Cattaneo,
Luca Manzoni,
Leonardo Vanneschi
Abstract:
Genetic algorithms (GAs) are an optimization technique that has been successfully used on many real-world problems. There exist different approaches to their theoretical study. In this paper we complete a recently presented approach to model one-point crossover using pretopologies (or Cech topologies) in two ways. First, we extend it to the case of n-points crossover. Then, we experimentally study how the distance distribution changes when the number of crossover points increases.
Submitted 3 July, 2017;
originally announced July 2017.
-
A discussion about LNG Experiment: Irreversible or Reversible Generation of the OR Logic Gate?
Authors:
Gianpiero Cattaneo,
Roberto Leporini
Abstract:
In a recent paper M. Lopez-Suarez, I. Neri, and L. Gammaitoni (LNG) present a concrete realization of the Boolean OR irreversible gate, but, contrary to the standard Landauer principle, with an arbitrarily small dissipation of energy. A good Popperian falsification! In this paper we discuss a theoretical description of the LNG device which is indeed a 3in/3out self-reversible realization of the involved OR gate, satisfying in this way the Landauer principle of no dispersion of energy, contrary to the LNG conclusions. The different point of view is due to a different interpretation of the two outputs corresponding to the inputs 10 and 01, which are considered by LNG indistinguishable, so producing a non-reversible realization of the standard 2in/1out gate. On the contrary, always considering these two outputs indistinguishable, by a suitable normalization function of the cantilever angles, the experimental results obtained by the LNG device coincide with the OR connective obtained from the third output of the self-reversible 3in/3out CL gate by the Inputs-Ancilla->Garbage-Output procedure. Thus, by the self-reversibility, this realization is without dissipation of energy according to the Landauer principle. Furthermore, using the self-reversible Toffoli gate it is possible to obtain from the LNG device the realization of the connective AND by adopting another normalization function on the cantilever angles. Finally, by other suitable normalization procedures on cantilever angles it is possible to obtain also the other logic connectives NOR and NAND and, in a more sophisticated way, the XOR and NXOR connectives in a self-reversible way. All this leads to the introduction of a universal logic machine consisting of the LNG device plus a memory containing all the necessary angle normalization functions to produce in a self-reversible way, by choosing one of the latter, the logic connectives just listed.
Submitted 15 June, 2017; v1 submitted 15 May, 2017;
originally announced May 2017.
-
Unitary and anti-unitary quantum description of the classical not gate
Authors:
G. Cattaneo,
G. Conte,
R. Leporini
Abstract:
We consider the unitary and the anti-unitary operator realizations of two important genuine quantum gates that transform elements of the computational basis of into superpositions: the square root of the identity and the square root of the negation.
Submitted 4 March, 2015; v1 submitted 12 September, 2014;
originally announced September 2014.