-
A Distributed Approach for Persistent Homology Computation on a Large Scale
Authors:
Riccardo Ceccaroni,
Lorenzo Di Rocco,
Umberto Ferraro Petrillo,
Pierpaolo Brutti
Abstract:
Persistent homology (PH) is a powerful mathematical method to automatically extract relevant insights from images, such as those obtained by high-resolution imaging devices like electron microscopes or new-generation telescopes. However, the application of this method comes at a very high computational cost, which is bound to grow further as new imaging devices generate an ever-growing amount of data. In this paper we present PixHomology, a novel algorithm for efficiently computing $0$-dimensional PH on 2D images, optimizing memory and processing time. By leveraging the Apache Spark framework, we also present a distributed version of our algorithm with several optimized variants, able to concurrently process large batches of astronomical images. Finally, we present the results of an experimental analysis showing that our algorithm and its distributed version are efficient in terms of required memory, execution time, and scalability, consistently outperforming existing state-of-the-art PH computation tools when used to process large datasets.
Submitted 12 April, 2024;
originally announced April 2024.
-
The Power of Word-Frequency Based Alignment-Free Functions: a Comprehensive Large-scale Experimental Analysis -- Version 3
Authors:
Giuseppe Cattaneo,
Umberto Ferraro Petrillo,
Raffaele Giancarlo,
Francesco Palini,
Chiara Romualdi
Abstract:
Motivation: Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of false positive rate (Type I error). However, assessment of their power, i.e., their ability to identify true similarity, has been limited to some members of the D2 family by experimental studies on short sequences, not adequate for current applications, where sequence lengths may vary considerably. Such a State of the Art is methodologically problematic, since information regarding a key feature such as power is either missing or limited. Results: By concentrating on a representative set of word-frequency based AF functions, we perform the first coherent and uniform evaluation of the power, also involving Type I error for completeness. Two alternative models of important genomic features (CIS Regulatory Modules and Horizontal Gene Transfer), a wide range of sequence lengths from a few thousand to millions, and different values of k have been used. As a result, we provide a characterization of those AF functions that is novel and informative. Indeed, we identify weak and strong points of each function considered, which may be used as a guide to choose one for analysis tasks. Remarkably, of the fifteen functions that we have considered, only four stand out, with small differences between small and short sequence length scenarios. Finally, in order to encourage the use of our methodology for validation of future AF functions, the Big Data platform supporting it is public.
Submitted 19 October, 2021; v1 submitted 27 June, 2021;
originally announced June 2021.
-
FASTA/Q Data Compressors for MapReduce-Hadoop Genomics: Space and Time Savings Made Easy -- Version 1
Authors:
Umberto Ferraro Petrillo,
Francesco Palini,
Giuseppe Cattaneo,
Raffaele Giancarlo
Abstract:
Motivation: Storage of genomic data is a major cost for the Life Sciences, effectively addressed mostly via specialized data compression methods. For the same reasons of abundance in data production, the use of Big Data technologies is seen as the future for genomic data storage and processing, with MapReduce-Hadoop as leaders. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop. Indeed, their deployment there is not exactly immediate. Such a State of the Art is problematic. Results: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in major cost savings, i.e., on large plant genomes, 30% fewer HDFS data blocks (one block = 128 MB), a speed-up of at least 1.5x in I/O time, and comparable or reduced network communication time with respect to the use of generic compressors available in Hadoop. Finally, we observe that these results also hold for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System.
Submitted 27 July, 2020;
originally announced July 2020.
-
Alignment-free Genomic Analysis via a Big Data Spark Platform
Authors:
Umberto Ferraro Petrillo,
Francesco Palini,
Giuseppe Cattaneo,
Raffaele Giancarlo
Abstract:
Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to two and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent Literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in Computational Biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for Alignment-free genomic analysis. It natively supports eighteen of the best performing AF functions coming out of a recent hallmark benchmarking study. FADE development and potential impact comprise novel aspects of interest. Namely, (a) a considerable effort in distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. Regarding the latter, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend the ones of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.
Submitted 23 October, 2021; v1 submitted 2 May, 2020;
originally announced May 2020.
-
Analyzing Big Datasets of Genomic Sequences: Fast and Scalable Collection of k-mer Statistics
Authors:
Umberto Ferraro Petrillo,
Mara Sorella,
Giuseppe Cattaneo,
Raffaele Giancarlo,
Simona Rombo
Abstract:
Distributed approaches based on the map-reduce programming paradigm have started to be proposed in the bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of map-reduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial in order to achieve good performance, especially on very large amounts of data. We choose k-mer counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collections of biological sequences, with arbitrary values of k. One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among the ones based on Big Data technologies, while exhibiting very good scalability. We provide evidence that the usage of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account for the algorithm design and implementation.
Submitted 4 July, 2018;
originally announced July 2018.
-
Using HTML5 to Prevent Detection of Drive-by-Download Web Malware
Authors:
Alfredo De Santis,
Giancarlo De Maio,
Umberto Ferraro Petrillo
Abstract:
The web has experienced explosive growth in recent years. New technologies are introduced at a very fast pace with the aim of narrowing the gap between web-based applications and traditional desktop applications. The results are web applications that look and feel almost like desktop applications while retaining the advantages of originating from the web. However, these advancements come at a price. The same technologies used to build responsive, pleasant and fully-featured web applications can also be used to write web malware able to escape detection systems. In this article we present new obfuscation techniques, based on some of the features of the upcoming HTML5 standard, which can be used to deceive malware detection systems. The proposed techniques have been tested on a reference set of obfuscated malware. Our results show that the malware rewritten using our obfuscation techniques goes undetected while being analyzed by a large number of detection systems. The same detection systems were able to correctly identify the same malware in its original unobfuscated form. We also provide some hints about how the existing malware detection systems can be modified in order to cope with these new techniques.
Submitted 13 July, 2015;
originally announced July 2015.