Search | arXiv e-print repository

Towards a more inductive world for drug repurposing approaches

Authors: Jesus de la Fuente, Guillermo Serrano, Uxía Veleiro, Mikel Casals, Laura Vera, Marija Pizurica, Antonio Pineda-Lucena, Idoia Ochoa, Silve Vicent, Olivier Gevaert, Mikel Hernaez

Abstract: Drug-target interaction (DTI) prediction is a challenging, albeit essential, task in drug repurposing. Learning on graph models has drawn special attention, as these models can significantly reduce drug repurposing costs and time commitment. However, many current approaches require highly demanding additional information besides DTIs, which complicates their evaluation process and usability. Additionally, structural differences in the learning architecture of current models hinder their fair benchmarking. In this work, we first perform an in-depth evaluation of current DTI datasets and prediction models through a robust benchmarking process, and show that DTI prediction methods based on transductive models lack generalization and lead to inflated performance when evaluated as previously done in the literature, hence not being suited for drug repurposing approaches. We then propose a novel biologically-driven strategy for negative edge subsampling and show through in vitro validation that newly discovered interactions are indeed true. We envision this work as the underpinning for future fair benchmarking and robust model design. All generated resources and tools are publicly available as a python package.

Submitted 24 November, 2023; v1 submitted 21 November, 2023; originally announced November 2023.

arXiv:2311.11991

Sweetwater: An interpretable and adaptive autoencoder for efficient tissue deconvolution

Authors: Jesus de la Fuente, Naroa Legarra, Guillermo Serrano, Irene Marin-Goni, Aintzane Diaz-Mazkiaran, Markel Benito Sendin, Ana Garcia Osta, Krishna R. Kalari, Carlos Fernandez-Granda, Idoia Ochoa, Mikel Hernaez

Abstract: Single-cell RNA-sequencing (scRNA-seq) stands as a powerful tool for deciphering cellular heterogeneity and exploring gene expression profiles at high resolution. However, its high cost renders it impractical for extensive sample cohorts within routine clinical care, hindering its broader applicability. Hence, many methodologies have recently arisen to estimate cell type proportions from bulk RNA-seq samples (known as deconvolution methods). However, they have several limitations. First, many depend on selecting a robust scRNA-seq reference dataset, which is often challenging. Second, building reliable pseudobulk samples requires determining the optimal number of genes or cells involved in the simulated data generation process, which has not been studied in depth. Moreover, pseudobulk and bulk RNA-seq samples often exhibit distribution shifts. Finally, most modern deconvolution approaches behave as a black box, and the underlying mechanisms of the deconvolution task are still unknown, which can compromise the reliability of the results. In this work, we present Sweetwater, an adaptive and interpretable autoencoder able to efficiently deconvolve bulk RNA-seq and microarray samples leveraging multiple classes of reference data, such as scRNA-seq and single-nuclei RNA-seq. Moreover, it can be trained on a mixture of FACS-sorted FASTQ files, which we newly propose to use as this reduces platform-specific biases and may potentially outperform single-cell-based references. Also, we demonstrate that Sweetwater effectively uncovers biologically meaningful patterns during the training process, increasing the reliability of the results. Sweetwater is available at https://github.com/ubioinformat/Sweetwater, and we anticipate it will facilitate and expedite the accurate examination of high-throughput clinical data across diverse applications.

Submitted 17 March, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

arXiv:2310.12804

Differentiable Vertex Fitting for Jet Flavour Tagging

Authors: Rachel E. C. Smith, Inês Ochoa, Rúben Inácio, Jonathan Shoemaker, Michael Kagan

Abstract: We propose a differentiable vertex fitting algorithm that can be used for secondary vertex fitting, and that can be seamlessly integrated into neural networks for jet flavour tagging. Vertex fitting is formulated as an optimization problem where gradients of the optimized solution vertex are defined through implicit differentiation and can be passed to upstream or downstream neural network components for network training. More broadly, this is an application of differentiable programming to integrate physics knowledge into neural network models in high energy physics. We demonstrate how differentiable secondary vertex fitting can be integrated into larger transformer-based models for flavour tagging and improve heavy flavour jet classification.

Submitted 19 October, 2023; originally announced October 2023.

Comments: 11 pages

arXiv:2211.03233

doi 10.3389/frai.2023.1268852

Fitting a Collider in a Quantum Computer: Tackling the Challenges of Quantum Machine Learning for Big Datasets

Authors: Miguel Caçador Peixoto, Nuno Filipe Castro, Miguel Crispim Romão, Maria Gabriela Jordão Oliveira, Inês Ochoa

Abstract: Current quantum systems have significant limitations affecting the processing of large datasets with high dimensionality, typical of high energy physics. In the present paper, feature and data prototype selection techniques were studied to tackle this challenge. A grid search was performed and quantum machine learning models were trained and benchmarked against classical shallow machine learning methods, trained on both the reduced and the complete datasets. The performance of the quantum algorithms was found to be comparable to the classical ones, even when using large datasets. Sequential Backward Selection and Principal Component Analysis techniques were used for feature selection, and while the former can produce better quantum machine learning models in specific cases, it is more unstable. Additionally, we show that such variability in the results is caused by the use of discrete variables, highlighting the suitability of Principal Component Analysis-transformed data for quantum machine learning applications in the high energy physics context.

Submitted 6 December, 2023; v1 submitted 6 November, 2022; originally announced November 2022.

Comments: Code available in https://github.com/mcpeixoto/QML-HEP

Journal ref: Front. Artif. Intell. 6 (2023) 1268852

arXiv:2203.07622

The International Linear Collider: Report to Snowmass 2021

Authors: Alexander Aryshev, Ties Behnke, Mikael Berggren, James Brau, Nathaniel Craig, Ayres Freitas, Frank Gaede, Spencer Gessner, Stefania Gori, Christophe Grojean, Sven Heinemeyer, Daniel Jeans, Katja Kruger, Benno List, Jenny List, Zhen Liu, Shinichiro Michizono, David W. Miller, Ian Moult, Hitoshi Murayama, Tatsuya Nakada, Emilio Nanni, Mihoko Nojiri, Hasan Padamsee, Maxim Perelstein, et al. (487 additional authors not shown)

Abstract: The International Linear Collider (ILC) is on the table now as a new global energy-frontier accelerator laboratory taking data in the 2030s. The ILC addresses key questions for our current understanding of particle physics. It is based on a proven accelerator technology. Its experiments will challenge the Standard Model of particle physics and will provide a new window to look beyond it. This document brings the story of the ILC up to date, emphasizing its strong physics motivation, its readiness for construction, and the opportunity it presents to the US and the global particle physics community.

Submitted 16 January, 2023; v1 submitted 14 March, 2022; originally announced March 2022.

Comments: 356 pages, Large pdf file (40 MB) submitted to Snowmass 2021; v2 references to Snowmass contributions added, additional authors; v3 references added, some updates, additional authors

Report number: DESY-22-045, IFT--UAM/CSIC--22-028, KEK Preprint 2021-61, PNNL-SA-160884, SLAC-PUB-17662

arXiv:2202.07384

doi 10.1088/1748-0221/17/05/P05024

The Phase-I Trigger Readout Electronics Upgrade of the ATLAS Liquid Argon Calorimeters

Authors: G. Aad, A. V. Akimov, K. Al Khoury, M. Aleksa, T. Andeen, C. Anelli, N. Aranzabal, C. Armijo, A. Bagulia, J. Ban, T. Barillari, F. Bellachia, M. Benoit, F. Bernon, A. Berthold, H. Bervas, D. Besin, A. Betti, Y. Bianga, M. Biaut, D. Boline, J. Boudreau, T. Bouedo, N. Braam, M. Cano Bret, et al. (173 additional authors not shown)

Abstract: The Phase-I trigger readout electronics upgrade of the ATLAS Liquid Argon calorimeters enhances the physics reach of the experiment during the upcoming operation at increasing Large Hadron Collider luminosities. The new system, installed during the second Large Hadron Collider Long Shutdown, increases the trigger readout granularity by up to a factor of ten as well as its precision and range. Consequently, the background rejection at trigger level is improved through enhanced filtering algorithms utilizing the additional information for topological discrimination of electromagnetic and hadronic shower shapes. This paper presents the final designs of the new electronic elements, their custom electronic devices, the procedures used to validate their proper functioning, and the performance achieved during the commissioning of this system.

Submitted 16 May, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

Comments: 56 pages, 41 figures, 6 tables

Journal ref: 2022 JINST 17 P05024

arXiv:2108.13451

doi 10.1007/JHEP04(2022)156

High-dimensional Anomaly Detection with Radiative Return in $e^{+}e^{-}$ Collisions

Authors: Julia Gonski, Jerry Lai, Benjamin Nachman, Inês Ochoa

Abstract: Experiments at a future $e^{+}e^{-}$ collider will be able to search for new particles with masses below the nominal centre-of-mass energy by analyzing collisions with initial-state radiation (radiative return). We show that machine learning methods that use imperfect or missing training labels can achieve sensitivity to generic new particle production in radiative return events. In addition to presenting an application of the classification without labels (CWoLa) search method in $e^{+}e^{-}$ collisions, our study combines weak supervision with variable-dimensional information by deploying a deep sets neural network architecture. We have also investigated some of the experimental aspects of anomaly detection in radiative return events and discuss these in the context of future detector design.

Submitted 8 February, 2022; v1 submitted 30 August, 2021; originally announced August 2021.

Comments: 24 pages, 13 figures

arXiv:2105.09274

doi 10.1088/1748-0221/16/08/P08012

Anomalous Jet Identification via Sequence Modeling

Authors: Alan Kahn, Julia Gonski, Inês Ochoa, Daniel Williams, Gustaaf Brooijmans

Abstract: This paper presents a novel method of searching for boosted hadronically decaying objects by treating them as anomalous elements of a contaminated dataset. A Variational Recurrent Neural Network (VRNN) is used to model jets as sequences of constituent four-vectors. After applying a pre-processing method which boosts each jet to the same reference mass and energy, the VRNN provides each jet an Anomaly Score that distinguishes between the structure of signal and background jets. The model is trained in an entirely unsupervised setting and without high level variables, making the score more robust against mass and $p_{T}$ correlations when compared to methods based primarily on jet substructure. Performance is evaluated on the jet level, as well as in an analysis context by searching for a heavy resonance with a final state of two boosted jets. The Anomaly Score shows consistent performance along a wide range of signal contamination amounts, for both two and three-pronged jet substructure hypotheses. Analysis results demonstrate that the use of Anomaly Score as a classifier enhances signal sensitivity while retaining a smoothly falling background jet mass distribution. The model's discriminatory performance resulting from an unsupervised training scenario opens up the possibility to train directly on data without a pre-defined signal hypothesis.

Submitted 8 July, 2021; v1 submitted 19 May, 2021; originally announced May 2021.

Comments: 22 pages, 14 figures

arXiv:2101.08320

doi 10.1088/1361-6633/ac36b9

The LHC Olympics 2020: A Community Challenge for Anomaly Detection in High Energy Physics

Authors: Gregor Kasieczka, Benjamin Nachman, David Shih, Oz Amram, Anders Andreassen, Kees Benkendorfer, Blaz Bortolato, Gustaaf Brooijmans, Florencia Canelli, Jack H. Collins, Biwei Dai, Felipe F. De Freitas, Barry M. Dillon, Ioan-Mihail Dinu, Zhongtian Dong, Julien Donini, Javier Duarte, D. A. Faroughy, Julia Gonski, Philip Harris, Alan Kahn, Jernej F. Kamenik, Charanjit K. Khosa, Patrick Komiske, Luc Le Pottier, et al. (22 additional authors not shown)

Abstract: A new paradigm for data-driven, model-agnostic new physics searches at colliders is emerging, and aims to leverage recent breakthroughs in anomaly detection and machine learning. In order to develop and benchmark new anomaly detection methods within this framework, it is essential to have standard datasets. To this end, we have created the LHC Olympics 2020, a community challenge accompanied by a set of simulated collider events. Participants in these Olympics have developed their methods using an R&D dataset and then tested them on black boxes: datasets with an unknown anomaly (or not). This paper will review the LHC Olympics 2020 challenge, including an overview of the competition, a description of methods deployed in the competition, lessons learned from the experience, and implications for data analyses with future datasets as well as future colliders.

Submitted 20 January, 2021; originally announced January 2021.

Comments: 108 pages, 53 figures, 3 tables

arXiv:1912.06093

doi 10.1088/1748-0221/15/04/P04012

Performance and Quality Control of a Radiation-Hard 12-bit 40 MSPS ADC for the ATLAS Liquid Argon Calorimeter Trigger Readout Electronics Phase-I Upgrade at the LHC

Authors: T. Andeen, J. Ban, G. Brooijmans, A. Emerman, P. Kinget, J. Kuppambatti, D. Mahon, I. Ochoa, W. Sippach, Q. Wang

Abstract: A radiation-hard quad-channel 12-bit 40 MSPS pipeline analog-to-digital converter (ADC) has been designed for the trigger readout electronics Phase-I upgrade of the ATLAS Liquid Argon calorimeter, at the CERN Large Hadron Collider. The final version of the custom design, fabricated in a commercial 130 nm CMOS process, is presented and found to meet the system requirements for analog performance and radiation tolerance. The procedure for quality control of 17200 ADCs is described and the results are presented.

Submitted 6 March, 2020; v1 submitted 12 December, 2019; originally announced December 2019.

Comments: 13 pages, 11 figures

arXiv:1911.03572

DZip: improved general-purpose lossless compression based on novel neural network modeling

Authors: Mohit Goyal, Kedar Tatwawadi, Shubham Chandak, Idoia Ochoa

Abstract: We consider lossless compression based on statistical data modeling followed by prediction-based encoding, where an accurate statistical model for the input data leads to substantial improvements in compression. We propose DZip, a general-purpose compressor for sequential data that exploits the well-known modeling capabilities of neural networks (NNs) for prediction, followed by arithmetic coding. DZip uses a novel hybrid architecture based on adaptive and semi-adaptive training. Unlike most NN-based compressors, DZip does not require additional training data and is not restricted to specific data types, only needing the alphabet size of the input data. The proposed compressor outperforms general-purpose compressors such as Gzip (on average 26% reduction) on a variety of real datasets, achieves near-optimal compression on synthetic datasets, and performs close to specialized compressors for large sequence lengths, without any human input. The main limitation of DZip in its current implementation is the encoding/decoding time, which limits its practicality. Nevertheless, the results showcase the potential of developing improved general-purpose compressors based on neural networks and hybrid modeling.

Submitted 18 September, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

Comments: Updated manuscript and an efficient implementation added

arXiv:1811.08162

DeepZip: Lossless Data Compression using Recurrent Neural Networks

Authors: Mohit Goyal, Kedar Tatwawadi, Shubham Chandak, Idoia Ochoa

Abstract: Sequential data is being generated at an unprecedented pace in various forms, including text and genomic data. This creates the need for efficient compression mechanisms to enable better storage, transmission and processing of such data. To solve this problem, many of the existing compressors attempt to learn models for the data and perform prediction-based compression. Since neural networks are known as universal function approximators with the capability to learn arbitrarily complex mappings, and in practice show excellent performance in prediction tasks, we explore and devise methods to compress sequential data using neural network predictors. We combine recurrent neural network predictors with an arithmetic coder and losslessly compress a variety of synthetic, text and genomic datasets. The proposed compressor outperforms Gzip on the real datasets and achieves near-optimal compression for the synthetic datasets. The results also help understand why and where neural networks are good alternatives for traditional finite context models.

Submitted 20 November, 2018; originally announced November 2018.

arXiv:1506.04973

doi 10.1140/epjc/s10052-015-3592-5

Boosted Higgs $\rightarrow b\bar{b}$ in vector-boson associated production at 14 TeV

Authors: Jonathan M. Butterworth, Inês Ochoa, Tim Scanlon

Abstract: The production of the Standard Model Higgs boson in association with a vector boson, followed by the dominant decay to $H \rightarrow b\bar{b}$, is a strong prospect for confirming and measuring the coupling to $b$-quarks in $pp$ collisions at $\sqrt{s}=14$ TeV. We present an updated study of the prospects for this analysis, focussing on the most sensitive highly Lorentz-boosted region. The evolution of the efficiency and composition of the signal and main background processes as a function of the transverse momentum of the vector boson are studied covering the region $200-1000$ GeV, comparing both a conventional dijet and jet substructure selection. The lower transverse momentum region ($200-400$ GeV) is identified as the most sensitive region for the Standard Model search, with higher transverse momentum regions not improving the statistical sensitivity. For much of the studied region ($200-600$ GeV), a conventional dijet selection performs as well as the substructure approach, while for the highest transverse momentum regions ($> 600$ GeV), which are particularly interesting for Beyond the Standard Model and high luminosity measurements, the jet substructure techniques are essential.

Submitted 18 June, 2015; v1 submitted 16 June, 2015; originally announced June 2015.

Comments: 13 pages. (Fixed figure layout error)

arXiv:1303.4558

doi 10.1007/s10686-013-9333-6

Design of a 7m Davies-Cotton Cherenkov telescope mount for the high energy section of the Cherenkov Telescope Array

Authors: A. C. Rovero, P. Ringegni, G. Vallejo, A. D. Supanitsky, M. Actis, A. Botani, I. Ochoa, G. Hughes

Abstract: The Cherenkov Telescope Array is the next generation ground-based observatory for the study of very-high-energy gamma-rays. It will provide an order of magnitude more sensitivity and greater angular resolution than present systems as well as an increased energy range (20 GeV to 300 TeV). For the high energy portion of this range, a relatively large area has to be covered by the array. For this, the construction of ~7 m diameter Cherenkov telescopes is an option under study. We have proposed an innovative design of a Davies-Cotton mount for such a telescope, within Cherenkov Telescope Array specifications, and evaluated its mechanical and optical performance. The mount is a reticulated-type structure with steel tubes and tensioned wires, designed in three main parts to be assembled on site. In this work we show the structural characteristics of the mount and the optical aberrations at the focal plane for three options of mirror facet size caused by mount deformations due to wind and gravity.

Submitted 19 March, 2013; originally announced March 2013.

Comments: To appear in Experimental Astronomy

arXiv:1207.5184

Lossy Compression of Quality Values via Rate Distortion Theory

Authors: Himanshu Asnani, Dinesh Bharadia, Mainak Chowdhury, Idoia Ochoa, Itai Sharon, Tsachy Weissman

Abstract: Motivation: Next Generation Sequencing technologies revolutionized many fields in biology by enabling the fast and cheap sequencing of large amounts of genomic data. The ever increasing sequencing capacities enabled by current sequencing machines hold a lot of promise as for the future applications of these technologies, but also create increasing computational challenges related to the analysis and storage of these data. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. Raw sequencing data consists of both the DNA sequences (reads) and per-base quality values that indicate the level of confidence in the readout of these sequences. Quality values account for about half of the required disk space in the commonly used FASTQ format and therefore their compression can significantly reduce storage requirements and speed up analysis and transmission of these data. Results: In this paper we present a framework for the lossy compression of the quality value sequences of genomic read files. Numerical experiments with reference based alignment using these quality values suggest that we can achieve significant compression with little compromise in performance for several downstream applications of interest, as is consistent with our theoretical analysis. Our framework also allows compression in a regime - below one bit per quality value - for which there are no existing compressors.

Submitted 21 July, 2012; originally announced July 2012.

Comments: 7 Pages, 8 Figures, Submitted to Bioinformatics

arXiv:1204.1912

doi 10.1109/ITW.2012.6404708

Reference Based Genome Compression

Authors: Bobbie Chern, Idoia Ochoa, Alexandros Manolakos, Albert No, Kartik Venkat, Tsachy Weissman

Abstract: DNA sequencing technology has advanced to a point where storage is becoming the central bottleneck in the acquisition and mining of more data. Large amounts of data are vital for genomics research, and generic compression tools, while viable, cannot offer the same savings as approaches tuned to inherent biological properties. We propose an algorithm to compress a target genome given a known reference genome. The proposed algorithm first generates a mapping from the reference to the target genome, and then compresses this mapping with an entropy coder. As an illustration of the performance: applying our algorithm to James Watson's genome with hg18 as a reference, we are able to reduce the 2991 megabyte (MB) genome down to 6.99 MB, while Gzip compresses it to 834.8 MB.

Submitted 9 April, 2012; originally announced April 2012.

Comments: 5 pages; Submitted to the IEEE Information Theory Workshop (ITW) 2012

Showing 1–16 of 16 results for author: Ochoa, I