-
Odds for an enlightened rather than barren future
Authors:
David Haussler
Abstract:
We are at a stage in our evolution where we do not yet know if we will ever communicate with intelligent beings that have evolved on other planets, yet we are intelligent and curious enough to wonder about this. We find ourselves wondering about this at the very beginning of a long era in which stellar luminosity warms many planets, and by our best models, continues to provide equally good opportunities for intelligent life to evolve. By simple Bayesian reasoning, if, as we believe, intelligent life forms have the same propensity to evolve later on other planets as we had to evolve on ours, it follows that they will likely not pass through a similar wondering stage in their evolution. This suggests that the future holds some kind of interstellar communication that will serve to inform newly evolved intelligent life forms that they are not alone before they become curious.
Submitted 19 August, 2016;
originally announced August 2016.
-
Chromosome-scale shotgun assembly using an in vitro method for long-range linkage
Authors:
Nicholas H. Putnam,
Brendan O'Connell,
Jonathan C. Stites,
Brandon J. Rice,
Andrew Fields,
Paul D. Hartley,
Charles W. Sugnet,
David Haussler,
Daniel S. Rokhsar,
Richard E. Green
Abstract:
Long-range and highly accurate de novo assembly from short-read data is one of the most pressing challenges in genomics. Recently, it has been shown that read pairs generated by proximity ligation of DNA in chromatin of living tissue can address this problem. These data dramatically increase the scaffold contiguity of assemblies and provide haplotype phasing information. Here, we describe a simpler approach ("Chicago") based on in vitro reconstituted chromatin. We generated two Chicago datasets with human DNA and used a new software pipeline ("HiRise") to construct a highly accurate de novo assembly and scaffolding of a human genome with scaffold N50 of 30 Mb. We also demonstrated the utility of Chicago for improving existing assemblies by re-assembling and scaffolding the genome of the American alligator. With a single library and one lane of Illumina HiSeq sequencing, we increased the scaffold N50 of the American alligator from 508 kb to 10 Mb. Our method uses established molecular biology procedures and can be used to analyze any genome, as it requires only about 5 micrograms of DNA as the starting material.
Submitted 18 February, 2015;
originally announced February 2015.
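The scaffold N50 values quoted in the abstract above (30 Mb for the human assembly, and 508 kb rising to 10 Mb for the American alligator) use the standard contiguity statistic: the largest length L such that scaffolds of length at least L together contain at least half of the assembled bases. A minimal Python sketch of that definition follows. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from HiRise.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches half of the
    total assembled length.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding on large genomes.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy lengths, not real data.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because a few long scaffolds can hold half of the bases, N50 can rise sharply when scaffolding joins existing pieces, even though the total assembled length barely changes.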
-
Retrotransposon mobilization in cancer genomes
Authors:
Tracy Ballinger,
Adam D. Ewing,
David Haussler
Abstract:
The Cancer Genome Atlas project was initiated by the National Cancer Institute in order to characterize the genomes of hundreds of tumors of various cancer types. While much effort has been put into detecting somatic genomic variation in these data, somatic structural variation induced by the activity of transposable element insertions has not been reported. Transposable elements (TEs) are particularly relevant in cancer in part because of several known cases in which a TE insertion is directly linked to cancer formation and studies linking the epigenetic status of retrotransposons to carcinogenesis and patient outcome. Additionally, evidence for somatic retrotransposition in eukaryotic genomes suggests that some tissues and therefore some cancer types may be disposed to increased retrotransposition. We built upon previous work to develop a highly efficient computational pipeline for the detection of non-reference mobile element insertions from high-throughput paired-end whole genome sequencing data that is capable of detecting breakpoints through a local assembly strategy. Using this, we analyzed 33 whole genome tumor datasets with paired normal samples from TCGA across 3 different cancer types: glioblastoma multiforme (GBM), ovarian serous cystadenocarcinoma (OV) and colorectal adenocarcinoma (COAD). We detected 72 insertions in colon samples, almost all of them LINE-1 elements, and none in GBM or OV. The amount of somatic retrotransposition varies widely between samples, with 61 insertions present in one case. The lack of somatic retrotransposon insertions in GBM and OV samples suggests that TE activity in cancer is restricted to certain cancer types.
Submitted 18 January, 2015;
originally announced January 2015.
-
Canonical, Stable, General Mapping using Context Schemes
Authors:
Adam Novak,
Yohei Rosen,
David Haussler,
Benedict Paten
Abstract:
Motivation: Sequence mapping is the cornerstone of modern genomics. However, most existing sequence mapping algorithms are insufficiently general.
Results: We introduce context schemes: a method that allows the unambiguous recognition of a reference base in a query sequence by testing the query for substrings from an algorithmically defined set. Context schemes only map when there is a unique best mapping, and define this criterion uniformly for all reference bases. Mappings under context schemes can also be made stable, so that extension of the query string (e.g. by increasing read length) will not alter the mapping of previously mapped positions. Context schemes are general in several senses. They natively support the detection of arbitrary complex, novel rearrangements relative to the reference. They can scale over orders of magnitude in query sequence length. Finally, they are trivially extensible to more complex reference structures, such as graphs, that incorporate additional variation. We demonstrate empirically the existence of high performance context schemes, and present efficient context scheme mapping algorithms.
Availability and Implementation: The software test framework created for this work is available from https://registry.hub.docker.com/u/adamnovak/sequence-graphs/.
Contact: [email protected]
Supplementary Information: Six supplementary figures and one supplementary section are available with the online version of this article.
Submitted 11 June, 2015; v1 submitted 16 January, 2015;
originally announced January 2015.
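The core idea in the abstract above, recognizing a reference base by testing the query for substrings from a defined set and mapping only when the result is unambiguous, can be illustrated with a deliberately simplified toy. The sketch below is not the authors' algorithm. It assumes fixed-length k-mers that occur exactly once in the reference as the "contexts", exact forward-strand matching only, and it makes no attempt at the stability property. The names `unique_kmer_index` and `map_query` are hypothetical.

```python
from collections import Counter


def unique_kmer_index(reference, k):
    """Index the k-mers that occur exactly once in the reference.

    Returns a dict from k-mer to its single start offset in the reference.
    """
    kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}


def map_query(reference, query, k):
    """Map each query position to a reference position, or None.

    Every query window that matches a unique reference k-mer votes for the
    reference position of each base it covers. A query base is mapped only
    if all of its votes agree; otherwise it is left unmapped.
    """
    index = unique_kmer_index(reference, k)
    votes = [set() for _ in query]
    for j in range(len(query) - k + 1):
        i = index.get(query[j:j + k])
        if i is None:
            continue  # window is absent from, or repeated in, the reference
        for d in range(k):
            votes[j + d].add(i + d)
    return [next(iter(v)) if len(v) == 1 else None for v in votes]


if __name__ == "__main__":
    # Toy sequences, not real data.
    print(map_query("ACGTTGCA", "GTTG", 3))
```

Leaving a base unmapped whenever its evidence is missing or contradictory mirrors the "only map when there is a unique best mapping" criterion, although the scheme in the paper defines that criterion more generally than this fixed-k toy does.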
-
Mapping to a Reference Genome Structure
Authors:
Benedict Paten,
Adam Novak,
David Haussler
Abstract:
To support comparative genomics, population genetics, and medical genetics, we propose that a reference genome should come with a scheme for mapping each base in any DNA string to a position in that reference genome. We refer to a collection of one or more reference genomes and a scheme for mapping to their positions as a reference structure. Here we describe the desirable properties of reference structures and give examples. To account for natural genetic variation, we consider the more general case in which a reference genome is represented by a graph rather than a set of phased chromosomes; the latter is treated as a special case.
Submitted 20 April, 2014;
originally announced April 2014.
-
RADIA: RNA and DNA Integrated Analysis for Somatic Mutation Detection
Authors:
Amie J. Radenbaugh,
Singer Ma,
Adam Ewing,
Joshua Stuart,
Eric Collisson,
Jingchun Zhu,
David Haussler
Abstract:
The detection of somatic single nucleotide variants is a crucial component of the characterization of the cancer genome. Mutation calling algorithms thus far have focused on comparing the normal and tumor genomes from the same individual. In recent years, it has become routine for projects like The Cancer Genome Atlas (TCGA) to also sequence the tumor RNA. Here we present RADIA (RNA and DNA Integrated Analysis), a method that combines the patient-matched normal and tumor DNA with the tumor RNA to detect somatic mutations. The inclusion of the RNA increases the power to detect somatic mutations, especially at low DNA allelic frequencies. By integrating the DNA and RNA, we are able to rescue calls that would be missed by traditional mutation calling algorithms that only examine the DNA.
RADIA was developed for the identification of somatic mutations using both DNA and RNA from the same individual. We demonstrate high sensitivity (84%) and very high specificity (98% and 99%) in real data from endometrial carcinoma and lung adenocarcinoma from TCGA. Mutations with both high DNA and RNA read support have the highest validation rate of over 99%. We also introduce a simulation package that spikes in artificial mutations to real data, rather than simulating sequencing data from a reference genome. We evaluate sensitivity on the simulation data and demonstrate our ability to rescue calls at low DNA allelic frequencies by including the RNA. Finally, we highlight mutations in important cancer genes that were rescued due to the incorporation of the RNA.
Software available at https://github.com/aradenbaugh/radia/
Submitted 4 February, 2014;
originally announced February 2014.
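The sensitivity and specificity figures quoted in the RADIA abstract above follow the usual confusion-matrix definitions: sensitivity is the fraction of true mutations that are called, and specificity is the fraction of non-mutated sites that are correctly left uncalled. The sketch below shows those definitions only. The counts are made-up round numbers chosen to reproduce 84% and 98%; they are not data from the paper, and the function names are hypothetical.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real mutations that the caller reports: TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of non-mutated sites left uncalled: TN / (TN + FP)."""
    return true_negatives / (true_negatives + false_positives)


if __name__ == "__main__":
    # Illustrative counts only (assumed, not from the RADIA evaluation).
    print(f"sensitivity = {sensitivity(84, 16):.2f}")
    print(f"specificity = {specificity(980, 20):.2f}")
```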
-
Comparative Assembly Hubs: Web Accessible Browsers for Comparative Genomics
Authors:
Ngan Nguyen,
Glenn Hickey,
Brian J. Raney,
Joel Armstrong,
Hiram Clawson,
Ann Zweig,
Jim Kent,
David Haussler,
Benedict Paten
Abstract:
We introduce a pipeline to easily generate collections of web accessible UCSC genome browsers interrelated by an alignment. Using the alignment, all annotations and the alignment itself can be efficiently viewed with reference to any genome in the collection, symmetrically. A new, intelligently scaled alignment display makes it simple to view all changes between the genomes at all levels of resolution, from substitutions to complex structural rearrangements, including duplications.
Submitted 5 November, 2013;
originally announced November 2013.
-
Representing and decomposing genomic structural variants as balanced integer flows on sequence graphs
Authors:
Daniel R. Zerbino,
Tracy Ballinger,
Benedict Paten,
Glenn Hickey,
David Haussler
Abstract:
The study of genomic variation has provided key insights into the functional role of mutations. Predominantly, studies have focused on single nucleotide variants (SNVs), which are relatively easy to detect and can be described with rich mathematical models. However, it has been observed that genomes are highly plastic, and that whole regions can be moved, removed or duplicated in bulk. These structural variants (SVs) have been shown to have significant impact on the phenotype, but their study has been held back by the combinatorial complexity of the underlying models. We describe here a general model of structural variation that encompasses both balanced rearrangements and arbitrary copy-number variants (CNVs). In this model, we show that the space of possible evolutionary histories that explain the structural differences between any two genomes can be sampled ergodically.
Submitted 3 September, 2015; v1 submitted 22 March, 2013;
originally announced March 2013.
-
A Unifying Model of Genome Evolution Under Parsimony
Authors:
Benedict Paten,
Daniel R. Zerbino,
Glenn Hickey,
David Haussler
Abstract:
We present a data structure called a history graph that offers a practical basis for the analysis of genome evolution. It conceptually simplifies the study of parsimonious evolutionary histories by representing both substitutions and double cut and join (DCJ) rearrangements in the presence of duplications. The problem of constructing parsimonious history graphs thus subsumes related maximum parsimony problems in the fields of phylogenetic reconstruction and genome rearrangement. We show that tractable functions can be used to define upper and lower bounds on the minimum number of substitutions and DCJ rearrangements needed to explain any history graph. These bounds become tight for a special type of unambiguous history graph called an ancestral variation graph (AVG), which constrains in its combinatorial structure the number of operations required. We finally demonstrate that for a given history graph $G$, a finite set of AVGs describes all parsimonious interpretations of $G$, and this set can be explored with a few sampling moves.
Submitted 12 May, 2014; v1 submitted 9 March, 2013;
originally announced March 2013.
-
Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species
Authors:
Keith R. Bradnam,
Joseph N. Fass,
Anton Alexandrov,
Paul Baranay,
Michael Bechner,
İnanç Birol,
Sébastien Boisvert,
Jarrod A. Chapman,
Guillaume Chapuis,
Rayan Chikhi,
Hamidreza Chitsaz,
Wen-Chi Chou,
Jacques Corbeil,
Cristian Del Fabbro,
T. Roderick Docking,
Richard Durbin,
Dent Earl,
Scott Emrich,
Pavel Fedotov,
Nuno A. Fonseca,
Ganeshkumar Ganapathy,
Richard A. Gibbs,
Sante Gnerre,
Élénie Godzaridis,
Steve Goldstein,
et al. (66 additional authors not shown)
Abstract:
Background - The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results - In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions - Many current genome assemblers produced useful assemblies, containing a significant representation of their genes, regulatory sequences, and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.
Submitted 27 June, 2013; v1 submitted 23 January, 2013;
originally announced January 2013.