Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species
Authors:
Keith R. Bradnam,
Joseph N. Fass,
Anton Alexandrov,
Paul Baranay,
Michael Bechner,
İnanç Birol,
Sébastien Boisvert,
Jarrod A. Chapman,
Guillaume Chapuis,
Rayan Chikhi,
Hamidreza Chitsaz,
Wen-Chi Chou,
Jacques Corbeil,
Cristian Del Fabbro,
T. Roderick Docking,
Richard Durbin,
Dent Earl,
Scott Emrich,
Pavel Fedotov,
Nuno A. Fonseca,
Ganeshkumar Ganapathy,
Richard A. Gibbs,
Sante Gnerre,
Élénie Godzaridis,
Steve Goldstein
, et al. (66 additional authors not shown)
Abstract:
Background - The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and…
▽ More
Background - The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results - In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions - Many current genome assemblers produced useful assemblies, containing a significant representation of their genes, regulatory sequences, and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.
△ Less
Submitted 27 June, 2013; v1 submitted 23 January, 2013;
originally announced January 2013.
Comparative Analysis of Tandem Repeats from Hundreds of Species Reveals Unique Insights into Centromere Evolution
Authors:
Daniël P. Melters,
Keith R. Bradnam,
Hugh A. Young,
Natalie Telis,
Michael R. May,
J. Graham Ruby,
Robert Sebra,
Paul Peluso,
John Eid,
David Rank,
José Fernando Garcia,
Joseph L. DeRisi,
Timothy Smith,
Christian Tobias,
Jeffrey Ross-Ibarra,
Ian F. Korf,
Simon W. -L. Chan
Abstract:
Centromeres are essential for chromosome segregation, yet their DNA sequences evolve rapidly. In most animals and plants that have been studied, centromeres contain megabase-scale arrays of tandem repeats. Despite their importance, very little is known about the degree to which centromere tandem repeats share common properties between different species across different phyla. We used bioinformatic…
▽ More
Centromeres are essential for chromosome segregation, yet their DNA sequences evolve rapidly. In most animals and plants that have been studied, centromeres contain megabase-scale arrays of tandem repeats. Despite their importance, very little is known about the degree to which centromere tandem repeats share common properties between different species across different phyla. We used bioinformatic methods to identify high-copy tandem repeats from 282 species using publicly available genomic sequence and our own data. The assumption that the most abundant tandem repeat is the centromere DNA was true for most species whose centromeres have been previously characterized, suggesting this is a general property of genomes. Our methods are compatible with all current sequencing technologies. Long Pacific Biosciences sequence reads allowed us to find tandem repeat monomers up to 1,419 bp. High-copy centromere tandem repeats were found in almost all animal and plant genomes, but repeat monomers were highly variable in sequence composition and in length. Furthermore, phylogenetic analysis of sequence homology showed little evidence of sequence conservation beyond ~50 million years of divergence. We find that despite an overall lack of sequence conservation, centromere tandem repeats from diverse species showed similar modes of evolution, including the appearance of higher order repeat structures in which several polymorphic monomers make up a larger repeating unit. While centromere position in most eukaryotes is epigenetically determined, our results indicate that tandem repeats are highly prevalent at centromeres of both animals and plants. This suggests a functional role for such repeats, perhaps in promoting concerted evolution of centromere DNA across chromosomes.
△ Less
Submitted 22 September, 2012;
originally announced September 2012.