The Dawn of Open Access to Phylogenetic Data
Authors:
Andrew F. Magee,
Michael R. May,
Brian R. Moore
Abstract:
The scientific enterprise depends critically on the preservation of and open access to published data. This basic tenet applies acutely to phylogenies (estimates of evolutionary relationships among species). Increasingly, phylogenies are estimated from increasingly large, genome-scale datasets using increasingly complex statistical methods that require increasing levels of expertise and computatio…
▽ More
The scientific enterprise depends critically on the preservation of and open access to published data. This basic tenet applies acutely to phylogenies (estimates of evolutionary relationships among species). Increasingly, phylogenies are estimated from increasingly large, genome-scale datasets using increasingly complex statistical methods that require increasing levels of expertise and computational investment. Moreover, the resulting phylogenetic data provide an explicit historical perspective that critically informs research in a vast and growing number of scientific disciplines. One such use is the study of changes in rates of lineage diversification (speciation - extinction) through time. As part of a meta-analysis in this area, we sought to collect phylogenetic data (comprising nucleotide sequence alignment and tree files) from 217 studies published in 46 journals over a 13-year period. We document our attempts to procure those data (from online archives and by direct request to corresponding authors), and report results of analyses (using Bayesian logistic regression) to assess the impact of various factors on the success of our efforts. Overall, complete phylogenetic data for ~60% of these studies are effectively lost to science. Our study indicates that phylogenetic data are more likely to be deposited in online archives and/or shared upon request when: (1) the publishing journal has a strong data-sharing policy; (2) the publishing journal has a higher impact factor, and; (3) the data are requested from faculty rather than students. Although the situation appears dire, our analyses suggest that it is far from hopeless: recent initiatives by the scientific community -- including policy changes by journals and funding agencies -- are improving the state of affairs.
△ Less
Submitted 22 May, 2014;
originally announced May 2014.
Comparative Analysis of Tandem Repeats from Hundreds of Species Reveals Unique Insights into Centromere Evolution
Authors:
Daniël P. Melters,
Keith R. Bradnam,
Hugh A. Young,
Natalie Telis,
Michael R. May,
J. Graham Ruby,
Robert Sebra,
Paul Peluso,
John Eid,
David Rank,
José Fernando Garcia,
Joseph L. DeRisi,
Timothy Smith,
Christian Tobias,
Jeffrey Ross-Ibarra,
Ian F. Korf,
Simon W. -L. Chan
Abstract:
Centromeres are essential for chromosome segregation, yet their DNA sequences evolve rapidly. In most animals and plants that have been studied, centromeres contain megabase-scale arrays of tandem repeats. Despite their importance, very little is known about the degree to which centromere tandem repeats share common properties between different species across different phyla. We used bioinformatic…
▽ More
Centromeres are essential for chromosome segregation, yet their DNA sequences evolve rapidly. In most animals and plants that have been studied, centromeres contain megabase-scale arrays of tandem repeats. Despite their importance, very little is known about the degree to which centromere tandem repeats share common properties between different species across different phyla. We used bioinformatic methods to identify high-copy tandem repeats from 282 species using publicly available genomic sequence and our own data. The assumption that the most abundant tandem repeat is the centromere DNA was true for most species whose centromeres have been previously characterized, suggesting this is a general property of genomes. Our methods are compatible with all current sequencing technologies. Long Pacific Biosciences sequence reads allowed us to find tandem repeat monomers up to 1,419 bp. High-copy centromere tandem repeats were found in almost all animal and plant genomes, but repeat monomers were highly variable in sequence composition and in length. Furthermore, phylogenetic analysis of sequence homology showed little evidence of sequence conservation beyond ~50 million years of divergence. We find that despite an overall lack of sequence conservation, centromere tandem repeats from diverse species showed similar modes of evolution, including the appearance of higher order repeat structures in which several polymorphic monomers make up a larger repeating unit. While centromere position in most eukaryotes is epigenetically determined, our results indicate that tandem repeats are highly prevalent at centromeres of both animals and plants. This suggests a functional role for such repeats, perhaps in promoting concerted evolution of centromere DNA across chromosomes.
△ Less
Submitted 22 September, 2012;
originally announced September 2012.