Deep Learning for Reference-Free Geolocation for Poplar Trees

Authors: Cai W. John, Owen Queen, Wellington Muchero, Scott J. Emrich

Abstract: A core task in precision agriculture is the identification of climatic and ecological conditions that are advantageous for a given crop. The most succinct approach is geolocation, which is concerned with locating the native region of a given sample based on its genetic makeup. Here, we investigate genomic geolocation of Populus trichocarpa, or poplar, which has been identified by the US Department of Energy as a fast-rotation biofuel crop to be harvested nationwide. In particular, we approach geolocation from a reference-free perspective, circumventing the need for compute-intensive processes such as variant calling and alignment. Our model, MashNet, predicts latitude and longitude for poplar trees from randomly-sampled, unaligned sequence fragments. We show that our model performs comparably to Locator, a state-of-the-art method based on aligned whole-genome sequence data. MashNet achieves an error of 34.0 km^2 compared to Locator's 22.1 km^2. MashNet allows growers to quickly and efficiently identify natural varieties that will be most productive in their growth environment based on genotype. This paper explores geolocation for precision agriculture while providing a framework and data source for further development by the machine learning community.

Submitted 30 January, 2023; originally announced January 2023.

Comments: Accepted at NeurIPS 2022 AI for Science Workshop

arXiv:2105.07079 [pdf, ps, other]

Dynamic network analysis improves protein 3D structural classification

Authors: Khalique Newaz, Jacob Piland, Patricia L. Clark, Scott J. Emrich, Jun Li, Tijana Milenkovic

Abstract: Protein structural classification (PSC) is a supervised problem of assigning proteins into pre-defined structural (e.g., CATH or SCOPe) classes based on the proteins' sequence or 3D structural features. We recently proposed PSC approaches that model protein 3D structures as protein structure networks (PSNs) and analyze PSN-based protein features, which performed better than or comparable to state-of-the-art sequence or other 3D structure-based approaches in the task of PSC. However, existing PSN-based PSC approaches model the whole 3D structure of a protein as a static PSN. Because folding of a protein is a dynamic process, where some parts of a protein fold before others, modeling the 3D structure of a protein as a dynamic PSN might further help improve the existing PSC performance. Here, we propose for the first time a way to model 3D structures of proteins as dynamic PSNs, with the hypothesis that this will improve upon the current state-of-the-art PSC approaches that are based on static PSNs (and thus upon the existing state-of-the-art sequence and other 3D structural approaches). Indeed, we confirm this on 71 datasets spanning ~44,000 protein domains from CATH and SCOPe.

Submitted 14 May, 2021; originally announced May 2021.

arXiv:1910.02594 [pdf, ps, other]

Weighted graphlets and deep neural networks for protein structure classification

Authors: Hongyu Guo, Khalique Newaz, Scott Emrich, Tijana Milenkovic, Jun Li

Abstract: As proteins with similar structures often have similar functions, analysis of protein structures can help predict protein functions and is thus important. We consider the problem of protein structure classification, which computationally classifies the structures of proteins into pre-defined groups. We develop a weighted network that depicts the protein structures, and more importantly, we propose the first graphlet-based measure that applies to weighted networks. Further, we develop a deep neural network (DNN) composed of both convolutional and recurrent layers to use this measure for classification. Put together, our approach shows dramatic improvements in performance over existing graphlet-based approaches on 36 real datasets. Even comparing with the state-of-the-art approach, it almost halves the classification error. In addition to protein structure networks, our weighted-graphlet measure and DNN classifier can potentially be applied to classification of other weighted networks in computational biology as well as in other domains.

Submitted 6 October, 2019; originally announced October 2019.

arXiv:1907.03351 [pdf, ps, other]

Network analysis of synonymous codon usage

Authors: Khalique Newaz, Gabriel Wright, Jacob Piland, Jun Li, Patricia Clark, Scott Emrich, Tijana Milenkovic

Abstract: Most amino acids are encoded by multiple synonymous codons. For an amino acid, some of its synonymous codons are used much more rarely than others. Analyses of positions of such rare codons in protein sequences revealed that rare codons can impact co-translational protein folding and that positions of some rare codons are evolutionarily conserved. Analyses of positions of rare codons in proteins' 3-dimensional structures, which are richer in biochemical information than sequences alone, might further explain the role of rare codons in protein folding. We analyze a protein set recently annotated with codon usage information, considering non-redundant proteins with sufficient structural information. We model the proteins' structures as networks and study potential differences between network positions of amino acids encoded by evolutionarily conserved rare, evolutionarily non-conserved rare, and commonly used codons. In 84% of the proteins, at least one of the three codon categories occupies significantly more or less network-central positions than the other codon categories. Different protein groups showing different codon centrality trends (i.e., different types of relationships between network positions of the three codon categories) are enriched in different biological functions, implying the existence of a link between codon usage, protein folding, and protein function.

Submitted 7 July, 2019; originally announced July 2019.

arXiv:1605.07247 [pdf, ps, other]

Network approach integrates 3D structural and sequence data to improve protein structural comparison

Authors: Fazle E. Faisal, Julie L. Chaney, Khalique Newaz, Jun Li, Scott J. Emrich, Patricia L. Clark, Tijana Milenkovic

Abstract: Initial protein structural comparisons were sequence-based. Since amino acids that are distant in the sequence can be close in the 3-dimensional (3D) structure, 3D contact approaches can complement sequence approaches. Traditional 3D contact approaches study 3D structures directly. Instead, 3D structures can be modeled as protein structure networks (PSNs). Then, network approaches can compare proteins by comparing their PSNs. Network approaches may improve upon traditional 3D contact approaches. We cannot use existing PSN approaches to test this, because: 1) They rely on naive measures of network topology. 2) They are not robust to PSN size. They cannot integrate 3) multiple PSN measures or 4) PSN data with sequence data, although this could help because the different data types capture complementary biological knowledge. We address these limitations by: 1) exploiting well-established graphlet measures via a new network approach, 2) introducing normalized graphlet measures to remove the bias of PSN size, 3) allowing for integrating multiple PSN measures, and 4) using ordered graphlets to combine the complementary PSN data and sequence data. We compare both synthetic networks and real-world PSNs more accurately and faster than existing network, 3D contact, or sequence approaches. Our approach finds PSN patterns that may be biochemically interesting.

Submitted 27 February, 2017; v1 submitted 23 May, 2016; originally announced May 2016.

arXiv:1511.06754 [pdf, other]

Hot RAD: A Tool for Analysis of Next-Gen RAD Tag Data

Authors: Lauren A. Assour, Nicholas LaRosa, Scott J. Emrich

Abstract: Restriction site Associated DNA (RAD) tagging (also known as RAD-seq, etc.) is an emerging method for analyzing an organism's genome without completely sequencing it. This can be applied to a non-model organism without a reference genome, though this creates the problem of how to begin data analysis on unmapped and unannotated reads. Our program, Hot RAD, presents a straightforward and easy-to-use method to take raw Illumina data that has been RAD tagged and produce consensus contigs or sequence stacks using a distributed framework, creating a basis on which to begin analyzing an organism's DNA. The GUI (graphical user interface) element of our tool makes it easy for those not familiar with the command line to take raw sequence files and produce usable data in a timely manner.

Submitted 20 November, 2015; originally announced November 2015.

arXiv:1301.5406 [pdf]

doi 10.1186/2047-217X-2-10

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

Authors: Keith R. Bradnam, Joseph N. Fass, Anton Alexandrov, Paul Baranay, Michael Bechner, İnanç Birol, Sébastien Boisvert, Jarrod A. Chapman, Guillaume Chapuis, Rayan Chikhi, Hamidreza Chitsaz, Wen-Chi Chou, Jacques Corbeil, Cristian Del Fabbro, T. Roderick Docking, Richard Durbin, Dent Earl, Scott Emrich, Pavel Fedotov, Nuno A. Fonseca, Ganeshkumar Ganapathy, Richard A. Gibbs, Sante Gnerre, Élénie Godzaridis, Steve Goldstein, et al. (66 additional authors not shown)

Abstract: Background - The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results - In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions - Many current genome assemblers produced useful assemblies, containing a significant representation of their genes, regulatory sequences, and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.

Submitted 27 June, 2013; v1 submitted 23 January, 2013; originally announced January 2013.

Comments: Additional files available at http://korflab.ucdavis.edu/Datasets/Assemblathon/Assemblathon2/Additional_files/ Major changes: 1. Accessions for the 3 read data sets have now been included. 2. New file: spreadsheet containing details of all Study, Sample, Run, & Experiment identifiers. 3. Made miscellaneous changes to address reviewers' comments. DOIs added to GigaDB datasets.

Journal ref: GigaScience 2:10 (2013)

Showing 1–7 of 7 results for author: Emrich, S