-
skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements
Authors:
Xiaolei Brian Zhang,
Grace Oualline,
Jim Shaw,
Yun William Yu
Abstract:
Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health due to their capability to harbour antibiotic resistance genes. However, despite cheap and rapid whole genome sequencing, the varied nature of MGEs makes it difficult to fully characterize them, and existing methods for detecting MGEs often don't agree on what should count. In this manuscript, we first define and argue in favor of a divergence-based characterization of mobile genetic elements. Using that paradigm, we present skandiver, a tool designed to efficiently detect MGEs from whole genome assemblies without the need for gene annotation or markers. skandiver determines mobile elements via genome fragmentation, average nucleotide identity (ANI), and divergence time. By building on the scalable skani software for ANI computation, skandiver can query hundreds of complete assemblies against $>$65,000 representative genomes in a few minutes and 19 GB memory, providing a scalable and efficient method for elucidating mobile element profiles in incomplete, uncharacterized genomic sequences. For isolated and integrated large plasmids (>10 kbp), skandiver's recall was 48\% and 47\%, MobileElementFinder was 59\% and 17\%, and geNomad was 86\% and 32\%, respectively. For isolated large plasmids, skandiver's recall (48\%) is lower than state-of-the-art reference-based methods geNomad (86\%) and MobileElementFinder (59\%). However, skandiver achieves higher recall on integrated plasmids and, unlike other methods, without comparing against a curated database, making skandiver suitable for discovery of novel MGEs.
Availability: https://github.com/YoukaiFromAccounting/skandiver
Submitted 17 June, 2024;
originally announced June 2024.
-
Novel community data in ecology -- properties and prospects
Authors:
Florian Hartig,
Nerea Abrego,
Alex Bush,
Jonathan M. Chase,
Gurutzeta Guillera-Arroita,
Mathew A. Leibold,
Otso Ovaskainen,
Loïc Pellissier,
Maximilian Pichler,
Giovanni Poggiato,
Laura Pollock,
Sara Si-Moussi,
Wilfried Thuiller,
Duarte S. Viana,
David I. Warton,
Damaris Zurell,
Douglas W. Yu
Abstract:
New technologies for acquiring biological information, such as eDNA and acoustic or optical sensors, make it possible to generate spatial community observations at unprecedented scales. The potential of these novel community data to standardize community observations at high spatial, temporal, and taxonomic resolution and at large spatial scale ('many rows and many columns') has been widely discussed, but so far, there has been little integration of these data with ecological models and theory. Here, we review these developments and highlight emerging solutions, focusing on statistical methods for analyzing novel community data, in particular joint species distribution models; the new ecological questions that can be answered with these data; and the potential implications of these developments for policy and conservation.
Submitted 19 January, 2024;
originally announced January 2024.
-
Levenshtein Distance Embedding with Poisson Regression for DNA Storage
Authors:
Xiang Wei,
Alan J. X. Guo,
Sihan Sun,
Mengyi Wei,
Wei Yu
Abstract:
Efficient computation or approximation of Levenshtein distance, a widely-used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, the Poisson regression is introduced by assuming that the Levenshtein distance between sequences of fixed length follows a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log likelihood of the chi-squared distribution and offers advancements in removing the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.
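For readers unfamiliar with the quantities named in this abstract, the following is a minimal sketch of the exact Levenshtein distance (the target that an embedding would approximate) and of a generic Poisson negative log-likelihood. The function names are illustrative, the loss is the textbook Poisson form with the constant term dropped, and no neural network is shown; none of this is taken from the authors' implementation.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Exact edit distance (insertions, deletions, substitutions) by dynamic programming."""
    # Keep the shorter string in the inner loop so the row buffer stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ca == cb else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def poisson_nll(predicted_rate: float, observed_distance: int) -> float:
    """Poisson negative log-likelihood, omitting the log(k!) term that does not
    depend on the prediction. Hypothetical helper: predicted_rate stands in for
    whatever distance a model derives from two embedding vectors."""
    return predicted_rate - observed_distance * math.log(predicted_rate)
```

As a function of the predicted rate, this loss is minimized when the prediction equals the observed Levenshtein distance, which is the property a regression of this kind relies on.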
Submitted 13 December, 2023;
originally announced December 2023.
-
Does GNN Pretraining Help Molecular Representation?
Authors:
Ruoxi Sun,
Hanjun Dai,
Adams Wei Yu
Abstract:
Extracting informative representations of molecules using graph neural networks (GNNs) is crucial in AI-driven drug discovery. Recently, the graph research community has been trying to replicate the success of self-supervised pretraining in natural language processing, with several successes claimed. However, we find the benefit brought by self-supervised pretraining on small molecular data can be negligible in many cases. We conduct thorough ablation studies on the key components of GNN pretraining, including pretraining objectives, data splitting methods, input features, pretraining dataset scales, and GNN architectures, to see how they affect the accuracy of the downstream tasks. Our first important finding is that self-supervised graph pretraining does not always have statistically significant advantages over non-pretraining methods in many settings. Secondly, although noticeable improvement can be observed with additional supervised pretraining, the improvement may diminish with richer features or more balanced data splits. Thirdly, hyper-parameters could have larger impacts on the accuracy of downstream tasks than the choice of pretraining tasks, especially when the scales of downstream tasks are small. Finally, we offer our conjecture that the complexity of some pretraining methods on small molecules might be insufficient, followed by empirical evidence on different pretraining datasets.
Submitted 2 November, 2022; v1 submitted 13 July, 2022;
originally announced July 2022.
-
On minimizers and convolutional filters: theoretical connections and applications to genome analysis
Authors:
Yun William Yu
Abstract:
Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence.
Here, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step towards building a deep learning assembler, though it is at present too slow to be practical. In total, this manuscript provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.
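The minimizer scheme described at the start of this abstract (min-wise hashing over a rolling window, one k-mer per window) can be sketched as follows. This is a generic illustration, not the paper's code: the choice of BLAKE2b as the hash, the leftmost tie-break, and the function names are all assumptions made for the example.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer; BLAKE2b is an arbitrary choice of ordering."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(sequence: str, k: int, w: int) -> list:
    """Return (position, k-mer) pairs: the smallest-hash k-mer in every window
    of w consecutive k-mers, with consecutive duplicate picks collapsed."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    selected = []
    for start in range(len(kmers) - w + 1):
        # Ties on hash value are broken by taking the leftmost position.
        best = min(range(start, start + w), key=lambda i: (kmer_hash(kmers[i]), i))
        if not selected or selected[-1][0] != best:
            selected.append((best, kmers[best]))
    return selected
```

Calling `minimizers("ACGTACGTTGCAAGCT", k=3, w=4)` returns a subsample of the k-mers such that every window of four consecutive k-mers contains at least one selected position. Replacing `kmer_hash` with a different ordering changes which k-mers are selected, which is the degree of freedom the abstract relates to convolutional filters.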
Submitted 26 January, 2024; v1 submitted 9 November, 2021;
originally announced November 2021.
-
Estimating Cost Savings from Early Cancer Diagnosis
Authors:
Zura Kakushadze,
Rakesh Raghubanshi,
Willie Yu
Abstract:
We estimate treatment cost-savings from early cancer diagnosis. For breast, lung, prostate and colorectal cancers and melanoma, which account for more than 50% of new incidences projected in 2017, we combine published cancer treatment cost estimates by stage with incidence rates by stage at diagnosis. We extrapolate to other cancer sites by using estimated national expenditures and incidence rates. A rough estimate for the U.S. national annual treatment cost-savings from early cancer diagnosis is in 11 digits. Using this estimate and cost-neutrality, we also estimate a rough upper bound on the cost of a routine early cancer screening test.
Submitted 2 April, 2019; v1 submitted 30 August, 2017;
originally announced September 2017.
-
Mutation Clusters from Cancer Exome
Authors:
Zura Kakushadze,
Willie Yu
Abstract:
We apply our statistically deterministic machine learning/clustering algorithm *K-means (recently developed in https://ssrn.com/abstract=2908286) to 10,656 published exome samples for 32 cancer types. A majority of cancer types exhibit mutation clustering structure. Our results are in-sample stable. They are also out-of-sample stable when applied to 1,389 published genome samples across 14 cancer types. In contrast, we find in- and out-of-sample instabilities in cancer signatures extracted from exome samples via nonnegative matrix factorization (NMF), a computationally costly and non-deterministic method. Extracting stable mutation structures from exome data could have important implications for speed and cost, which are critical for early-stage cancer diagnostics such as novel blood-test methods currently in development.
Submitted 26 July, 2017;
originally announced July 2017.
-
*K-means and Cluster Models for Cancer Signatures
Authors:
Zura Kakushadze,
Willie Yu
Abstract:
We present the *K-means clustering algorithm and source code by expanding statistical clustering methods applied in https://ssrn.com/abstract=2802753 to quantitative finance. *K-means is statistically deterministic without specifying initial centers, etc. We apply *K-means to extracting cancer signatures from genome data without using nonnegative matrix factorization (NMF). *K-means' computational cost is a fraction of NMF's. Using 1,389 published samples for 14 cancer types, we find that 3 cancers (liver cancer, lung cancer and renal cell carcinoma) stand out and do not have cluster-like structures. Two clusters have especially high within-cluster correlations with 11 other cancers, indicating common underlying structures. Our approach opens a novel avenue for studying such structures. *K-means is universal and can be applied in other fields. We discuss some potential applications in quantitative finance.
Submitted 18 July, 2017; v1 submitted 2 March, 2017;
originally announced March 2017.
-
Controlling the joint local false discovery rate is more powerful than meta-analysis methods in joint analysis of summary statistics from multiple genome-wide association studies
Authors:
Wei Jiang,
Weichuan Yu
Abstract:
In genome-wide association studies (GWASs) of common diseases/traits, we often analyze multiple GWASs with the same phenotype together to discover associated genetic variants with higher power. Since it is difficult to access data with detailed individual measurements, summary-statistics-based meta-analysis methods have become popular to jointly analyze data sets from multiple GWASs. In this paper, we propose a novel summary-statistics-based joint analysis method based on controlling the joint local false discovery rate (Jlfdr). We prove that our method is the most powerful summary-statistics-based joint analysis method when controlling the false discovery rate at a certain level. In particular, the Jlfdr-based method achieves higher power than commonly used meta-analysis methods when analyzing heterogeneous data sets from multiple GWASs. Simulation experiments demonstrate the superior power of our method over meta-analysis methods. Also, our method discovers more associations than meta-analysis methods from empirical data sets of four phenotypes. The R-package is available at: http://bioinformatics.ust.hk/Jlfdr.html.
Submitted 28 May, 2016;
originally announced May 2016.
-
Factor Models for Cancer Signatures
Authors:
Zura Kakushadze,
Willie Yu
Abstract:
We present a novel method for extracting cancer signatures by applying statistical risk models (http://ssrn.com/abstract=2732453) from quantitative finance to cancer genome data. Using 1389 whole genome sequenced samples from 14 cancers, we identify an "overall" mode of somatic mutational noise. We give a prescription for factoring out this noise and source code for fixing the number of signatures. We apply nonnegative matrix factorization (NMF) to genome data aggregated by cancer subtype and filtered using our method. The resultant signatures have substantially lower variability than those from unfiltered data. Also, the computational cost of signature extraction is cut by about a factor of 10. We find 3 novel cancer signatures, including a liver cancer dominant signature (96% contribution) and a renal cell carcinoma signature (70% contribution). Our method accelerates finding new cancer signatures and improves their overall stability. Reciprocally, the methods for extracting cancer signatures could have interesting applications in quantitative finance.
Submitted 22 January, 2017; v1 submitted 29 April, 2016;
originally announced April 2016.
-
Approximation hardness of Shortest Common Superstring variants
Authors:
Y. William Yu
Abstract:
The shortest common superstring (SCS) problem has been studied at great length because of its connections to the de novo assembly problem in computational genomics. The base problem is APX-complete, but several generalizations of the problem have also been studied. In particular, previous results include that SCS with Negative strings (SCSN) is in Log-APX (though there is no known hardness result) and SCS with Wildcards (SCSW) is Poly-APX-hard. Here, we prove two new hardness results: (1) SCSN is Log-APX-hard (and therefore Log-APX-complete) by a reduction from Minimum Set Cover and (2) SCS with Negative strings and Wildcards (SCSNW) is NPOPB-hard by a reduction from Minimum Ones 3SAT.
Submitted 27 February, 2016;
originally announced February 2016.
-
Estimating Reproducibility in Genome-Wide Association Studies
Authors:
Wei Jiang,
Jing-Hao Xue,
Weichuan Yu
Abstract:
Genome-wide association studies (GWAS) are widely used to discover genetic variants associated with diseases. To control false positives, all findings from GWAS need to be verified with additional evidence, even for associations discovered in a high-power study. A replication study, which uses independent samples, is a common verification method. An association is regarded as a true positive with high confidence when it can be identified in both the primary study and the replication study. Currently, there is no systematic study of the behavior of positives in the replication study when the positive results of the primary study are treated as prior information.
In this paper, two probabilistic measures named Reproducibility Rate (RR) and False Irreproducibility Rate (FIR) are proposed to quantitatively describe the behavior of primary positive associations (i.e. positive associations identified in the primary study) in the replication study. RR is a conditional probability measuring how likely a primary positive association is to also be positive in the replication study. This can be used to guide the design of the replication study and to check the consistency between the results of the primary study and those of the replication study. FIR, in contrast, measures how likely a primary positive association is to still be a true positive even when it is negative in the replication study. This can be used to generate a list of potentially true associations in the irreproducible findings for further scrutiny. Estimation methods for these two measures are given. Simulation results and real experiments show that our estimation methods have high accuracy and good prediction performance.
Submitted 26 August, 2015;
originally announced August 2015.
-
Entropy-scaling search of massive biological data
Authors:
Y. William Yu,
Noah M. Daniels,
David Christian Danko,
Bonnie Berger
Abstract:
Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology.
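One common way to exploit a covering of the data by hyperspheres is a two-stage search: compare the query against cluster centers first, then examine only the members of clusters that could contain a hit. The sketch below illustrates that idea under stated assumptions; the greedy clustering, the Hamming metric, and all function names are chosen for the example and are not taken from the paper or its tools.

```python
from typing import Callable, List, Tuple

Metric = Callable[[str, str], int]
Cluster = Tuple[str, List[str]]  # (center, members)


def hamming(a: str, b: str) -> int:
    """Hamming distance between equal-length strings (a true metric)."""
    return sum(x != y for x, y in zip(a, b))


def build_clusters(points: List[str], radius: int, dist: Metric) -> List[Cluster]:
    """Greedy covering: each point joins the first center within `radius`,
    otherwise it becomes a new center."""
    clusters: List[Cluster] = []
    for point in points:
        for center, members in clusters:
            if dist(point, center) <= radius:
                members.append(point)
                break
        else:
            clusters.append((point, [point]))
    return clusters


def range_search(query: str, clusters: List[Cluster], cluster_radius: int,
                 search_radius: int, dist: Metric) -> List[str]:
    """Return all points within `search_radius` of the query.
    Coarse step: by the triangle inequality, a cluster can only contain a hit
    if its center lies within search_radius + cluster_radius of the query.
    Fine step: check the members of the surviving clusters exactly."""
    hits: List[str] = []
    for center, members in clusters:
        if dist(query, center) <= search_radius + cluster_radius:
            hits.extend(m for m in members if dist(query, m) <= search_radius)
    return hits
```

Because the coarse filter relies only on the triangle inequality, the result matches a brute-force scan whenever `dist` is a metric; the saving comes from skipping whole clusters, so the work depends on how many clusters there are and how many survive the coarse step.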
Submitted 21 September, 2015; v1 submitted 18 March, 2015;
originally announced March 2015.
-
Translocation of stiff polymers through a nanopore driven by binding particles
Authors:
Wancheng Yu,
Yiding Ma,
Kaifu Luo
Abstract:
We investigate the translocation of stiff polymers in the presence of binding particles through a nanopore by two-dimensional Langevin dynamics simulations. We find that the mean translocation time shows a minimum as a function of the binding energy $ε$ and the particle concentration $φ$, due to the interplay of the force from binding and the frictional force. In particular, for strong binding the translocation proceeds with a decreasing translocation velocity induced by a significant increase of the frictional force. In addition, both $ε$ and $φ$ have a notable impact on the distribution of the translocation time. With increasing $ε$ and $φ$, it undergoes a transition from an asymmetric and broad distribution under weak binding to a nearly Gaussian one under strong binding, and its width becomes gradually narrower.
Submitted 5 December, 2012;
originally announced December 2012.
-
Running PeptideProphet Separately on Replicates Improves Peptide Identification Results
Authors:
Chao Yang,
Zengyou He,
Weichuan Yu
Abstract:
Limited spectrum coverage is a problem in shotgun proteomics. Replicates are generated to improve the spectrum coverage. When integrating peptide identification results obtained from replicates, the state-of-the-art algorithm PeptideProphet combines Peptide-Spectrum Matches (PSMs) before building the statistical model to calculate peptide probabilities.
In this paper, we find the connection between merging results of replicates and Bagging, which is a standard routine to improve the power of statistical methods. Following Bagging's philosophy, we propose to run PeptideProphet separately on each replicate and combine the outputs to obtain the final peptide probabilities. In our experiments, we show that the proposed routine can improve PeptideProphet consistently on a standard protein dataset, a Human dataset and a Yeast dataset.
Submitted 2 December, 2012; v1 submitted 26 November, 2012;
originally announced November 2012.
-
A Combinatorial Perspective of the Protein Inference Problem
Authors:
Chao Yang,
Zengyou He,
Weichuan Yu
Abstract:
In a shotgun proteomics experiment, proteins are the most biologically meaningful output. The success of proteomics studies depends on the ability to accurately and efficiently identify proteins. Many methods have been proposed to facilitate the identification of proteins from the results of peptide identification. However, the relationship between protein identification and peptide identification has not been thoroughly explained before.
In this paper, we take a combinatorial perspective on the protein inference problem. We employ combinatorial mathematics to calculate the conditional protein probabilities (protein probability means the probability that a protein is correctly identified) under three assumptions, which lead to a lower bound, an upper bound and an empirical estimation of protein probabilities, respectively. The combinatorial perspective enables us to obtain a closed-form formulation for protein inference.
Based on our model, we study the impact of unique peptides and degenerate peptides on protein probabilities. Here, degenerate peptides are peptides shared by at least two proteins. Meanwhile, we also study the relationship of our model with other methods such as ProteinProphet. A probability confidence interval can be calculated and used together with probability to filter the protein identification result. Our method achieves competitive results with ProteinProphet in a more efficient manner in the experiment based on two datasets of standard protein mixtures and two datasets of real samples.
We name our program ProteinInfer. Its Java source code is available at http://bioinformatics.ust.hk/proteininfer
Submitted 28 November, 2012; v1 submitted 26 November, 2012;
originally announced November 2012.
-
SparseAssembler2: Sparse k-mer Graph for Memory Efficient Genome Assembly
Authors:
Chengxi Ye,
Charles H. Cannon,
Zhanshan Sam Ma,
Douglas W. Yu,
Mihai Pop
Abstract:
The formal version of our work has been published in BMC Bioinformatics and can be found here: http://www.biomedcentral.com/1471-2105/13/S6/S1 Motivation: To tackle the problem of huge memory usage associated with de Bruijn graph-based algorithms, upon which some of the most widely used de novo genome assemblers have been built, we released SparseAssembler1. SparseAssembler1 can save as much as 90% memory consumption in comparison with the state-of-the-art assemblers, but it requires rounds of denoising to accurately assemble genomes. In this paper, we introduce a new general model for genome assembly that uses only sparse k-mers. The new model replaces the idea of the de Bruijn graph from the beginning, and achieves similar memory efficiency and much better robustness compared with our previous SparseAssembler1. Results: We demonstrate that the decomposition of reads of all overlapping k-mers, which is used in existing de Bruijn graph genome assemblers, is overly cautious. We introduce a sparse k-mer graph structure for saving sparse k-mers, which greatly reduces memory space requirements necessary for de novo genome assembly. In contrast with the de Bruijn graph approach, we devise a simple but powerful strategy, i.e., finding links between the k-mers in the genome and traversing following the links, which can be done by saving only a few k-mers. To implement the strategy, we need to only select some k-mers that may not even be overlapping ones, and build the links between these k-mers indicated by the reads. We can traverse through this sparse k-mer graph to build the contigs, and ultimately complete the genome assembly. Since the new sparse k-mer graph shares almost all advantages of the de Bruijn graph, we are able to adapt a Dijkstra-like breadth-first search algorithm to circumvent sequencing errors and resolve polymorphisms.
Submitted 9 January, 2013; v1 submitted 17 August, 2011;
originally announced August 2011.
-
Chaperone-assisted translocation of a polymer through a nanopore
Authors:
Wancheng Yu,
Kaifu Luo
Abstract:
Using Langevin dynamics simulations, we investigate the dynamics of chaperone-assisted translocation of a flexible polymer through a nanopore. We find that increasing the binding energy $ε$ between the chaperone and the chain and the chaperone concentration $N_c$ can greatly improve the translocation probability. In particular, with increasing chaperone concentration a maximum translocation probability is observed for weak binding. For a fixed chaperone concentration, the histogram of translocation time $τ$ has a transition from a long-tailed distribution to a Gaussian distribution with increasing $ε$. $τ$ rapidly decreases and then almost saturates with increasing binding energy for short chains; however, it has a minimum for longer chains at lower chaperone concentration. We also show that $τ$ has a minimum as a function of the chaperone concentration. For different $ε$, a nonuniversal dependence of $τ$ on the chain length $N$ is also observed. These results can be interpreted by characteristic entropic effects for flexible polymers induced by either the crowding effect from high chaperone concentration or the intersegmental binding for high binding energy.
Submitted 2 August, 2011;
originally announced August 2011.
-
SparseAssembler: de novo Assembly with the Sparse de Bruijn Graph
Authors:
Chengxi Ye,
Zhanshan Sam Ma,
Charles H. Cannon,
Mihai Pop,
Douglas W. Yu
Abstract:
de Bruijn graph-based algorithms are one of the two most widely used approaches for de novo genome assembly. A major limitation of this approach is the large computational memory space requirement to construct the de Bruijn graph, which scales with k-mer length and total diversity (N) of unique k-mers in the genome expressed in base pairs or roughly (2k+8)N bits. This limitation is particularly important with large-scale genome analysis and for sequencing centers that simultaneously process multiple genomes. We present a sparse de Bruijn graph structure, based on which we developed SparseAssembler that greatly reduces memory space requirements. The structure also allows us to introduce a novel method for the removal of substitution errors introduced during sequencing. The sparse de Bruijn graph structure skips g intermediate k-mers, therefore reducing the theoretical memory space requirement to ~(2k/g+8)N. We have found that a practical value of g=16 consumes approximately 10% of the memory required by standard de Bruijn graph-based algorithms but yields comparable results. A high error rate could potentially derail the SparseAssembler. Therefore, we developed a sparse de Bruijn graph-based denoising algorithm that can remove more than 99% of substitution errors from datasets with a $\leq$ 2% error rate. Given that substitution error rates for the current generation of sequencers are lower than 1%, our denoising procedure is sufficiently effective to safeguard the performance of our algorithm. Finally, we also introduce a novel Dijkstra-like breadth-first search algorithm for the sparse de Bruijn graph structure to circumvent residual errors and resolve polymorphisms.
Submitted 14 June, 2011;
originally announced June 2011.
-
BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies
Authors:
Xiang Wan,
Can Yang,
Qiang Yang,
Hong Xue,
Xiaodan Fan,
Nelson L. S. Tang,
Weichuan Yu
Abstract:
Gene-gene interactions have long been recognized to be fundamentally important to understand genetic causes of complex disease traits. At present, identifying gene-gene interactions from genome-wide case-control studies is computationally and methodologically challenging. In this paper, we introduce a simple but powerful method, named `BOolean Operation based Screening and Testing' (BOOST). To discover unknown gene-gene interactions that underlie complex diseases, BOOST allows examining all pairwise interactions in genome-wide case-control studies in a remarkably fast manner. We have carried out interaction analyses on seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). Each analysis took less than 60 hours on a standard 3.0 GHz desktop with 4G memory running the Windows XP system. The interaction patterns identified from the type 1 diabetes data set display significant differences from those identified from the rheumatoid arthritis data set, while both data sets share a very similar hit region in the WTCCC report. BOOST has also identified many undiscovered interactions between genes in the major histocompatibility complex (MHC) region in the type 1 diabetes data set. In the coming era of large-scale interaction mapping in genome-wide case-control studies, our method can serve as a computationally and statistically useful tool.
Submitted 28 January, 2010;
originally announced January 2010.
-
Stable Feature Selection for Biomarker Discovery
Authors:
Zengyou He,
Weichuan Yu
Abstract:
Feature selection techniques have long been the workhorse of biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered. Only recently has this issue received increasing attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) providing an overview of this new yet fast-growing topic for convenient reference; (2) categorizing existing methods under an expandable framework for future research and development.
Submitted 6 January, 2010;
originally announced January 2010.
-
Electrokinetic behavior of two touching inhomogeneous biological cells and colloidal particles: Effects of multipolar interactions
Authors:
J. P. Huang,
Mikko Karttunen,
K. W. Yu,
L. Dong,
G. Q. Gu
Abstract:
We present a theory to investigate electro-kinetic behavior, namely, electrorotation and dielectrophoresis under alternating current (AC) applied fields for a pair of touching inhomogeneous colloidal particles and biological cells. These inhomogeneous particles are treated as graded ones with physically motivated model dielectric and conductivity profiles. The mutual polarization interaction between the particles yields a change in their respective dipole moments, and hence in the AC electrokinetic spectra. The multipolar interactions between polarized particles are accurately captured by the multiple images method. In the point-dipole limit, our theory reproduces the known results. We find that the multipolar interactions as well as the spatial fluctuations inside the particles can affect the AC electrokinetic spectra significantly.
Submitted 7 November, 2003; v1 submitted 11 June, 2003;
originally announced June 2003.
-
Dielectric behavior of oblate spheroidal particles: Application to erythrocytes suspensions
Authors:
J. P. Huang,
K. W. Yu
Abstract:
We have investigated the effect of particle shape on the electrorotation (ER) spectrum of living cell suspensions. In particular, we consider coated oblate spheroidal particles and present a theoretical study of ER based on the spectral representation theory. Analytic expressions for the characteristic frequency as well as the dispersion strength can be obtained, thus simplifying the fitting of experimental data on oblate spheroidal cells that abound in the literature. From the theoretical analysis, we find that the cell shape, coating, and material parameters can change the ER spectrum. We demonstrate good agreement between our theoretical predictions and experimental data on human erythrocyte suspensions.
Submitted 26 February, 2002;
originally announced February 2002.
-
Spectral Representation Theory for Dielectric Behavior of Nonspherical Cell Suspensions
Authors:
J. P. Huang,
K. W. Yu,
Jun Lei,
Hong Sun
Abstract:
Recent experiments revealed that the dielectric dispersion spectrum of fission yeast cells in a suspension was mainly composed of two sub-dispersions. The low-frequency sub-dispersion depended on the cell length, while the high-frequency one was independent of it. The cell shape effect was simulated by an ellipsoidal cell model but the comparison between theory and experiment was far from being satisfactory. Prompted by the discrepancy, we proposed the use of spectral representation to analyze more realistic cell models. We adopted a shell-spheroidal model to analyze the effects of the cell membrane. It is found that the dielectric property of the cell membrane has only a minor effect on the dispersion magnitude ratio and the characteristic frequency ratio. We further included the effect of rotation of dipole induced by an external electric field, and solved the dipole-rotation spheroidal model in the spectral representation. Good agreement between theory and experiment has been obtained.
Submitted 23 April, 2001;
originally announced April 2001.
-
Dielectric Behavior of Nonspherical Cell Suspensions
Authors:
Jun Lei,
Jones T. K. Wan,
K. W. Yu,
Hong Sun
Abstract:
Recent experiments revealed that the dielectric dispersion spectrum of fission yeast cells in a suspension was mainly composed of two sub-dispersions. The low-frequency sub-dispersion depended on the cell length, whereas the high-frequency one was independent of it. The cell shape effect was qualitatively simulated by an ellipsoidal cell model. However, the comparison between theory and experiment was far from being satisfactory. In an attempt to close the gap between theory and experiment, we considered the more realistic cells of spherocylinders, i.e., circular cylinders with hemispherical caps at both ends. We have formulated a Green function formalism for calculating the spectral representation of cells of finite length. The Green function can be reduced because of the azimuthal symmetry of the cell. This simplification enables us to calculate the dispersion spectrum and hence assess the effect of cell structure on the dielectric behavior of cell suspensions.
Submitted 23 March, 2001;
originally announced March 2001.