Showing 1–25 of 25 results for author: Yu, W

Searching in archive q-bio.
  1. arXiv:2406.12064  [pdf, other]

    q-bio.GN

    skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements

    Authors: Xiaolei Brian Zhang, Grace Oualline, Jim Shaw, Yun William Yu

    Abstract: Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health due to their capability to harbour antibiotic resistance genes. However, despite cheap and rapid whole genome sequencing, the varied natu…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 9 pages, 6 figures

  2. Novel community data in ecology -- properties and prospects

    Authors: Florian Hartig, Nerea Abrego, Alex Bush, Jonathan M. Chase, Gurutzeta Guillera-Arroita, Mathew A. Leibold, Otso Ovaskainen, Loïc Pellissier, Maximilian Pichler, Giovanni Poggiato, Laura Pollock, Sara Si-Moussi, Wilfried Thuiller, Duarte S. Viana, David I. Warton, Damaris Zurell, Douglas W. Yu

    Abstract: New technologies for acquiring biological information such as eDNA, acoustic or optical sensors, make it possible to generate spatial community observations at unprecedented scales. The potential of these novel community data to standardize community observations at high spatial, temporal, and taxonomic resolution and at large spatial scale ('many rows and many columns') has been widely discussed,…

    Submitted 19 January, 2024; originally announced January 2024.

    Journal ref: Trends in Ecology & Evolution, 2024

  3. Levenshtein Distance Embedding with Poisson Regression for DNA Storage

    Authors: Xiang Wei, Alan J. X. Guo, Sihan Sun, Mengyi Wei, Wei Yu

    Abstract: Efficient computation or approximation of Levenshtein distance, a widely-used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural n…

    Submitted 13 December, 2023; originally announced December 2023.

    Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, (2024) 38(14), 15796-15804

  4. arXiv:2207.06010  [pdf, other]

    cs.LG q-bio.BM

    Does GNN Pretraining Help Molecular Representation?

    Authors: Ruoxi Sun, Hanjun Dai, Adams Wei Yu

    Abstract: Extracting informative representations of molecules using Graph neural networks (GNNs) is crucial in AI-driven drug discovery. Recently, the graph research community has been trying to replicate the success of self-supervised pretraining in natural language processing, with several successes claimed. However, we find the benefit brought by self-supervised pretraining on small molecular data can be…

    Submitted 2 November, 2022; v1 submitted 13 July, 2022; originally announced July 2022.

  5. arXiv:2111.08452  [pdf, other]

    cs.LG cs.AI q-bio.GN

    On minimizers and convolutional filters: theoretical connections and applications to genome analysis

    Authors: Yun William Yu

    Abstract: Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filt…

    Submitted 26 January, 2024; v1 submitted 9 November, 2021; originally announced November 2021.

    Comments: 14 pages, 4 figures, submitted to a journal

  6. arXiv:1709.01484  [pdf]

    q-bio.TO q-fin.GN

    Estimating Cost Savings from Early Cancer Diagnosis

    Authors: Zura Kakushadze, Rakesh Raghubanshi, Willie Yu

    Abstract: We estimate treatment cost-savings from early cancer diagnosis. For breast, lung, prostate and colorectal cancers and melanoma, which account for more than 50% of new incidences projected in 2017, we combine published cancer treatment cost estimates by stage with incidence rates by stage at diagnosis. We extrapolate to other cancer sites by using estimated national expenditures and incidence rates…

    Submitted 2 April, 2019; v1 submitted 30 August, 2017; originally announced September 2017.

    Comments: 22 pages; a trivial grammar typo corrected

    Journal ref: Data 2(3) (2017) 30

  7. arXiv:1707.08504  [pdf, other]

    q-bio.GN q-bio.QM q-fin.ST

    Mutation Clusters from Cancer Exome

    Authors: Zura Kakushadze, Willie Yu

    Abstract: We apply our statistically deterministic machine learning/clustering algorithm *K-means (recently developed in https://ssrn.com/abstract=2908286) to 10,656 published exome samples for 32 cancer types. A majority of cancer types exhibit mutation clustering structure. Our results are in-sample stable. They are also out-of-sample stable when applied to 1,389 published genome samples across 14 cancer…

    Submitted 26 July, 2017; originally announced July 2017.

    Comments: 84 pages

    Journal ref: Genes 8(8) (2017) 201

  8. arXiv:1703.00703  [pdf, other]

    q-bio.GN q-bio.QM q-fin.ST

    *K-means and Cluster Models for Cancer Signatures

    Authors: Zura Kakushadze, Willie Yu

    Abstract: We present *K-means clustering algorithm and source code by expanding statistical clustering methods applied in https://ssrn.com/abstract=2802753 to quantitative finance. *K-means is statistically deterministic without specifying initial centers, etc. We apply *K-means to extracting cancer signatures from genome data without using nonnegative matrix factorization (NMF). *K-means' computational cos…

    Submitted 18 July, 2017; v1 submitted 2 March, 2017; originally announced March 2017.

    Comments: 124 pages, 69 figures; a trivial typo corrected; to appear in Biomolecular Detection and Quantification

    Journal ref: Biomolecular Detection and Quantification 13 (2017) 7-31

  9. arXiv:1605.08887  [pdf, other]

    q-bio.GN

    Controlling the joint local false discovery rate is more powerful than meta-analysis methods in joint analysis of summary statistics from multiple genome-wide association studies

    Authors: Wei Jiang, Weichuan Yu

    Abstract: In genome-wide association studies (GWASs) of common diseases/traits, we often analyze multiple GWASs with the same phenotype together to discover associated genetic variants with higher power. Since it is difficult to access data with detailed individual measurements, summary-statistics-based meta-analysis methods have become popular to jointly analyze data sets from multiple GWASs. In this paper…

    Submitted 28 May, 2016; originally announced May 2016.

  10. arXiv:1604.08743  [pdf, other]

    q-bio.GN q-bio.QM q-fin.ST

    Factor Models for Cancer Signatures

    Authors: Zura Kakushadze, Willie Yu

    Abstract: We present a novel method for extracting cancer signatures by applying statistical risk models (http://ssrn.com/abstract=2732453) from quantitative finance to cancer genome data. Using 1389 whole genome sequenced samples from 14 cancers, we identify an "overall" mode of somatic mutational noise. We give a prescription for factoring out this noise and source code for fixing the number of signatures…

    Submitted 22 January, 2017; v1 submitted 29 April, 2016; originally announced April 2016.

    Comments: 70 pages, 21 figures; a few trivial typos corrected

    Journal ref: Physica A 462 (2016) 527-559

  11. arXiv:1602.08648  [pdf, other]

    cs.CC q-bio.GN

    Approximation hardness of Shortest Common Superstring variants

    Authors: Y. William Yu

    Abstract: The shortest common superstring (SCS) problem has been studied at great length because of its connections to the de novo assembly problem in computational genomics. The base problem is APX-complete, but several generalizations of the problem have also been studied. In particular, previous results include that SCS with Negative strings (SCSN) is in Log-APX (though there is no known hardness result)…

    Submitted 27 February, 2016; originally announced February 2016.

    Comments: 10 pages

  12. arXiv:1508.06715  [pdf, other]

    q-bio.GN stat.AP

    Estimating Reproducibility in Genome-Wide Association Studies

    Authors: Wei Jiang, Jing-Hao Xue, Weichuan Yu

    Abstract: Genome-wide association studies (GWAS) are widely used to discover genetic variants associated with diseases. To control false positives, all findings from GWAS need to be verified with additional evidences, even for associations discovered from a high power study. Replication study is a common verification method by using independent samples. An association is regarded as true positive with a hig…

    Submitted 26 August, 2015; originally announced August 2015.

  13. Entropy-scaling search of massive biological data

    Authors: Y. William Yu, Noah M. Daniels, David Christian Danko, Bonnie Berger

    Abstract: Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimensio…

    Submitted 21 September, 2015; v1 submitted 18 March, 2015; originally announced March 2015.

    Comments: Including supplement: 41 pages, 6 figures, 4 tables, 1 box

    Journal ref: Cell Systems, Volume 1, Issue 2, 130-140, 2015

  14. arXiv:1212.1055  [pdf, ps, other]

    cond-mat.soft cond-mat.stat-mech physics.bio-ph q-bio.BM

    Translocation of stiff polymers through a nanopore driven by binding particles

    Authors: Wancheng Yu, Yiding Ma, Kaifu Luo

    Abstract: We investigate the translocation of stiff polymers in the presence of binding particles through a nanopore by two-dimensional Langevin dynamics simulations. We find that the mean translocation time shows a minimum as a function of the binding energy $ε$ and the particle concentration $φ$, due to the interplay of the force from binding and the frictional force. Particularly, for the strong binding…

    Submitted 5 December, 2012; originally announced December 2012.

    Comments: 7 pages, 6 figures, accepted to J. Chem. Phys

  15. arXiv:1211.6198   

    q-bio.QM q-bio.GN stat.AP

    Running PeptideProphet Separately on Replicates Improves Peptide Identification Results

    Authors: Chao Yang, Zengyou He, Weichuan Yu

    Abstract: Limited spectrum coverage is a problem in shotgun proteomics. Replicates are generated to improve the spectrum coverage. When integrating peptide identification results obtained from replicates, the state-of-the-art algorithm PeptideProphet combines Peptide-Spectrum Matches (PSMs) before building the statistical model to calculate peptide probabilities. In this paper, we find the connection betw…

    Submitted 2 December, 2012; v1 submitted 26 November, 2012; originally announced November 2012.

    Comments: Due to an error

  16. arXiv:1211.6179  [pdf, other]

    q-bio.QM q-bio.GN

    A Combinatorial Perspective of the Protein Inference Problem

    Authors: Chao Yang, Zengyou He, Weichuan Yu

    Abstract: In a shotgun proteomics experiment, proteins are the most biologically meaningful output. The success of proteomics studies depends on the ability to accurately and efficiently identify proteins. Many methods have been proposed to facilitate the identification of proteins from the results of peptide identification. However, the relationship between protein identification and peptide identification…

    Submitted 28 November, 2012; v1 submitted 26 November, 2012; originally announced November 2012.

  17. arXiv:1108.3556  [pdf]

    cs.DS q-bio.GN

    SparseAssembler2: Sparse k-mer Graph for Memory Efficient Genome Assembly

    Authors: Chengxi Ye, Charles H. Cannon, Zhanshan Sam Ma, Douglas W. Yu, Mihai Pop

    Abstract: The formal version of our work has been published in BMC Bioinformatics and can be found here: http://www.biomedcentral.com/1471-2105/13/S6/S1 Motivation: To tackle the problem of huge memory usage associated with de Bruijn graph-based algorithms, upon which some of the most widely used de novo genome assemblers have been built, we released SparseAssembler1. SparseAssembler1 can save as much as 90…

    Submitted 9 January, 2013; v1 submitted 17 August, 2011; originally announced August 2011.

    Comments: Corresponding authors: Zhanshan (Sam) Ma; Mihai Pop || Availability: Programs in both Windows and Linux are available at: https://sites.google.com/site/sparseassembler/

  18. arXiv:1108.0562  [pdf, ps, other]

    cond-mat.soft cond-mat.stat-mech q-bio.BM

    Chaperone-assisted translocation of a polymer through a nanopore

    Authors: Wancheng Yu, Kaifu Luo

    Abstract: Using Langevin dynamics simulations, we investigate the dynamics of chaperone-assisted translocation of a flexible polymer through a nanopore. We find that increasing the binding energy $ε$ between the chaperone and the chain and the chaperone concentration $N_c$ can greatly improve the translocation probability. Particularly, with increasing the chaperone concentration a maximum translocation pro…

    Submitted 2 August, 2011; originally announced August 2011.

    Comments: 10 pages, to appear in J. Am. Chem. Soc

    Journal ref: J. Am. Chem. Soc. 133, 13565 (2011)

  19. arXiv:1106.2603  [pdf]

    cs.DS q-bio.GN

    SparseAssembler: de novo Assembly with the Sparse de Bruijn Graph

    Authors: Chengxi Ye, Zhanshan Sam Ma, Charles H. Cannon, Mihai Pop, Douglas W. Yu

    Abstract: de Bruijn graph-based algorithms are one of the two most widely used approaches for de novo genome assembly. A major limitation of this approach is the large computational memory space requirement to construct the de Bruijn graph, which scales with k-mer length and total diversity (N) of unique k-mers in the genome expressed in base pairs or roughly (2k+8)N bits. This limitation is particularly im…

    Submitted 14 June, 2011; originally announced June 2011.

    Comments: Corresponding author: Douglas W. Yu

  20. arXiv:1001.5130  [pdf, ps, other]

    q-bio.GN cs.CE q-bio.QM

    BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies

    Authors: Xiang Wan, Can Yang, Qiang Yang, Hong Xue, Xiaodan Fan, Nelson L. S. Tang, Weichuan Yu

    Abstract: Gene-gene interactions have long been recognized to be fundamentally important to understand genetic causes of complex disease traits. At present, identifying gene-gene interactions from genome-wide case-control studies is computationally and methodologically challenging. In this paper, we introduce a simple but powerful method, named 'BOolean Operation based Screening and Testing' (BOOST). To di…

    Submitted 28 January, 2010; originally announced January 2010.

    Comments: Submitted

  21. arXiv:1001.0887  [pdf, ps, other]

    cs.CE q-bio.QM

    Stable Feature Selection for Biomarker Discovery

    Authors: Zengyou He, Weichuan Yu

    Abstract: Feature selection techniques have been used as the workhorse in biomarker discovery applications for a long time. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered. It is only recently that this issue has received more and more attention. In this article, we review existing stable feature selection methods for biomarker disc…

    Submitted 6 January, 2010; originally announced January 2010.

  22. arXiv:cond-mat/0306271  [pdf, ps, other]

    cond-mat.soft physics.bio-ph q-bio

    Electrokinetic behavior of two touching inhomogeneous biological cells and colloidal particles: Effects of multipolar interactions

    Authors: J. P. Huang, Mikko Karttunen, K. W. Yu, L. Dong, G. Q. Gu

    Abstract: We present a theory to investigate electro-kinetic behavior, namely, electrorotation and dielectrophoresis under alternating current (AC) applied fields for a pair of touching inhomogeneous colloidal particles and biological cells. These inhomogeneous particles are treated as graded ones with physically motivated model dielectric and conductivity profiles. The mutual polarization interaction bet…

    Submitted 7 November, 2003; v1 submitted 11 June, 2003; originally announced June 2003.

    Comments: Revised version with minor changes: References added and discussion extended

    Journal ref: Phys. Rev. E 69, 051402 (2004)

  23. arXiv:cond-mat/0202458  [pdf, ps, other]

    cond-mat.soft cond-mat.mtrl-sci physics.bio-ph q-bio

    Dielectric behavior of oblate spheroidal particles: Application to erythrocytes suspensions

    Authors: J. P. Huang, K. W. Yu

    Abstract: We have investigated the effect of particle shape on the electrorotation (ER) spectrum of living cells suspensions. In particular, we consider coated oblate spheroidal particles and present a theoretical study of ER based on the spectral representation theory. Analytic expressions for the characteristic frequency as well as the dispersion strength can be obtained, thus simplifying the fitting of…

    Submitted 26 February, 2002; originally announced February 2002.

    Comments: RevTex; 5 eps figures

    Journal ref: Commun. Theor. Phys. 39, 506 (2003)

  24. arXiv:cond-mat/0104437  [pdf, ps, other]

    cond-mat.soft physics.bio-ph q-bio

    Spectral Representation Theory for Dielectric Behavior of Nonspherical Cell Suspensions

    Authors: J. P. Huang, K. W. Yu, Jun Lei, Hong Sun

    Abstract: Recent experiments revealed that the dielectric dispersion spectrum of fission yeast cells in a suspension was mainly composed of two sub-dispersions. The low-frequency sub-dispersion depended on the cell length, while the high-frequency one was independent of it. The cell shape effect was simulated by an ellipsoidal cell model but the comparison between theory and experiment was far from being…

    Submitted 23 April, 2001; originally announced April 2001.

    Comments: 19 pages, 5 eps figures

    Journal ref: Comm. Theor. Phys. 38, 113 (2002)

  25. arXiv:cond-mat/0103506  [pdf, ps, other]

    cond-mat.soft physics.bio-ph q-bio

    Dielectric Behavior of Nonspherical Cell Suspensions

    Authors: Jun Lei, Jones T. K. Wan, K. W. Yu, Hong Sun

    Abstract: Recent experiments revealed that the dielectric dispersion spectrum of fission yeast cells in a suspension was mainly composed of two sub-dispersions. The low-frequency sub-dispersion depended on the cell length, whereas the high-frequency one was independent of it. The cell shape effect was qualitatively simulated by an ellipsoidal cell model. However, the comparison between theory and experime…

    Submitted 23 March, 2001; originally announced March 2001.

    Comments: Preliminary results have been reported in the 2001 March Meeting of the American Physical Society. Accepted for publications in J. Phys.: Condens. Matter

    Journal ref: J. Phys.: Condens. Matter 13, 3583 (2001).