Search | arXiv e-print repository

Parameter Estimation from Single Patient, Single Time-Point Sequencing Data of Recurrent Tumors

Authors: Kevin Leder, Ru** Sun, Zicheng Wang, Xuanming Zhang

Abstract: In this study, we develop consistent estimators for key parameters that govern the dynamics of tumor cell populations when subjected to pharmacological treatments. While these treatments often lead to an initial reduction in the abundance of drug-sensitive cells, a population of drug-resistant cells frequently emerges over time, resulting in cancer recurrence. Samples from recurrent tumors are an invaluable data source that can offer crucial insights into the ability of cancer cells to adapt and withstand treatment interventions. To effectively utilize the data obtained from recurrent tumors, we derive several large number limit theorems, specifically focusing on the metrics that quantify the clonal diversity of cancer cell populations at the time of cancer recurrence. These theorems then serve as the foundation for constructing our estimators. A distinguishing feature of our approach is that our estimators require only single time-point sequencing data from a single tumor, thereby enhancing the practicality of our approach and enabling the understanding of cancer recurrence at the individual level.

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.00875 [pdf, other]

Enhancing Protein Predictive Models via Proteins Data Augmentation: A Benchmark and New Directions

Authors: Rui Sun, Lirong Wu, Haitao Lin, Yufei Huang, Stan Z. Li

Abstract: Augmentation is an effective alternative to utilize the small amount of labeled protein data. However, most of the existing work focuses on designing new architectures or pre-training tasks, and relatively little work has studied data augmentation for proteins. This paper extends data augmentation techniques previously used for images and texts to proteins and then benchmarks these techniques on a variety of protein-related tasks, providing the first comprehensive evaluation of protein augmentation. Furthermore, we propose two novel semantic-level protein augmentation methods, namely Integrated Gradients Substitution and Back Translation Substitution, which enable protein semantic-aware augmentation through saliency detection and biological knowledge. Finally, we integrate extended and proposed augmentations into an augmentation pool and propose a simple but effective framework, namely Automated Protein Augmentation (APA), which can adaptively select the most suitable augmentation combinations for different tasks. Extensive experiments have shown that APA enhances the performance of five protein-related tasks by an average of 10.55% across three architectures compared to vanilla implementations without augmentation, highlighting its potential to make a great impact on the field.

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2207.06010 [pdf, other]

Does GNN Pretraining Help Molecular Representation?

Authors: Ruoxi Sun, Hanjun Dai, Adams Wei Yu

Abstract: Extracting informative representations of molecules using graph neural networks (GNNs) is crucial in AI-driven drug discovery. Recently, the graph research community has been trying to replicate the success of self-supervised pretraining in natural language processing, with several successes claimed. However, we find the benefit brought by self-supervised pretraining on small molecular data can be negligible in many cases. We conduct thorough ablation studies on the key components of GNN pretraining, including pretraining objectives, data splitting methods, input features, pretraining dataset scales, and GNN architectures, to see how they affect the accuracy of the downstream tasks. Our first important finding is that self-supervised graph pretraining does not always have statistically significant advantages over non-pretraining methods in many settings. Secondly, although noticeable improvement can be observed with additional supervised pretraining, the improvement may diminish with richer features or more balanced data splits. Thirdly, hyper-parameters could have larger impacts on the accuracy of downstream tasks than the choice of pretraining tasks, especially when the scales of downstream tasks are small. Finally, we conjecture that the complexity of some pretraining methods on small molecules might be insufficient, and we follow with empirical evidence on different pretraining datasets.

Submitted 2 November, 2022; v1 submitted 13 July, 2022; originally announced July 2022.

arXiv:2102.07713 [pdf, other]

Cancer Gene Profiling through Unsupervised Discovery

Authors: Enzo Battistella, Maria Vakalopoulou, Roger Sun, Théo Estienne, Marvin Lerousseau, Sergey Nikolaev, Emilie Alvarez Andres, Alexandre Carré, Stéphane Niyoteka, Charlotte Robert, Nikos Paragios, Eric Deutsch

Abstract: Precision medicine is a paradigm shift in healthcare relying heavily on genomics data. However, the complexity of biological interactions, the large number of genes, and the lack of comparisons on the analysis of data remain a tremendous bottleneck regarding clinical adoption. In this paper, we introduce a novel, automatic and unsupervised framework to discover low-dimensional gene biomarkers. Our method is based on the LP-Stability algorithm, a high dimensional center-based unsupervised clustering algorithm that offers modularity as concerns metric functions and scalability, while being able to automatically determine the best number of clusters. Our evaluation includes both mathematical and biological criteria. The recovered signature is applied to a variety of biological tasks, including screening of biological pathways and functions, and characterization relevance on tumor types and subtypes. Quantitative comparisons among different distance metrics, commonly used clustering methods and a referential gene signature used in the literature confirm state of the art performance of our approach. In particular, our signature, which is based on 27 genes, reports at least 30 times better mathematical significance (average Dunn's Index) and 25% better biological significance (average Enrichment in Protein-Protein Interaction) than those produced by other referential clustering methods. Finally, our signature reports promising results on distinguishing immune inflammatory and immune desert tumors, while reporting a high balanced accuracy of 92% on tumor types classification and averaged balanced accuracy of 68% on tumor subtypes classification, which represents, respectively, 7% and 9% higher performance compared to the referential signature.

Submitted 11 February, 2021; originally announced February 2021.

arXiv:2010.15191 [pdf]

Chronic, cortex-wide imaging of specific cell populations during behavior

Authors: Joao Couto, Simon Musall, Xiaonan R Sun, Anup Khanal, Steven Gluf, Shreya Saxena, Ian Kinsella, Taiga Abe, John P. Cunningham, Liam Paninski, Anne K Churchland

Abstract: Measurements of neuronal activity across brain areas are important for understanding the neural correlates of cognitive and motor processes like attention, decision-making, and action selection. However, techniques that allow cellular resolution measurements are expensive and require a high degree of technical expertise, which limits their broad use. Widefield imaging of genetically encoded indicators is a high throughput, cost effective, and flexible approach to measure activity of specific cell populations with high temporal resolution and a cortex-wide field of view. Here we outline our protocol for assembling a widefield setup, a surgical preparation to image through the intact skull, and imaging neural activity chronically in behaving, transgenic mice that express a calcium indicator in specific subpopulations of cortical neurons. Further, we highlight a processing pipeline that leverages novel, cloud-based methods to analyze large-scale imaging datasets. The protocol targets labs that are seeking to build macroscopes, optimize surgical procedures for long-term chronic imaging, and/or analyze cortex-wide neuronal recordings.

Submitted 28 October, 2020; originally announced October 2020.

Comments: 36 pages, 7 figures, 2 supplementary figures

arXiv:2007.13437 [pdf, other]

Energy-based View of Retrosynthesis

Authors: Ruoxi Sun, Hanjun Dai, Li Li, Steven Kearnes, Bo Dai

Abstract: Retrosynthesis -- the process of identifying a set of reactants to synthesize a target molecule -- is of vital importance to material design and drug discovery. Existing machine learning approaches based on language models and graph neural networks have achieved encouraging results. In this paper, we propose a framework that unifies sequence- and graph-based methods as energy-based models (EBMs) with different energy functions. This unified perspective provides critical insights about EBM variants through a comprehensive assessment of performance. Additionally, we present a novel dual variant within the framework that performs consistent training over Bayesian forward- and backward-prediction by constraining the agreement between the two directions. This model improves state-of-the-art performance by 9.6% for template-free approaches where the reaction type is unknown.

Submitted 8 December, 2021; v1 submitted 14 July, 2020; originally announced July 2020.

arXiv:1610.03182 [pdf]

wtest: an R Package for Testing Main and Interaction Effect in Genotype Data with Binary Traits

Authors: Rui Sun, Billy Chang, Benny Chung-Ying Zee, Maggie Haitian Wang

Abstract: This R package evaluates main and pair-wise interaction effects of single nucleotide polymorphisms (SNPs) via the W-test, scalable to whole genome-wide data sets. The package provides fast and accurate p-value estimation of genetic markers, as well as diagnostic checking on the probability distributions. It allows flexible stage-wise or exhaustive association testing in a user-friendly interface. Availability: The package is available on CRAN or from the website: http://www2.ccrb.cuhk.edu.hk/wtest

Submitted 11 October, 2016; originally announced October 2016.

Comments: 7 pages, 1 figure

arXiv:1607.07834 [pdf]

A W-test collapsing method for rare variant testing with applications to exome sequencing data of hypertensive disorder

Authors: Rui Sun, Haoyi Weng, Inchi Hu, Junfeng Guo, William K. K. Wu, Benny Chung-Ying Zee, Maggie Haitian Wang

Abstract: Advancement in sequencing technology enables the study of association between complex disorders and rare variants with low minor allele frequencies. One of the major challenges in rare variant testing is the lack of statistical power of traditional testing methods due to extremely low variances of single nucleotide polymorphisms. In this paper, we introduce a W-test collapsing method that evaluates the distributional differences in cases and controls using a combined log of odds ratio. The proposed method is compared with the Weighted-Sum Statistic and Sequence Kernel Association Test using simulation data sets and showed better performance and faster computing speed. In the study of a real next generation sequencing data set of hypertensive disorder, we identified genes of interesting biological functions that are associated with metabolism disorder and inflammation, including MACROD1, NLRP7, AGK, PAK6 and APBB1. The W-test collapsing method offers a fast and effective alternative for rare variant association analysis.

Submitted 26 July, 2016; originally announced July 2016.

Comments: 18 pages, 1 figure, 4 tables. Accepted by Genetic Epidemiology

arXiv:1606.08941 [pdf]

Enhancing power of rare variant association test by Zoom-Focus Algorithm (ZFA) to locate optimal testing region

Authors: Maggie Haitian Wang, Haoyi Weng, Rui Sun, Benny Chung-Ying Zee

Abstract: Motivation: Exome or targeted sequencing data pose an analytical challenge for testing single nucleotide polymorphisms (SNPs) with extremely small minor allele frequency (MAF). Various rare variant tests were proposed to increase power by aggregating SNPs within a fixed genomic region, such as a gene or pathway. However, a gene could contain from several to thousands of markers, and not all of them may be related to the phenotype. Combining functional and non-functional SNPs in an arbitrary genomic region could impair the testing power. Results: We propose a Zoom-Focus algorithm (ZFA) to locate the optimal testing region within a given genomic region, as a wrapper function to be applied in conjunction with rare variant association tests. In the first Zooming step, a given genomic region is partitioned by order of two, and the best partition is located within all partition levels. In the next Focusing step, boundaries of the zoomed region are refined. Simulation studies showed that ZFA substantially enhanced the statistical power of rare variant tests, including the WSS, SKAT and W-test, by over 10-fold. The algorithm is applied to real exome sequencing data of hypertensive disorder, and identified genetic markers biologically relevant to metabolic disorder that are undiscoverable by testing the full gene. The proposed algorithm is an efficient and powerful tool to increase the effectiveness of rare variant association tests for exome sequencing datasets of complex disorders.

Submitted 28 June, 2016; originally announced June 2016.

Comments: Main paper: 13 pages, 2 figures, 3 tables, 3 diagrams; Submitted to Bioinformatics, and the 27th International Conference on Genome Informatics

Showing 1–9 of 9 results for author: Sun, R