Search | arXiv e-print repository

arXiv:1911.07335 [pdf, other]

doi 10.1007/s10994-020-05897-1

Using Error Decay Prediction to Overcome Practical Issues of Deep Active Learning for Named Entity Recognition

Authors: Haw-Shiuan Chang, Shankar Vembu, Sunil Mohan, Rheeya Uppaal, Andrew McCallum

Abstract: Existing deep active learning algorithms achieve impressive sampling efficiency on natural language processing tasks. However, they exhibit several weaknesses in practice, including (a) inability to use uncertainty sampling with black-box models, (b) lack of robustness to labeling noise, and (c) lack of transparency. In response, we propose a transparent batch active sampling framework by estimating the error decay curves of multiple feature-defined subsets of the data. Experiments on four named entity recognition (NER) tasks demonstrate that the proposed methods significantly outperform diversification-based methods for black-box NER taggers, and can make the sampling process more robust to labeling noise when combined with uncertainty-based methods. Furthermore, the analysis of experimental results sheds light on the weaknesses of different active sampling strategies, and when traditional uncertainty-based or diversification-based methods can be expected to work well.

Submitted 20 July, 2020; v1 submitted 17 November, 2019; originally announced November 2019.

Comments: This is a pre-print of an article published in Springer Machine Learning journal. The final authenticated version is available online at: https://doi.org/10.1007/s10994-020-05897-1

arXiv:1710.08579 [pdf, other]

Implementing Recommendation Algorithms in a Large-Scale Biomedical Science Knowledge Base

Authors: Jessica Perrie, Yanqi Hao, Zack Hayat, Recep Colak, Kelly Lyons, Shankar Vembu, Sam Molyneux

Abstract: The number of biomedical research articles published has doubled in the past 20 years. Search engine based systems naturally center around searching, but researchers may not have a clear goal in mind, or the goal may be expressed in a query that a literature search engine cannot easily answer, such as identifying the most prominent authors in a given field of research. The discovery process can be improved by providing researchers with recommendations for relevant papers or for researchers who are dealing with related bodies of work. In this paper we describe several recommendation algorithms that were implemented in the Meta platform. The Meta platform contains over 27 million articles and continues to grow daily. It provides an online map of science that organizes, in real time, all published biomedical research. The ultimate goal is to make it quicker and easier for researchers to filter through scientific papers, find the most important work, and keep up with emerging research results. Meta generates and maintains a semantic knowledge network consisting of these core entities: authors, papers, journals, institutions, and concepts. We implemented several recommendation algorithms and evaluated their efficiency in this large-scale biomedical knowledge base.
We selected recommendation algorithms that could take advantage of the unique environment of the Meta platform, such as those that make use of diverse datasets (citation networks, text content, semantic tag content, and co-authorship information) and those that can scale to very large datasets. In this paper, we describe the recommendation algorithms that were implemented and report on their relative efficiency and the challenges associated with developing and deploying a production recommendation engine system.

Submitted 23 October, 2017; originally announced October 2017.

Comments: 21 pages; 5 figures

arXiv:1607.06988 [pdf, other]

Interactive Learning from Multiple Noisy Labels

Authors: Shankar Vembu, Sandra Zilles

Abstract: Interactive learning is a process in which a machine learning algorithm is provided with meaningful, well-chosen examples as opposed to randomly chosen examples typical in standard supervised learning. In this paper, we propose a new method for interactive learning from multiple noisy labels where we exploit the disagreement among annotators to quantify the easiness (or meaningfulness) of an example. We demonstrate the usefulness of this method in estimating the parameters of a latent variable classification model, and conduct experimental analyses on a range of synthetic and benchmark datasets. Furthermore, we theoretically analyze the performance of perceptron in this interactive learning framework.

Submitted 23 July, 2016; originally announced July 2016.

arXiv:1408.2552 [pdf, other]

Comparing Nonparametric Bayesian Tree Priors for Clonal Reconstruction of Tumors

Authors: Amit G. Deshwar, Shankar Vembu, Quaid Morris

Abstract: Statistical machine learning methods, especially nonparametric Bayesian methods, have become increasingly popular to infer clonal population structure of tumors. Here we describe the treeCRP, an extension of the Chinese restaurant process (CRP), a popular construction used in nonparametric mixture models, to infer the phylogeny and genotype of major subclonal lineages represented in the population of cancer cells. We also propose new split-merge updates tailored to the subclonal reconstruction problem that improve the mixing time of Markov chains. In comparisons with the tree-structured stick breaking prior used in PhyloSub, we demonstrate superior mixing and running time using the treeCRP with our new split-merge procedures. We also show that given the same number of samples, TSSB and treeCRP have similar ability to recover the subclonal structure of a tumor.

Submitted 11 August, 2014; originally announced August 2014.

Comments: Preprint of an article submitted for consideration in the Pacific Symposium on Biocomputing © 2015; World Scientific Publishing Co., Singapore, 2015; http://psb.stanford.edu/

arXiv:1406.7250 [pdf, other]

Reconstructing subclonal composition and evolution from whole genome sequencing of tumors

Authors: Amit G. Deshwar, Shankar Vembu, Christina K. Yung, Gun Ho Jang, Lincoln Stein, Quaid Morris

Abstract: Tumors often contain multiple subpopulations of cancerous cells defined by distinct somatic mutations. We describe a new method, PhyloWGS, that can be applied to WGS data from one or more tumor samples to reconstruct complete genotypes of these subpopulations based on variant allele frequencies (VAFs) of point mutations and population frequencies of structural variations. We introduce a principled phylogenic correction for VAFs in loci affected by copy number alterations and we show that this correction greatly improves subclonal reconstruction compared to existing methods.

Submitted 6 January, 2015; v1 submitted 27 June, 2014; originally announced June 2014.

arXiv:1210.3384 [pdf, other]

Inferring clonal evolution of tumors from single nucleotide somatic mutations

Authors: Wei Jiao, Shankar Vembu, Amit G. Deshwar, Lincoln Stein, Quaid Morris

Abstract: High-throughput sequencing allows the detection and quantification of frequencies of somatic single nucleotide variants (SNV) in heterogeneous tumor cell populations. In some cases, the evolutionary history and population frequency of the subclonal lineages of tumor cells present in the sample can be reconstructed from these SNV frequency measurements. However, automated methods to do this reconstruction are not available and the conditions under which reconstruction is possible have not been described. We describe the conditions under which the evolutionary history can be uniquely reconstructed from SNV frequencies from single or multiple samples from the tumor population and we introduce a new statistical model, PhyloSub, that infers the phylogeny and genotype of the major subclonal lineages represented in the population of cancer cells. It uses a Bayesian nonparametric prior over trees that groups SNVs into major subclonal lineages and automatically estimates the number of lineages and their ancestry. We sample from the joint posterior distribution over trees to identify evolutionary histories and cell population frequencies that have the highest probability of generating the observed SNV frequency data. When multiple phylogenies are consistent with a given set of SNV frequencies, PhyloSub represents the uncertainty in the tumor phylogeny using a partial order plot.
Experiments on a simulated dataset and two real datasets comprising tumor samples from acute myeloid leukemia and chronic lymphocytic leukemia patients demonstrate that PhyloSub can infer both linear (or chain) and branching lineages and its inferences are in good agreement with ground truth, where it is available.

Submitted 2 November, 2013; v1 submitted 11 October, 2012; originally announced October 2012.

arXiv:1206.4661 [pdf]

Predicting accurate probabilities with a ranking loss

Authors: Aditya Menon, Xiaoqian Jiang, Shankar Vembu, Charles Elkan, Lucila Ohno-Machado

Abstract: In many real-world applications of machine learning classifiers, it is essential to predict the probability of an example belonging to a particular class. This paper proposes a simple technique for predicting probabilities based on optimizing a ranking loss, followed by isotonic regression. This semi-parametric technique offers both good ranking and regression performance, and models a richer set of probability distributions than statistical workhorses such as logistic regression. We provide experimental results that show the effectiveness of this technique on real-world applications of probability prediction.

Submitted 18 June, 2012; originally announced June 2012.

Comments: ICML2012

arXiv:1205.2610 [pdf]

Probabilistic Structured Predictors

Authors: Shankar Vembu, Thomas Gartner, Mario Boley

Abstract: We consider MAP estimators for structured prediction with exponential family models. In particular, we concentrate on the case that efficient algorithms for uniform sampling from the output space exist. We show that under this assumption (i) exact computation of the partition function remains a hard problem, and (ii) the partition function and the gradient of the log partition function can be approximated efficiently. Our main result is an approximation scheme for the partition function based on Markov Chain Monte Carlo theory. We also show that the efficient uniform sampling assumption holds in several application settings that are of importance in machine learning.

Submitted 9 May, 2012; originally announced May 2012.

Comments: Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI2009). arXiv admin note: substantial text overlap with arXiv:0912.4473

Report number: UAI-P-2009-PG-557-564

arXiv:0912.4473 [pdf, ps, other]

Learning to Predict Combinatorial Structures

Authors: Shankar Vembu

Abstract: The major challenge in designing a discriminative learning algorithm for predicting structured data is to address the computational issues arising from the exponential size of the output space. Existing algorithms make different assumptions to ensure efficient, polynomial time estimation of model parameters. For several combinatorial structures, including cycles, partially ordered sets, permutations and other graph classes, these assumptions do not hold. In this thesis, we address the problem of designing learning algorithms for predicting combinatorial structures by introducing two new assumptions: (i) The first assumption is that a particular counting problem can be solved efficiently. The consequence is a generalisation of the classical ridge regression for structured prediction. (ii) The second assumption is that a particular sampling problem can be solved efficiently. The consequence is a new technique for designing and analysing probabilistic structured prediction models. These results can be applied to solve several complex learning problems including but not limited to multi-label classification, multi-category hierarchical classification, and label ranking.

Submitted 26 June, 2010; v1 submitted 22 December, 2009; originally announced December 2009.

Comments: PhD thesis, Department of Computer Science, University of Bonn (submitted, December 2009)

Showing 1–9 of 9 results for author: Vembu, S