-
Prognostic Covariate Adjustment for Binary Outcomes Using Stratification
Authors:
Alyssa M. Vanderbeek,
Jessica L. Ross,
David P. Miller,
Alejandro Schuler
Abstract:
Covariate adjustment and methods of incorporating historical data in randomized clinical trials (RCTs) each provide opportunities to increase trial power. We unite these approaches for the analysis of RCTs with binary outcomes based on the Cochran-Mantel-Haenszel (CMH) test for marginal risk ratio (RR). In PROCOVA-CMH, subjects are stratified on a single prognostic covariate reflective of their predicted outcome on the control treatment (e.g. placebo). This prognostic score is generated based on baseline covariates through a model trained on historical data. We propose two closed-form prospective estimators for the asymptotic sampling variance of the log RR that rely only on values obtainable from observed historical outcomes and the prognostic model. Importantly, these estimators can be used to inform sample size during trial planning. PROCOVA-CMH demonstrates type I error control and appropriate asymptotic coverage for valid inference. Like other covariate adjustment methods, PROCOVA-CMH can reduce the variance of the treatment effect estimate when compared to an unadjusted (unstratified) CMH analysis. In addition to statistical methods, simulations and a case study in Alzheimer's Disease are given to demonstrate performance. Results show that PROCOVA-CMH can provide a gain in power, which can be used to conduct smaller trials.
Submitted 19 December, 2022;
originally announced December 2022.
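As a rough illustration of the stratified analysis that the PROCOVA-CMH abstract above builds on, the following sketch computes the standard Mantel-Haenszel pooled risk ratio across strata. It is not the authors' implementation and does not include their proposed variance estimators. The `Stratum` class and `mantel_haenszel_rr` function are hypothetical names, and the strata are assumed to have already been formed, for example by binning a prognostic score.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Stratum:
    """Event counts for one stratum (hypothetical container for illustration)."""
    events_treated: int
    n_treated: int
    events_control: int
    n_control: int


def mantel_haenszel_rr(strata: Iterable[Stratum]) -> float:
    """Standard Mantel-Haenszel pooled risk ratio (treated vs. control)."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        total = s.n_treated + s.n_control
        if total == 0:
            continue  # an empty stratum carries no information
        # Each stratum contributes with weight proportional to 1 / total.
        numerator += s.events_treated * s.n_control / total
        denominator += s.events_control * s.n_treated / total
    if denominator == 0:
        raise ValueError("risk ratio undefined: no control-arm events")
    return numerator / denominator


# Assumed example: two strata (e.g. low and high predicted risk).
example = [Stratum(10, 100, 20, 100), Stratum(5, 50, 10, 50)]
pooled_rr = mantel_haenszel_rr(example)
```

When every stratum has the same risk ratio, as in the assumed example, the pooled estimate equals that common value.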
-
Restricted mean survival time estimate using covariate adjusted pseudovalue regression to improve precision
Authors:
Yunfan Li,
Jessica L. Ross,
Aaron M. Smith,
David P. Miller
Abstract:
Covariate adjustment is desired by both practitioners and regulators of randomized clinical trials because it improves precision for estimating treatment effects. However, covariate adjustment presents a particular challenge in time-to-event analysis. We propose to apply covariate adjusted pseudovalue regression to estimate the between-treatment difference in restricted mean survival times (RMST). Our proposed method incorporates a prognostic covariate to increase the precision of the treatment effect estimate, maintaining strict type I error control without introducing bias. In addition, the amount of increase in precision can be quantified and taken into account in sample size calculation at the study design stage. Consequently, our proposed method provides the ability to design smaller randomized studies at no expense to statistical power.
Submitted 18 July, 2023; v1 submitted 8 August, 2022;
originally announced August 2022.
-
Never mind the metrics -- what about the uncertainty? Visualising confusion matrix metric distributions
Authors:
David Lovell,
Dimity Miller,
Jaiden Capra,
Andrew Bradley
Abstract:
There are strong incentives to build models that demonstrate outstanding predictive performance on various datasets and benchmarks. We believe these incentives risk a narrow focus on models and on the performance metrics used to evaluate and compare them -- resulting in a growing body of literature to evaluate and compare metrics. This paper strives for a more balanced perspective on classifier performance metrics by highlighting their distributions under different models of uncertainty and showing how this uncertainty can easily eclipse differences in the empirical performance of classifiers. We begin by emphasising the fundamentally discrete nature of empirical confusion matrices and show how binary matrices can be meaningfully represented in a three dimensional compositional lattice, whose cross-sections form the basis of the space of receiver operating characteristic (ROC) curves. We develop equations, animations and interactive visualisations of the contours of performance metrics within (and beyond) this ROC space, showing how some are affected by class imbalance. We provide interactive visualisations that show the discrete posterior predictive probability mass functions of true and false positive rates in ROC space, and how these relate to uncertainty in performance metrics such as Balanced Accuracy (BA) and the Matthews Correlation Coefficient (MCC). Our hope is that these insights and visualisations will raise greater awareness of the substantial uncertainty in performance metric estimates that can arise when classifiers are evaluated on empirical datasets and benchmarks, and that classification model performance claims should be tempered by this understanding.
Submitted 5 June, 2022;
originally announced June 2022.
-
Lorentz Group Equivariant Neural Network for Particle Physics
Authors:
Alexander Bogatskiy,
Brandon Anderson,
Jan T. Offermann,
Marwah Roussi,
David W. Miller,
Risi Kondor
Abstract:
We present a neural network architecture that is fully equivariant with respect to transformations under the Lorentz group, a fundamental symmetry of space and time in physics. The architecture is based on the theory of the finite-dimensional representations of the Lorentz group and the equivariant nonlinearity involves the tensor product. For classification tasks in particle physics, we demonstrate that such an equivariant architecture leads to drastically simpler models that have relatively few learnable parameters and are much more physically interpretable than leading approaches that use CNNs and point cloud approaches. The competitive performance of the network is demonstrated on a public classification dataset [27] for tagging top quark decays given energy-momenta of jet constituents produced in proton-proton collisions.
Submitted 8 June, 2020;
originally announced June 2020.
-
Understanding the stochastic partial differential equation approach to smoothing
Authors:
David L Miller,
Richard Glennie,
Andrew E Seaton
Abstract:
Correlation and smoothness are terms used to describe a wide variety of random quantities. In time, space, and many other domains, they both imply the same idea: quantities that occur closer together are more similar than those further apart. Two popular statistical models that represent this idea are basis-penalty smoothers (Wood, 2017) and stochastic partial differential equations (SPDE) (Lindgren et al., 2011). In this paper, we discuss how the SPDE can be interpreted as a smoothing penalty and can be fitted using the R package mgcv, allowing practitioners with existing knowledge of smoothing penalties to better understand the implementation and theory behind the SPDE approach.
Submitted 9 June, 2020; v1 submitted 21 January, 2020;
originally announced January 2020.
-
Automated metrics calculation in a dynamic heterogeneous environment
Authors:
Craig Boucher,
Ulf Knoblich,
Daniel Miller,
Sasha Patotski,
Amin Saied,
Venky Venkateshaiah
Abstract:
A consistent theme in software experimentation at Microsoft has been solving problems of experimentation at scale for a diverse set of products. Running experiments at scale (i.e., many experiments on many users) has become state of the art across the industry. However, providing a single platform that allows software experimentation in a highly heterogeneous and constantly evolving ecosystem remains a challenge. In our case, heterogeneity spans multiple dimensions. First, we need to support experimentation for many types of products: websites, search engines, mobile apps, operating systems, cloud services and others. Second, due to the diversity of the products and teams using our platform, it needs to be flexible enough to analyze data in multiple compute fabrics (e.g. Spark, Azure Data Explorer), with a way to easily add support for new fabrics if needed. Third, one of the main factors in facilitating growth of experimentation culture in an organization is to democratize metric definition and analysis processes. To achieve that, our system needs to be simple enough to be used not only by data scientists, but also engineers, product managers and sales teams. Finally, different personas might need to use the platform for different types of analyses, e.g. dashboards or experiment analysis, and the platform should be flexible enough to accommodate that. This paper presents our solution to the problems of heterogeneity listed above.
Submitted 2 December, 2019;
originally announced December 2019.
-
Revealing Perceptible Backdoors, without the Training Set, via the Maximum Achievable Misclassification Fraction Statistic
Authors:
Zhen Xiang,
David J. Miller,
Hang Wang,
George Kesidis
Abstract:
Recently, a backdoor data poisoning attack was proposed, which adds mislabeled examples to the training set, with an embedded backdoor pattern, aiming to have the classifier learn to classify to a target class whenever the backdoor pattern is present in a test sample. Here, we address post-training detection of innocuous perceptible backdoors in DNN image classifiers, wherein the defender does not have access to the poisoned training set, but only to the trained classifier, as well as unpoisoned examples. This problem is challenging because without the poisoned training set, we have no hint about the actual backdoor pattern used during training. This post-training scenario is also of great import because in many practical contexts the DNN user did not train the DNN and does not have access to the training data. We identify two important properties of perceptible backdoor patterns - spatial invariance and robustness - based upon which we propose a novel detector using the maximum achievable misclassification fraction (MAMF) statistic. We detect whether the trained DNN has been backdoor-attacked and infer the source and target classes. Our detector outperforms other existing detectors and, coupled with an imperceptible backdoor detector, helps achieve post-training detection of all evasive backdoors.
Submitted 6 April, 2020; v1 submitted 18 November, 2019;
originally announced November 2019.
-
Notes on Margin Training and Margin p-Values for Deep Neural Network Classifiers
Authors:
George Kesidis,
David J. Miller,
Zhen Xiang
Abstract:
We provide a new local class-purity theorem for Lipschitz continuous DNN classifiers. In addition, we discuss how to achieve classification margin for training samples. Finally, we describe how to compute margin p-values for test samples.
Submitted 5 December, 2019; v1 submitted 14 October, 2019;
originally announced October 2019.
-
Detection of Backdoors in Trained Classifiers Without Access to the Training Set
Authors:
Zhen Xiang,
David J. Miller,
George Kesidis
Abstract:
Recently, a special type of data poisoning (DP) attack targeting Deep Neural Network (DNN) classifiers, known as a backdoor, was proposed. These attacks do not seek to degrade classification accuracy, but rather to have the classifier learn to classify to a target class whenever the backdoor pattern is present in a test example. Launching backdoor attacks does not require knowledge of the classifier or its training process - it only needs the ability to poison the training set with (a sufficient number of) exemplars containing a sufficiently strong backdoor pattern (labeled with the target class). Here we address post-training detection of backdoor attacks in DNN image classifiers, seldom considered in existing works, wherein the defender does not have access to the poisoned training set, but only to the trained classifier itself, as well as to clean examples from the classification domain. This is an important scenario because a trained classifier may be the basis of e.g. a phone app that will be shared with many users. Detecting backdoors post-training may thus reveal a widespread attack. We propose a purely unsupervised anomaly detection (AD) defense against imperceptible backdoor attacks that: i) detects whether the trained DNN has been backdoor-attacked; ii) infers the source and target classes involved in a detected attack; iii) we even demonstrate it is possible to accurately estimate the backdoor pattern. We test our AD approach, in comparison with alternative defenses, for several backdoor patterns, data sets, and attack settings and demonstrate its favorability. Our defense essentially requires setting a single hyperparameter (the detection threshold), which can e.g. be chosen to fix the system's false positive rate.
Submitted 19 August, 2020; v1 submitted 27 August, 2019;
originally announced August 2019.
-
Leveraging BERT for Extractive Text Summarization on Lectures
Authors:
Derek Miller
Abstract:
In the last two decades, automatic extractive text summarization on lectures has proven to be a useful tool for collecting key phrases and sentences that best represent the content. However, many current approaches rely on dated techniques, producing sub-par outputs or requiring several hours of manual tuning to produce meaningful results. Recently, new machine learning architectures have provided mechanisms for extractive summarization through the clustering of output embeddings from deep learning models. This paper reports on the project called Lecture Summarization Service, a Python-based RESTful service that utilizes the BERT model for text embeddings and KMeans clustering to identify sentences closest to the centroid for summary selection. The purpose of the service was to provide students a utility that could summarize lecture content, based on their desired number of sentences. On top of the summary work, the service also includes lecture and summary management, storing content on the cloud which can be used for collaboration. While the results of utilizing BERT for extractive summarization were promising, there were still areas where the model struggled, providing future research opportunities for further improvement.
Submitted 7 June, 2019;
originally announced June 2019.
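The selection step described in the abstract above, choosing the sentence whose embedding lies closest to each cluster centroid, might look roughly like the sketch below. It is an illustration only, not the Lecture Summarization Service code. The embeddings and centroids are assumed to be precomputed (in the paper's setting they would come from BERT and KMeans respectively), and `nearest_to_centroids` is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def nearest_to_centroids(
    embeddings: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
) -> List[int]:
    """Return sorted indices of the sentences closest to each centroid."""
    chosen: List[int] = []
    for centroid in centroids:
        # Index of the embedding with the smallest Euclidean distance.
        best = min(
            range(len(embeddings)),
            key=lambda i: math.dist(embeddings[i], centroid),
        )
        if best not in chosen:  # two centroids may share a nearest sentence
            chosen.append(best)
    # Sorting keeps the summary sentences in their original lecture order.
    return sorted(chosen)


# Assumed toy data: four 2-D "sentence embeddings" and two centroids.
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0]]
toy_centroids = [[0.02, 0.0], [5.15, 5.0]]
summary_indices = nearest_to_centroids(toy_embeddings, toy_centroids)
```

Under this scheme the number of centroids corresponds to the desired number of summary sentences, although the result can be shorter when two centroids pick the same sentence.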
-
Adversarial Learning in Statistical Classification: A Comprehensive Review of Defenses Against Attacks
Authors:
David J. Miller,
Zhen Xiang,
George Kesidis
Abstract:
There is great potential for damage from adversarial learning (AL) attacks on machine-learning based systems. In this paper, we provide a contemporary survey of AL, focused particularly on defenses against attacks on statistical classifiers. After introducing relevant terminology and the goals and range of possible knowledge of both attackers and defenders, we survey recent work on test-time evasion (TTE), data poisoning (DP), and reverse engineering (RE) attacks and particularly defenses against same. In so doing, we distinguish robust classification from anomaly detection (AD), unsupervised from supervised, and statistical hypothesis-based defenses from ones that do not have an explicit null (no attack) hypothesis; we identify the hyperparameters a particular method requires, its computational complexity, as well as the performance measures on which it was evaluated and the obtained quality. We then dig deeper, providing novel insights that challenge conventional AL wisdom and that target unresolved issues, including: 1) robust classification versus AD as a defense strategy; 2) the belief that attack success increases with attack strength, which ignores susceptibility to AD; 3) small perturbations for test-time evasion attacks: a fallacy or a requirement?; 4) validity of the universal assumption that a TTE attacker knows the ground-truth class for the example to be attacked; 5) black, grey, or white box attacks as the standard for defense evaluation; 6) susceptibility of query-based RE to an AD defense. We also discuss attacks on the privacy of training data. We then present benchmark comparisons of several defenses against TTE, RE, and backdoor DP attacks on images. The paper concludes with a discussion of future work.
Submitted 2 December, 2019; v1 submitted 12 April, 2019;
originally announced April 2019.
-
Bayesian views of generalized additive modelling
Authors:
David L. Miller
Abstract:
Generalized additive models (GAMs) are a commonly used, flexible framework applied to many problems in statistical ecology. GAMs are often considered to be a purely frequentist framework ("generalized linear models with wiggly bits"); however, links between frequentist and Bayesian approaches to these models were highlighted early on in the literature. Bayesian thinking underlies many parts of the implementation in the popular R package mgcv as well as in GAM theory more generally. This article aims to highlight useful links (and differences) between Bayesian and frequentist approaches to smoothing, and their practical applications in ecology (with an mgcv-centric viewpoint). Here I give some background for these results then move on to two important topics for quantitative ecologists: term/model selection and uncertainty estimation.
Submitted 6 October, 2021; v1 submitted 4 February, 2019;
originally announced February 2019.
-
When Not to Classify: Detection of Reverse Engineering Attacks on DNN Image Classifiers
Authors:
Yujia Wang,
David J. Miller,
George Kesidis
Abstract:
This paper addresses detection of a reverse engineering (RE) attack targeting a deep neural network (DNN) image classifier; by querying, RE's aim is to discover the classifier's decision rule. RE can enable test-time evasion attacks, which require knowledge of the classifier. Recently, we proposed a quite effective approach (ADA) to detect test-time evasion attacks. In this paper, we extend ADA to detect RE attacks (ADA-RE). We demonstrate our method is successful in detecting "stealthy" RE attacks before they learn enough to launch effective test-time evasion attacks.
Submitted 31 October, 2018;
originally announced November 2018.
-
A Mixture Model Based Defense for Data Poisoning Attacks Against Naive Bayes Spam Filters
Authors:
David J. Miller,
Xinyi Hu,
Zhen Xiang,
George Kesidis
Abstract:
Naive Bayes spam filters are highly susceptible to data poisoning attacks. Here, known spam sources/blacklisted IPs exploit the fact that their received emails will be treated as (ground truth) labeled spam examples, and used for classifier training (or re-training). The attacking source thus generates emails that will skew the spam model, potentially resulting in great degradation in classifier accuracy. Such attacks are successful mainly because of the poor representation power of the naive Bayes (NB) model, with only a single (component) density to represent spam (plus a possible attack). We propose a defense based on the use of a mixture of NB models. We demonstrate that the learned mixture almost completely isolates the attack in a second NB component, with the original spam component essentially unchanged by the attack. Our approach addresses both the scenario where the classifier is being re-trained in light of new data and, significantly, the more challenging scenario where the attack is embedded in the original spam training set. Even for weak attack strengths, BIC-based model order selection chooses a two-component solution, which invokes the mixture-based defense. Promising results are presented on the TREC 2005 spam corpus.
Submitted 31 October, 2018;
originally announced November 2018.
-
Backdoor Embedding in Convolutional Neural Network Models via Invisible Perturbation
Authors:
Cong Liao,
Haoti Zhong,
Anna Squicciarini,
Sencun Zhu,
David Miller
Abstract:
Deep learning models have consistently outperformed traditional machine learning models in various classification tasks, including image classification. As such, they have become increasingly prevalent in many real world applications including those where security is of great concern. Such popularity, however, may attract attackers to exploit the vulnerabilities of the deployed deep learning models and launch attacks against security-sensitive applications. In this paper, we focus on a specific type of data poisoning attack, which we refer to as a backdoor injection attack. The main goal of the adversary performing such an attack is to generate and inject a backdoor into a deep learning model that can be triggered to recognize certain embedded patterns with a target label of the attacker's choice. Additionally, a backdoor injection attack should occur in a stealthy manner, without undermining the efficacy of the victim model. Specifically, we propose two approaches for generating a backdoor that is hardly perceptible yet effective in poisoning the model. We consider two attack settings, with backdoor injection carried out either before model training or during model updating. We carry out extensive experimental evaluations under various assumptions on the adversary model, and demonstrate that such attacks can be effective and achieve a high attack success rate (above 90%) at a small cost of model accuracy loss (below 1%) with a small injection rate (around 1%), even under the weakest assumption wherein the adversary has no knowledge either of the original training data or the classifier model.
Submitted 30 August, 2018;
originally announced August 2018.
-
Variance propagation for density surface models
Authors:
Mark V Bravington,
David L Miller,
Sharon L Hedley
Abstract:
Spatially-explicit estimates of population density, together with appropriate estimates of uncertainty, are required in many management contexts. Density Surface Models (DSMs) are a two-stage approach for estimating spatially-varying density from distance-sampling data. First, detection probabilities -- perhaps depending on covariates -- are estimated based on details of individual encounters; next, local densities are estimated using a GAM, by fitting local encounter rates to location and/or spatially-varying covariates while allowing for the estimated detectabilities. One criticism of DSMs has been that uncertainty from the two stages is not usually propagated correctly into the final variance estimates. We show how to reformulate a DSM so that the uncertainty in detection probability from the distance sampling stage (regardless of its complexity) is captured as an extra random effect in the GAM stage. In effect, we refit an approximation to the detection function model at the same time as fitting the spatial model. This allows straightforward computation of the overall variance via exactly the same software already needed to fit the GAM. A further extension allows for spatial variation in group size, which can be an important covariate for detectability as well as directly affecting abundance. We illustrate these models using point transect survey data of Island Scrub-Jays on Santa Cruz Island, CA and harbour porpoise from the SCANS-II line transect survey of European waters.
Submitted 26 December, 2020; v1 submitted 20 July, 2018;
originally announced July 2018.
-
Low-dose cryo electron ptychography via non-convex Bayesian optimization
Authors:
Philipp Michael Pelz,
Wen Xuan Qiu,
Robert Bücker,
Günther Kassier,
R. J. Dwayne Miller
Abstract:
Electron ptychography has seen a recent surge of interest for phase sensitive imaging at atomic or near-atomic resolution. However, applications are so far mainly limited to radiation-hard samples because the required doses are too high for imaging biological samples at high resolution. We propose the use of non-convex, Bayesian optimization to overcome this problem and reduce the dose required for successful reconstruction by two orders of magnitude compared to previous experiments. We suggest using this method for imaging single biological macromolecules at cryogenic temperatures and demonstrate 2D single-particle reconstructions from simulated data with a resolution of 7.9 Å at a dose of 20 $e^-/Å^2$. When averaging over only 15 low-dose datasets, a resolution of 4 Å is possible for large macromolecular complexes. With its independence from microscope transfer function, direct recovery of phase contrast and better scaling of signal-to-noise ratio, cryo-electron ptychography may become a promising alternative to Zernike phase-contrast microscopy.
Submitted 19 February, 2017;
originally announced February 2017.
-
ATD: Anomalous Topic Discovery in High Dimensional Discrete Data
Authors:
Hossein Soleimani,
David J. Miller
Abstract:
We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our proposed method detects groups (clusters) of anomalies; i.e. sets of points which collectively exhibit abnormal patterns. In many applications this can lead to better understanding of the nature of the a…
▽ More
We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our proposed method detects groups (clusters) of anomalies; i.e. sets of points which collectively exhibit abnormal patterns. In many applications this can lead to better understanding of the nature of the atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case where the atypical patterns exhibit on only a small (salient) subset of the very high dimensional feature space. Individual AD techniques and techniques that detect anomalies using all the features typically fail to detect such anomalies, but our method can detect such instances collectively, discover the shared anomalous patterns exhibited by them, and identify the subsets of salient features. In this paper, we focus on detecting anomalous topics in a batch of text documents, develo** our algorithm based on topic models. Results of our experiments show that our method can accurately detect anomalous topics and salient features (words) under each such topic in a synthetic data set and two real-world text corpora and achieves better performance compared to both standard group AD and individual AD techniques. All required code to reproduce our experiments is available from https://github.com/hsoleimani/ATD
Submitted 20 May, 2016; v1 submitted 20 December, 2015;
originally announced December 2015.
-
Asymmetric Independence Model for Detecting Interactions between Variables
Authors:
Guoqiang Yu,
David J. Miller,
Carl D. Langefeld,
David M. Herrington,
Yue Wang
Abstract:
Detecting complex interactions among risk factors in case-control studies is a fundamental task in clinical and population research. However, though hypothesis testing using logistic regression (LR) is a convenient solution, the LR framework is poorly powered and ill-suited under several common circumstances in practice including missing or unmeasured risk factors, imperfectly correlated "surrogates", and multiple disease sub-types. The weakness of LR in these settings is related to the way in which the null hypothesis is defined. Here we propose the Asymmetric Independence Model (AIM) as a biologically-inspired alternative to LR, based on the key observation that the mechanisms associated with acquiring a "disease" versus maintaining "health" are asymmetric. We prove mathematically that, unlike LR, AIM is a robust model under the abovementioned confounding scenarios. Further, we provide a mathematical definition of a "synergistic" interaction, and prove that theoretically AIM has better power than LR for such interactions. We then experimentally show the superior performance of AIM as compared to LR on both simulations and four real datasets. While the principal application here involves genetic or environmental variables in the life sciences, our methodology is readily applied to other types of measurements and inferences, e.g. in the social sciences.
Submitted 10 February, 2015;
originally announced February 2015.
-
Convex Analysis of Mixtures for Separating Non-negative Well-grounded Sources
Authors:
Yitan Zhu,
Niya Wang,
David J. Miller,
Yue Wang
Abstract:
Blind Source Separation (BSS) has proven to be a powerful tool for the analysis of composite patterns in engineering and science. We introduce Convex Analysis of Mixtures (CAM) for separating non-negative well-grounded sources, which learns the mixing matrix by identifying the lateral edges of the convex data scatter plot. We prove a sufficient and necessary condition for identifying the mixing matrix through edge detection, which also serves as the foundation for CAM to be applied not only to the exact-determined and over-determined cases, but also to the under-determined case. We show the optimality of the edge detection strategy, even for cases where source well-groundedness is not strictly satisfied. The CAM algorithm integrates plug-in noise filtering using sector-based clustering, an efficient geometric convex analysis scheme, and stability-based model order selection. We demonstrate the principle of CAM on simulated data and numerically mixed natural images. The superior performance of CAM against a panel of benchmark BSS techniques is demonstrated on numerically mixed gene expression data. We then apply CAM to dissect dynamic contrast-enhanced magnetic resonance imaging data taken from breast tumors and time-course microarray gene expression data derived from in-vivo muscle regeneration in mice, both producing biologically plausible decomposition results.
Submitted 10 December, 2015; v1 submitted 27 June, 2014;
originally announced June 2014.
-
IsoDOT Detects Differential RNA-isoform Expression/Usage with respect to a Categorical or Continuous Covariate with High Sensitivity and Specificity
Authors:
Wei Sun,
Yufeng Liu,
James J. Crowley,
Ting-Huei Chen,
Hua Zhou,
Haitao Chu,
Shunping Huang,
Pei-Fen Kuan,
Yuan Li,
Darla Miller,
Ginger Shaw,
Yichao Wu,
Vasyl Zhabotynsky,
Leonard McMillan,
Fei Zou,
Patrick F. Sullivan,
Fernando Pardo-Manuel de Villena
Abstract:
We have developed a statistical method named IsoDOT to assess differential isoform expression (DIE) and differential isoform usage (DIU) using RNA-seq data. Here isoform usage refers to relative isoform expression given the total expression of the corresponding gene. IsoDOT performs two tasks that cannot be accomplished by existing methods: to test DIE/DIU with respect to a continuous covariate, and to test DIE/DIU for one case versus one control. The latter task is not an uncommon situation in practice, e.g., comparing the paternal and maternal alleles of one individual or comparing tumor and normal samples of one cancer patient. Simulation studies demonstrate the high sensitivity and specificity of IsoDOT. We apply IsoDOT to study the effects of haloperidol treatment on the mouse transcriptome and identify a group of genes whose isoform usages respond to haloperidol treatment.
Submitted 29 October, 2014; v1 submitted 1 February, 2014;
originally announced February 2014.
-
Parsimonious Topic Models with Salient Word Discovery
Authors:
Hossein Soleimani,
David J. Miller
Abstract:
We propose a parsimonious topic model for text corpora. In related models such as Latent Dirichlet Allocation (LDA), all words are modeled topic-specifically, even though many words occur with similar frequencies across different topics. Our modeling determines salient words for each topic, which have topic-specific probabilities, with the rest explained by a universal shared model. Further, in LDA all topics are in principle present in every document. By contrast, our model gives sparse topic representation, determining the (small) subset of relevant topics for each document. We derive a Bayesian Information Criterion (BIC), balancing model complexity and goodness of fit. Here, interestingly, we identify an effective sample size and corresponding penalty specific to each parameter type in our model. We minimize BIC to jointly determine our entire model -- the topic-specific words, document-specific topics, all model parameter values, and the total number of topics -- in a wholly unsupervised fashion. Results on three text corpora and an image dataset show that our model achieves higher test set likelihood and better agreement with ground-truth class labels, compared to LDA and to a model designed to incorporate sparsity.
Submitted 11 September, 2014; v1 submitted 22 January, 2014;
originally announced January 2014.