-
Targeted active learning for probabilistic models
Authors:
Christopher Tosh,
Mauricio Tec,
Wesley Tansey
Abstract:
A fundamental task in science is to design experiments that yield valuable insights about the system under study. Mathematically, these insights can be represented as a utility or risk function that shapes the value of conducting each experiment. We present PDBAL, a targeted active learning method that adaptively designs experiments to maximize scientific utility. PDBAL takes a user-specified risk…
▽ More
A fundamental task in science is to design experiments that yield valuable insights about the system under study. Mathematically, these insights can be represented as a utility or risk function that shapes the value of conducting each experiment. We present PDBAL, a targeted active learning method that adaptively designs experiments to maximize scientific utility. PDBAL takes a user-specified risk function and combines it with a probabilistic model of the experimental outcomes to choose designs that rapidly converge on a high-utility model. We prove theoretical bounds on the label complexity of PDBAL and provide fast closed-form solutions for designing experiments with common exponential family likelihoods. In simulation studies, PDBAL consistently outperforms standard untargeted approaches that focus on maximizing expected information gain over the design space. Finally, we demonstrate the scientific potential of PDBAL through a study on a large cancer drug screen dataset where PDBAL quickly recovers the most efficacious drugs with a small fraction of the total number of experiments.
△ Less
Submitted 21 October, 2022;
originally announced October 2022.
-
DIET: Conditional independence testing with marginal dependence measures of residual information
Authors:
Mukund Sudarshan,
Aahlad Manas Puli,
Wesley Tansey,
Rajesh Ranganath
Abstract:
Conditional randomization tests (CRTs) assess whether a variable $x$ is predictive of another variable $y$, having observed covariates $z$. CRTs require fitting a large number of predictive models, which is often computationally intractable. Existing solutions to reduce the cost of CRTs typically split the dataset into a train and test portion, or rely on heuristics for interactions, both of which…
▽ More
Conditional randomization tests (CRTs) assess whether a variable $x$ is predictive of another variable $y$, having observed covariates $z$. CRTs require fitting a large number of predictive models, which is often computationally intractable. Existing solutions to reduce the cost of CRTs typically split the dataset into a train and test portion, or rely on heuristics for interactions, both of which lead to a loss in power. We propose the decoupled independence test (DIET), an algorithm that avoids both of these issues by leveraging marginal independence statistics to test conditional independence relationships. DIET tests the marginal independence of two random variables: $F(x \mid z)$ and $F(y \mid z)$ where $F(\cdot \mid z)$ is a conditional cumulative distribution function (CDF). These variables are termed "information residuals." We give sufficient conditions for DIET to achieve finite sample type-1 error control and power greater than the type-1 error rate. We then prove that when using the mutual information between the information residuals as a test statistic, DIET yields the most powerful conditionally valid test. Finally, we show DIET achieves higher power than other tractable CRTs on several synthetic and real benchmarks.
△ Less
Submitted 11 April, 2023; v1 submitted 17 August, 2022;
originally announced August 2022.
-
Deep Direct Likelihood Knockoffs
Authors:
Mukund Sudarshan,
Wesley Tansey,
Rajesh Ranganath
Abstract:
Predictive modeling often uses black box machine learning methods, such as deep neural networks, to achieve state-of-the-art performance. In scientific domains, the scientist often wishes to discover which features are actually important for making the predictions. These discoveries may lead to costly follow-up experiments and as such it is important that the error rate on discoveries is not too h…
▽ More
Predictive modeling often uses black box machine learning methods, such as deep neural networks, to achieve state-of-the-art performance. In scientific domains, the scientist often wishes to discover which features are actually important for making the predictions. These discoveries may lead to costly follow-up experiments and as such it is important that the error rate on discoveries is not too high. Model-X knockoffs enable important features to be discovered with control of the FDR. However, knockoffs require rich generative models capable of accurately modeling the knockoff features while ensuring they obey the so-called "swap" property. We develop Deep Direct Likelihood Knockoffs (DDLK), which directly minimizes the KL divergence implied by the knockoff swap property. DDLK consists of two stages: it first maximizes the explicit likelihood of the features, then minimizes the KL divergence between the joint distribution of features and knockoffs and any swap between them. To ensure that the generated knockoffs are valid under any possible swap, DDLK uses the Gumbel-Softmax trick to optimize the knockoff generator under the worst-case swap. We find DDLK has higher power than baselines while controlling the false discovery rate on a variety of synthetic and real benchmarks including a task involving a large dataset from one of the epicenters of COVID-19.
△ Less
Submitted 31 July, 2020;
originally announced July 2020.
-
A Bayesian Model of Dose-Response for Cancer Drug Studies
Authors:
Wesley Tansey,
Christopher Tosh,
David M. Blei
Abstract:
Exploratory cancer drug studies test multiple tumor cell lines against multiple candidate drugs. The goal in each paired (cell line, drug) experiment is to map out the dose-response curve of the cell line as the dose level of the drug increases. We propose Bayesian Tensor Filtering (BTF), a hierarchical Bayesian model for dose-response modeling in multi-sample, multi-treatment cancer drug studies.…
▽ More
Exploratory cancer drug studies test multiple tumor cell lines against multiple candidate drugs. The goal in each paired (cell line, drug) experiment is to map out the dose-response curve of the cell line as the dose level of the drug increases. We propose Bayesian Tensor Filtering (BTF), a hierarchical Bayesian model for dose-response modeling in multi-sample, multi-treatment cancer drug studies. BTF uses low-dimensional embeddings to share statistical strength between similar drugs and similar cell lines. Structured shrinkage priors in BTF encourage smoothness in the dose-response curves while remaining adaptive to sharp jumps when the data call for it. We focus on a pair of cancer drug studies exhibiting a particular pathology in their experimental design, leading us to a non-conjugate monotone mixture-of-Gammas likelihood. To perform posterior inference, we develop a variant of the elliptical slice sampling algorithm for sampling from linearly-constrained multivariate normal priors with non-conjugate likelihoods. In benchmarks, BTF outperforms state-of-the-art methods for covariance regression and dynamic Poisson matrix factorization. On the two cancer drug studies, BTF outperforms the current standard approach in biology and reveals potential new biomarkers of drug sensitivity in cancer. Code is available at https://github.com/tansey/functionalmf.
△ Less
Submitted 22 March, 2021; v1 submitted 10 June, 2019;
originally announced June 2019.
-
Interpreting Black Box Models via Hypothesis Testing
Authors:
Collin Burns,
Jesse Thomason,
Wesley Tansey
Abstract:
In science and medicine, model interpretations may be reported as discoveries of natural phenomena or used to guide patient treatments. In such high-stakes tasks, false discoveries may lead investigators astray. These applications would therefore benefit from control over the finite-sample error rate of interpretations. We reframe black box model interpretability as a multiple hypothesis testing p…
▽ More
In science and medicine, model interpretations may be reported as discoveries of natural phenomena or used to guide patient treatments. In such high-stakes tasks, false discoveries may lead investigators astray. These applications would therefore benefit from control over the finite-sample error rate of interpretations. We reframe black box model interpretability as a multiple hypothesis testing problem. The task is to discover "important" features by testing whether the model prediction is significantly different from what would be expected if the features were replaced with uninformative counterfactuals. We propose two testing methods: one that provably controls the false discovery rate but which is not yet feasible for large-scale applications, and an approximate testing method which can be applied to real-world data sets. In simulation, both tests have high power relative to existing interpretability methods. When applied to state-of-the-art vision and language models, the framework selects features that intuitively explain model predictions. The resulting explanations have the additional advantage that they are themselves easy to interpret.
△ Less
Submitted 17 August, 2020; v1 submitted 29 March, 2019;
originally announced April 2019.
-
Black Box FDR
Authors:
Wesley Tansey,
Yixin Wang,
David M. Blei,
Raul Rabadan
Abstract:
Analyzing large-scale, multi-experiment studies requires scientists to test each experimental outcome for statistical significance and then assess the results as a whole. We present Black Box FDR (BB-FDR), an empirical-Bayes method for analyzing multi-experiment studies when many covariates are gathered per experiment. BB-FDR learns a series of black box predictive models to boost power and contro…
▽ More
Analyzing large-scale, multi-experiment studies requires scientists to test each experimental outcome for statistical significance and then assess the results as a whole. We present Black Box FDR (BB-FDR), an empirical-Bayes method for analyzing multi-experiment studies when many covariates are gathered per experiment. BB-FDR learns a series of black box predictive models to boost power and control the false discovery rate (FDR) at two stages of study analysis. In Stage 1, it uses a deep neural network prior to report which experiments yielded significant outcomes. In Stage 2, a separate black box model of each covariate is used to select features that have significant predictive power across all experiments. In benchmarks, BB-FDR outperforms competing state-of-the-art methods in both stages of analysis. We apply BB-FDR to two real studies on cancer drug efficacy. For both studies, BB-FDR increases the proportion of significant outcomes discovered and selects variables that reveal key genomic drivers of drug sensitivity and resistance in cancer.
△ Less
Submitted 8 June, 2018;
originally announced June 2018.
-
Diet2Vec: Multi-scale analysis of massive dietary data
Authors:
Wesley Tansey,
Edward W. Lowe Jr.,
James G. Scott
Abstract:
Smart phone apps that enable users to easily track their diets have become widespread in the last decade. This has created an opportunity to discover new insights into obesity and weight loss by analyzing the eating habits of the users of such apps. In this paper, we present diet2vec: an approach to modeling latent structure in a massive database of electronic diet journals. Through an iterative c…
▽ More
Smart phone apps that enable users to easily track their diets have become widespread in the last decade. This has created an opportunity to discover new insights into obesity and weight loss by analyzing the eating habits of the users of such apps. In this paper, we present diet2vec: an approach to modeling latent structure in a massive database of electronic diet journals. Through an iterative contract-and-expand process, our model learns real-valued embeddings of users' diets, as well as embeddings for individual foods and meals. We demonstrate the effectiveness of our approach on a real dataset of 55K users of the popular diet-tracking app LoseIt\footnote{http://www.loseit.com/}. To the best of our knowledge, this is the largest fine-grained diet tracking study in the history of nutrition and obesity research. Our results suggest that diet2vec finds interpretable results at all levels, discovering intuitive representations of foods, meals, and diets.
△ Less
Submitted 1 December, 2016;
originally announced December 2016.