Extrinsically-Focused Evaluation of Omissions in Medical Summarization

Authors: Elliot Schumacher, Daniel Rosenthal, Varun Nair, Luladay Price, Geoffrey Tso, Anitha Kannan

Abstract: The goal of automated summarization techniques (Paice, 1990; Kupiec et al, 1995) is to condense text by focusing on the most critical information. Generative large language models (LLMs) have been shown to be robust summarizers, yet traditional metrics struggle to capture resulting performance (Goyal et al, 2022) in more powerful LLMs. In safety-critical domains such as medicine, more rigorous evaluation is required, especially given the potential for LLMs to omit important information in the resulting summary. We propose MED-OMIT, a new omission benchmark for medical summarization. Given a doctor-patient conversation and a generated summary, MED-OMIT categorizes the chat into a set of facts and identifies which are omitted from the summary. We further propose to determine fact importance by simulating the impact of each fact on a downstream clinical task: differential diagnosis (DDx) generation. MED-OMIT leverages LLM prompt-based approaches which categorize the importance of facts and cluster them as supporting or negating evidence to the diagnosis. We evaluate MED-OMIT on a publicly-released dataset of patient-doctor conversations and find that MED-OMIT captures omissions better than alternative metrics.

Submitted 14 November, 2023; originally announced November 2023.

arXiv:2304.10643 [pdf, other]

Activity Classification Using Unsupervised Domain Transfer from Body Worn Sensors

Authors: Chaitra Hedge, Gezheng Wen, Layne C. Price

Abstract: Activity classification has become a vital feature of wearable health tracking devices. As innovation in this field grows, wearable devices worn on different parts of the body are emerging. To perform activity classification on a new body location, labeled data corresponding to the new locations are generally required, but this is expensive to acquire. In this work, we present an innovative method to leverage an existing activity classifier, trained on Inertial Measurement Unit (IMU) data from a reference body location (the source domain), in order to perform activity classification on a new body location (the target domain) in an unsupervised way, i.e. without the need for classification labels at the new location. Specifically, given an IMU embedding model trained to perform activity classification at the source domain, we train an embedding model to perform activity classification at the target domain by replicating the embeddings at the source domain. This is achieved using simultaneous IMU measurements at the source and target domains. The replicated embeddings at the target domain are used by a classification model that has previously been trained on the source domain to perform activity classification at the target domain. We have evaluated the proposed methods on three activity classification datasets, PAMAP2, MHealth, and Opportunity, yielding high F1 scores of 67.19%, 70.40%, and 68.34%, respectively, when the source domain is the wrist and the target domain is the torso.

Submitted 20 April, 2023; originally announced April 2023.

arXiv:2211.06428 [pdf, other]

Training self-supervised peptide sequence models on artificially chopped proteins

Authors: Gil Sadeh, Zichen Wang, Jasleen Grewal, Huzefa Rangwala, Layne Price

Abstract: Representation learning for proteins has primarily focused on the global understanding of protein sequences regardless of their length. However, shorter proteins (known as peptides) take on distinct structures and functions compared to their longer counterparts. Unfortunately, there are not as many naturally occurring peptides available to be sequenced and therefore less peptide-specific data to train with. In this paper, we propose a new peptide data augmentation scheme, where we train peptide language models on artificially constructed peptides that are small contiguous subsets of longer, wild-type proteins; we refer to the training peptides as "chopped proteins". We evaluate the representation potential of models trained with chopped proteins versus natural peptides and find that training language models with chopped proteins results in more generalized embeddings for short protein sequences. These peptide-specific models also retain information about the original protein they were derived from better than language models trained on full-length proteins. We compare masked language model training objectives to three novel peptide-specific training objectives: next-peptide prediction, contrastive peptide selection and evolution-weighted MLM. We demonstrate improved zero-shot learning performance for a deep mutational scan peptides benchmark.

Submitted 9 November, 2022; originally announced November 2022.
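
Illustration (not from the paper): the abstract above describes building training peptides as small contiguous subsets of longer proteins. A minimal sketch of that idea follows. The function name `chop_protein`, the length bounds, the number of samples, and the example sequence are all assumptions made for illustration; the abstract does not specify how the authors sample or size their chopped proteins.

```python
import random


def chop_protein(sequence, min_len=8, max_len=50, n_samples=5, seed=0):
    """Sample contiguous sub-sequences ("chopped proteins") from one protein.

    The length bounds and sample count are illustrative assumptions,
    not values taken from the paper.
    """
    rng = random.Random(seed)
    upper = min(max_len, len(sequence))
    if upper < min_len:
        # The protein is too short to yield a peptide of the minimum length.
        return []
    peptides = []
    for _ in range(n_samples):
        length = rng.randint(min_len, upper)
        start = rng.randint(0, len(sequence) - length)
        peptides.append(sequence[start:start + length])
    return peptides


# Arbitrary amino-acid string used only as example input.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
peptides = chop_protein(protein)
```

Each returned string is a contiguous window of the input, so it could serve as one augmented training example for a peptide language model under these assumptions.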

arXiv:2210.00116 [pdf, other]

Predicting Cellular Responses with Variational Causal Inference and Refined Relational Information

Authors: Yulun Wu, Robert A. Barton, Zichen Wang, Vassilis N. Ioannidis, Carlo De Donno, Layne C. Price, Luis F. Voloch, George Karypis

Abstract: Predicting the responses of a cell under perturbations may bring important benefits to drug discovery and personalized therapeutics. In this work, we propose a novel graph variational Bayesian causal inference framework to predict a cell's gene expressions under counterfactual perturbations (perturbations that this cell did not factually receive), leveraging information representing biological knowledge in the form of gene regulatory networks (GRNs) to aid individualized cellular response predictions. Aiming at a data-adaptive GRN, we also developed an adjacency matrix updating technique for graph convolutional networks and used it to refine GRNs during pre-training, which generated more insights on gene relations and enhanced model performance. Additionally, we propose a robust estimator within our framework for the asymptotically efficient estimation of marginal perturbation effect, which is yet to be carried out in previous works. With extensive experiments, we exhibited the advantage of our approach over state-of-the-art deep learning models for individual response prediction.

Submitted 17 April, 2023; v1 submitted 30 September, 2022; originally announced October 2022.

arXiv:2209.05935 [pdf, ps, other]

Variational Causal Inference

Authors: Yulun Wu, Layne C. Price, Zichen Wang, Vassilis N. Ioannidis, Robert A. Barton, George Karypis

Abstract: Estimating an individual's potential outcomes under counterfactual treatments is a challenging task for traditional causal inference and supervised learning approaches when the outcome is high-dimensional (e.g. gene expressions, impulse responses, human faces) and covariates are relatively limited. In this case, to construct one's outcome under a counterfactual treatment, it is crucial to leverage individual information contained in its observed factual outcome on top of the covariates. We propose a deep variational Bayesian framework that rigorously integrates two main sources of information for outcome construction under a counterfactual treatment: one source is the individual features embedded in the high-dimensional factual outcome; the other source is the response distribution of similar subjects (subjects with the same covariates) that factually received this treatment of interest.

Submitted 31 January, 2023; v1 submitted 13 September, 2022; originally announced September 2022.

arXiv:2004.11929 [pdf, ps, other]

Robust posterior inference when statistically emulating forward simulations

Authors: Grigor Aslanyan, Richard Easther, Nathan Musoke, Layne C. Price

Abstract: Scientific analyses often rely on slow, but accurate forward models for observable data conditioned on known model parameters. While various emulation schemes exist to approximate these slow calculations, these approaches are only safe if the approximations are well understood and controlled. This workshop submission reviews and updates a previously published method, which has been used in cosmological simulations, to (1) train an emulator while simultaneously estimating posterior probabilities with MCMC and (2) explicitly propagate the emulation error into errors on the posterior probabilities for model parameters. We demonstrate how these techniques can be applied to quickly estimate posterior distributions for parameters of the $Λ$CDM cosmology model, while also gauging the robustness of the emulator approximation.

Submitted 24 April, 2020; originally announced April 2020.

Comments: code available from https://doi.org/10.5281/zenodo.3764460 or https://github.com/auckland-cosmo/LearnAsYouGoEmulator

arXiv:2001.09100 [pdf, other]

Why Temporal Persistence of Biometric Features is so Valuable for Classification Performance

Authors: Lee Friedman, Hal Stern, Larry R. Price, Oleg V. Komogortsev

Abstract: It is generally accepted that relatively more permanent (i.e., more temporally persistent) traits are more valuable for biometric performance than less permanent traits. Although this finding is intuitive, there is no current work identifying exactly where in the biometric analysis temporal persistence makes a difference. In this paper, we answer this question. In a recent report, we introduced the intraclass correlation coefficient (ICC) as an index of temporal persistence for such features. In that report, we also showed that choosing only the most temporally persistent features yielded superior performance in 12 of 14 datasets. Motivated by those empirical results, we present a novel approach using synthetic features to study which aspects of a biometric identification study are influenced by the temporal persistence of features. What we show is that using more temporally persistent features produces effects on the similarity score distributions that explain why this quality is so key to biometric performance. The results identified with the synthetic data are largely reinforced by an analysis of two datasets, one based on eye-movements and one based on gait. There was one difference between the synthetic and real data: In real data, features are intercorrelated, with the level of intercorrelation increasing with increasing ICC. This increased intercorrelation in real data was associated with an increase in the spread of the impostor similarity score distributions. Removing these intercorrelations for real datasets with a decorrelation step produced results which were very similar to those obtained with synthetic features.

Submitted 24 January, 2020; originally announced January 2020.

Comments: 19 pages, 8 figures, 7 tables, 2 Appendices

arXiv:1911.03295 [pdf, other]

Discovering Invariances in Healthcare Neural Networks

Authors: Mohammad Taha Bahadori, Layne C. Price

Abstract: We study the invariance characteristics of pre-trained predictive models by empirically learning transformations on the input that leave the prediction function approximately unchanged. To learn invariant transformations, we minimize the Wasserstein distance between the predictive distribution conditioned on the data instances and the predictive distribution conditioned on the transformed data instances. To avoid finding degenerate or perturbative transformations, we add a similarity regularization to discourage similarity between the data and its transformed values. We theoretically analyze the correctness of the algorithm and the structure of the solutions. Applying the proposed technique to clinical time series data, we discover variables that commonly-used LSTM models do not rely on for their prediction, especially when the LSTM is trained to be adversarially robust. We also analyze the invariances of BioBERT on clinical notes and discover words that it is invariant to.

Submitted 3 March, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

Comments: The extended version

arXiv:1811.08803 [pdf, other]

Distinguishing correlation from causation using genome-wide association studies

Authors: Luke J. O'Connor, Alkes L. Price

Abstract: Genome-wide association studies (GWAS) have emerged as a rich source of genetic clues into disease biology, and they have revealed strong genetic correlations among many diseases and traits. Some of these genetic correlations may reflect causal relationships. We developed a method to quantify causal relationships between genetically correlated traits using GWAS summary association statistics. In particular, our method quantifies what part of the genetic component of trait 1 is also causal for trait 2 using mixed fourth moments $E(α_1^2α_1α_2)$ and $E(α_2^2α_1α_2)$ of the bivariate effect size distribution. If trait 1 is causal for trait 2, then SNPs affecting trait 1 (large $α_1^2$) will have correlated effects on trait 2 (large $α_1α_2$), but not vice versa. We validated this approach in extensive simulations. Across 52 traits (average $N=331$k), we identified 30 putative genetically causal relationships, many novel, including an effect of LDL cholesterol on decreased bone mineral density. More broadly, we demonstrate that it is possible to distinguish between genetic correlation and causation using genetic association data.

Submitted 21 November, 2018; originally announced November 2018.

Comments: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:1811.07216

Report number: ML4H/2018/4

Journal ref: O'Connor, Luke J. and Alkes L. Price. "Distinguishing genetic correlation from causation across 52 diseases and complex traits." Nature genetics (2018)
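
Illustration (not from the paper): the abstract above compares two mixed fourth moments, $E(α_1^2α_1α_2)$ and $E(α_2^2α_1α_2)$. The sketch below only computes their sample averages on a toy set of per-SNP effect sizes to show the asymmetry the abstract describes. The function name `mixed_fourth_moments` and the three-SNP example values are assumptions chosen for illustration; the published method works from GWAS summary statistics and involves estimation steps not reproduced here.

```python
def mixed_fourth_moments(alpha1, alpha2):
    """Return sample estimates of E(a1^2 * a1*a2) and E(a2^2 * a1*a2).

    alpha1 and alpha2 are equal-length lists of per-SNP effect sizes
    on trait 1 and trait 2 (toy inputs, assumed already normalized).
    """
    n = len(alpha1)
    m12 = sum(a1 ** 2 * a1 * a2 for a1, a2 in zip(alpha1, alpha2)) / n
    m21 = sum(a2 ** 2 * a1 * a2 for a1, a2 in zip(alpha1, alpha2)) / n
    return m12, m21


# Toy effects: the SNP with a large trait-1 effect also affects trait 2,
# while the SNP with a large trait-2 effect has no effect on trait 1.
alpha1 = [2.0, 0.0, 1.0]
alpha2 = [1.0, 3.0, 0.0]
m12, m21 = mixed_fourth_moments(alpha1, alpha2)
```

In this toy case the first moment is larger than the second, which matches the pattern the abstract associates with trait 1 being causal for trait 2; real data would need the full method to support any such conclusion.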

arXiv:1711.02033 [pdf, other]

Estimating Cosmological Parameters from the Dark Matter Distribution

Authors: Siamak Ravanbakhsh, Junier Oliva, Sebastien Fromenteau, Layne C. Price, Shirley Ho, Jeff Schneider, Barnabas Poczos

Abstract: A grand challenge of 21st-century cosmology is to accurately estimate the cosmological parameters of our Universe. A major approach to estimating the cosmological parameters is to use the large-scale matter distribution of the Universe. Galaxy surveys provide the means to map out cosmic large-scale structure in three dimensions. Information about galaxy locations is typically summarized in a "single" function of scale, such as the galaxy correlation function or power-spectrum. We show that it is possible to estimate these cosmological parameters directly from the distribution of matter. This paper presents the application of deep 3D convolutional networks to volumetric representation of dark-matter simulations as well as the results obtained using a recently proposed distribution regression framework, showing that machine learning techniques are comparable to, and can sometimes outperform, maximum-likelihood point estimates using "cosmological models". This opens the way to estimating the parameters of our Universe with higher accuracy.

Submitted 6 November, 2017; originally announced November 2017.

Comments: ICML 2016

Showing 1–10 of 10 results for author: Price, L