-
Recent Advances, Applications, and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2023 Symposium
Authors:
Hyewon Jeong,
Sarah Jabbour,
Yuzhe Yang,
Rahul Thapta,
Hussein Mozannar,
William Jongwon Han,
Nikita Mehandru,
Michael Wornow,
Vladislav Lialin,
Xin Liu,
Alejandro Lozano,
Jiacheng Zhu,
Rafal Dariusz Kocielnik,
Keith Harrigian,
Haoran Zhang,
Edward Lee,
Milos Vukadinovic,
Aparna Balagopalan,
Vincent Jeanselme,
Katherine Matton,
Ilker Demirel,
Jason Fries,
Parisa Rashidi,
Brett Beaulieu-Jones,
Xuhai Orson Xu,
et al. (18 additional authors not shown)
Abstract:
The third ML4H symposium was held in person on December 10, 2023, in New Orleans, Louisiana, USA. The symposium included research roundtable sessions to foster discussions between participants and senior researchers on timely and relevant topics for the ML4H community. Encouraged by the successful virtual roundtables in the previous year, we organized eleven in-person roundtables and four virtual roundtables at ML4H 2022. The organization of the research roundtables at the conference involved 17 Senior Chairs and 19 Junior Chairs across 11 tables. Each roundtable session included invited senior chairs (with substantial experience in the field), junior chairs (responsible for facilitating the discussion), and attendees from diverse backgrounds with interest in the session's topic. Herein we detail the organization process and compile takeaways from these roundtable discussions, including recent advances, applications, and open challenges for each topic. We conclude with a summary and lessons learned across all roundtables. This document serves as a comprehensive review paper, summarizing the recent advancements in machine learning for healthcare as contributed by foremost researchers in the field.
Submitted 5 April, 2024; v1 submitted 3 March, 2024;
originally announced March 2024.
-
Event-Based Contrastive Learning for Medical Time Series
Authors:
Hyewon Jeong,
Nassim Oufattole,
Matthew Mcdermott,
Aparna Balagopalan,
Bryan Jangeesingh,
Marzyeh Ghassemi,
Collin Stultz
Abstract:
In clinical practice, one often needs to identify whether a patient is at high risk of adverse outcomes after some key medical event. For example, quantifying the risk of adverse outcomes after an acute cardiovascular event helps healthcare providers identify those patients at the highest risk of poor outcomes; i.e., patients who benefit from invasive therapies that can lower their risk. Assessing the risk of adverse outcomes, however, is challenging due to the complexity, variability, and heterogeneity of longitudinal medical data, especially for individuals suffering from chronic diseases like heart failure. In this paper, we introduce Event-Based Contrastive Learning (EBCL) - a method for learning embeddings of heterogeneous patient data that preserves temporal information before and after key index events. We demonstrate that EBCL can be used to construct models that yield improved performance on important downstream tasks relative to other pretraining methods. We develop and test the method using a cohort of heart failure patients obtained from a large hospital network and the publicly available MIMIC-IV dataset consisting of patients in an intensive care unit at a large tertiary care center. On both cohorts, EBCL pretraining yields models that are performant with respect to a number of downstream tasks, including mortality, hospital readmission, and length of stay. In addition, unsupervised EBCL embeddings effectively cluster heart failure patients into subgroups with distinct outcomes, thereby providing information that helps identify new heart failure phenotypes. The contrastive framework around the index event can be adapted to a wide array of time-series datasets and provides information that can be used to guide personalized care.
Submitted 19 April, 2024; v1 submitted 15 December, 2023;
originally announced December 2023.
-
The Role of Relevance in Fair Ranking
Authors:
Aparna Balagopalan,
Abigail Z. Jacobs,
Asia Biega
Abstract:
Online platforms mediate access to opportunity: relevance-based rankings create and constrain options by allocating exposure to job openings and job candidates in hiring platforms, or sellers in a marketplace. In order to do so responsibly, these socially consequential systems employ various fairness measures and interventions, many of which seek to allocate exposure based on worthiness. Because these constructs are typically not directly observable, platforms must instead resort to using proxy scores such as relevance and infer them from behavioral signals such as searcher clicks. Yet, it remains an open question whether relevance fulfills its role as such a worthiness score in high-stakes fair rankings. In this paper, we combine perspectives and tools from the social sciences, information retrieval, and fairness in machine learning to derive a set of desired criteria that relevance scores should satisfy in order to meaningfully guide fairness interventions. We then empirically show that not all of these criteria are met in a case study of relevance inferred from biased user click data. We assess the impact of these violations on the estimated system fairness and analyze whether existing fairness interventions may mitigate the identified issues. Our analyses and results surface the pressing need for new approaches to relevance collection and generation that are suitable for use in fair ranking.
Submitted 6 June, 2023; v1 submitted 9 May, 2023;
originally announced May 2023.
-
The Road to Explainability is Paved with Bias: Measuring the Fairness of Explanations
Authors:
Aparna Balagopalan,
Haoran Zhang,
Kimia Hamidieh,
Thomas Hartvigsen,
Frank Rudzicz,
Marzyeh Ghassemi
Abstract:
Machine learning models in safety-critical settings like healthcare are often blackboxes: they contain a large number of parameters which are not transparent to users. Post-hoc explainability methods where a simple, human-interpretable model imitates the behavior of these blackbox models are often proposed to help users trust model predictions. In this work, we audit the quality of such explanations for different protected subgroups using real data from four settings in finance, healthcare, college admissions, and the US justice system. Across two different blackbox model architectures and four popular explainability methods, we find that the approximation quality of explanation models, also known as the fidelity, differs significantly between subgroups. We also demonstrate that pairing explainability methods with recent advances in robust machine learning can improve explanation fairness in some settings. However, we highlight the importance of communicating details of non-zero fidelity gaps to users, since a single solution might not exist across all settings. Finally, we discuss the implications of unfair explanation models as a challenging and understudied problem facing the machine learning community.
Submitted 2 June, 2022; v1 submitted 6 May, 2022;
originally announced May 2022.
-
Quantifying the Task-Specific Information in Text-Based Classifications
Authors:
Zining Zhu,
Aparna Balagopalan,
Marzyeh Ghassemi,
Frank Rudzicz
Abstract:
Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but the high performance can result from superficial, surface-level cues (Bender and Koller, 2020; Niven and Kao, 2020). These surface cues, as the "shortcuts" inherent in the datasets, do not contribute to the task-specific information (TSI) of the classification tasks. While it is essential to look at the model performance, it is also important to understand the datasets. In this paper, we consider this question: Apart from the information introduced by the shortcut features, how much task-specific information is required to classify a dataset? We formulate this quantity in an information-theoretic framework. While this quantity is hard to compute, we approximate it with a fast and stable method. TSI quantifies the amount of linguistic knowledge, modulo a set of predefined shortcuts, that contributes to classifying a sample from each dataset. This framework allows us to compare across datasets, saying that, apart from a set of "shortcut features", classifying each sample in the Multi-NLI task involves around 0.4 nats more TSI than in the Quora Question Pair.
Submitted 17 October, 2021;
originally announced October 2021.
-
Comparing Acoustic-based Approaches for Alzheimer's Disease Detection
Authors:
Aparna Balagopalan,
Jekaterina Novikova
Abstract:
Robust strategies for Alzheimer's disease (AD) detection are important, given the high prevalence of AD. In this paper, we study the performance and generalizability of three approaches for AD detection from speech on the recent ADReSSo challenge dataset: 1) using conventional acoustic features, 2) using novel pre-trained acoustic embeddings, and 3) combining acoustic features and embeddings. We find that while feature-based approaches have a higher precision, classification approaches relying on pre-trained embeddings have higher and more balanced cross-validated performance across multiple metrics. Further, embedding-only approaches are more generalizable. Our best model outperforms the acoustic baseline in the challenge by 2.8%.
Submitted 15 September, 2022; v1 submitted 2 June, 2021;
originally announced June 2021.
-
Augmenting BERT Carefully with Underrepresented Linguistic Features
Authors:
Aparna Balagopalan,
Jekaterina Novikova
Abstract:
Fine-tuned Bidirectional Encoder Representations from Transformers (BERT)-based sequence classification models have proven to be effective for detecting Alzheimer's Disease (AD) from transcripts of human speech. However, previous research shows it is possible to improve BERT's performance on various tasks by augmenting the model with additional information. In this work, we use probing tasks as introspection techniques to identify linguistic information not well-represented in various layers of BERT, but important for the AD detection task. We externally supplement the linguistic features for which BERT's representations are found to be insufficient with hand-crafted features, and show that jointly fine-tuning BERT in combination with these features improves the performance of AD classification by up to 5% over fine-tuned BERT alone.
Submitted 11 November, 2020;
originally announced November 2020.
-
Fantastic Features and Where to Find Them: Detecting Cognitive Impairment with a Subsequence Classification Guided Approach
Authors:
Benjamin Eyre,
Aparna Balagopalan,
Jekaterina Novikova
Abstract:
Despite the widely reported success of embedding-based machine learning methods on natural language processing tasks, the use of more easily interpreted engineered features remains common in fields such as cognitive impairment (CI) detection. Manually engineering features from noisy text is time- and resource-consuming, and can potentially result in features that do not enhance model performance. To combat this, we describe a new approach to feature engineering that leverages sequential machine learning models and domain knowledge to predict which features help enhance performance. We provide a concrete example of this method on a standard data set of CI speech and demonstrate that CI classification accuracy improves by 2.3% over a strong baseline when using features produced by this method. This demonstration provides an example of how this method can be used to assist classification in fields where interpretability is important, such as health care.
Submitted 13 October, 2020;
originally announced October 2020.
-
To BERT or Not To BERT: Comparing Speech and Language-based Approaches for Alzheimer's Disease Detection
Authors:
Aparna Balagopalan,
Benjamin Eyre,
Frank Rudzicz,
Jekaterina Novikova
Abstract:
Research related to automatically detecting Alzheimer's disease (AD) is important, given the high prevalence of AD and the high cost of traditional methods. Since AD significantly affects the content and acoustics of spontaneous speech, natural language processing and machine learning provide promising techniques for reliably detecting AD. We compare and contrast the performance of two such approaches for AD detection on the recent ADReSS challenge dataset: 1) using domain knowledge-based hand-crafted features that capture linguistic and acoustic phenomena, and 2) fine-tuning Bidirectional Encoder Representations from Transformers (BERT)-based sequence classification models. We also compare multiple feature-based regression models for a neuropsychological score task in the challenge. We observe that fine-tuned BERT models, given the relative importance of linguistics in cognitive impairment detection, outperform feature-based approaches on the AD detection task.
Submitted 26 July, 2020;
originally announced August 2020.
-
Cross-Language Aphasia Detection using Optimal Transport Domain Adaptation
Authors:
Aparna Balagopalan,
Jekaterina Novikova,
Matthew B. A. McDermott,
Bret Nestor,
Tristan Naumann,
Marzyeh Ghassemi
Abstract:
Multi-language speech datasets are scarce and often have small sample sizes in the medical domain. Robust transfer of linguistic features across languages could improve rates of early diagnosis and therapy for speakers of low-resource languages when detecting health conditions from speech. We utilize out-of-domain, unpaired, single-speaker, healthy speech data for training multiple Optimal Transport (OT) domain adaptation systems. We learn mappings from other languages to English and detect aphasia from linguistic characteristics of speech, and show that OT domain adaptation improves aphasia detection over unilingual baselines for French (6% increased F1) and Mandarin (5% increased F1). Further, we show that adding aphasic data to the domain adaptation system significantly increases performance for both French and Mandarin, increasing the F1 scores further (10% and 8% increase in F1 scores for French and Mandarin, respectively, over unilingual baselines).
Submitted 4 December, 2019;
originally announced December 2019.
-
Lexical Features Are More Vulnerable, Syntactic Features Have More Predictive Power
Authors:
Jekaterina Novikova,
Aparna Balagopalan,
Ksenia Shkaruta,
Frank Rudzicz
Abstract:
Understanding the vulnerability of linguistic features extracted from noisy text is important for both developing better health text classification models and for interpreting vulnerabilities of natural language models. In this paper, we investigate how generic language characteristics, such as syntax or the lexicon, are impacted by artificial text alterations. The vulnerability of features is analysed from two perspectives: (1) the level of feature value change, and (2) the level of change of feature predictive power as a result of text modifications. We show that lexical features are more sensitive to text modifications than syntactic ones. However, we also demonstrate that these smaller changes of syntactic features have a stronger influence on classification performance downstream, compared to the impact of changes to lexical features. Results are validated across three datasets representing different text-classification tasks, with different levels of lexical and syntactic complexity of both conversational and written language.
Submitted 30 September, 2019;
originally announced October 2019.
-
Impact of ASR on Alzheimer's Disease Detection: All Errors are Equal, but Deletions are More Equal than Others
Authors:
Aparna Balagopalan,
Ksenia Shkaruta,
Jekaterina Novikova
Abstract:
Automatic Speech Recognition (ASR) is a critical component of any fully-automated speech-based dementia detection model. However, despite years of speech recognition research, little is known about the impact of ASR accuracy on dementia detection. In this paper, we experiment with controlled amounts of artificially generated ASR errors and investigate their influence on dementia detection. We find that deletion errors affect detection performance the most, due to their impact on the features of syntactic complexity and discourse representation in speech. We show the trend to be generalisable across two different datasets for cognitive impairment detection. As a conclusion, we propose optimising the ASR to reflect a higher penalty for deletion errors in order to improve dementia detection performance.
Submitted 13 October, 2020; v1 submitted 2 April, 2019;
originally announced April 2019.
-
The Effect of Heterogeneous Data for Alzheimer's Disease Detection from Speech
Authors:
Aparna Balagopalan,
Jekaterina Novikova,
Frank Rudzicz,
Marzyeh Ghassemi
Abstract:
Speech datasets for identifying Alzheimer's disease (AD) are generally restricted to participants performing a single task, e.g. describing an image shown to them. As a result, models trained on linguistic features derived from such datasets may not be generalizable across tasks. Building on prior work demonstrating that same-task data of healthy participants helps improve AD detection on a single-task dataset of pathological speech, we augment an AD-specific dataset consisting of subjects describing a picture with multi-task healthy data. We demonstrate that normative data from multiple speech-based tasks helps improve AD detection by up to 9%. Visualization of decision boundaries reveals that models trained on a combination of structured picture descriptions and unstructured conversational speech have the least out-of-task error and show the most potential to generalize to multiple tasks. We analyze the impact of the age of the added samples and whether they affect fairness in classification. We also provide explanations for a possible inductive bias effect across tasks using model-agnostic feature anchors. This work highlights the need for heterogeneous datasets for encoding changes in multiple facets of cognition and for developing a task-independent AD detection model.
Submitted 29 November, 2018;
originally announced November 2018.
-
ReGAN: RE[LAX|BAR|INFORCE] based Sequence Generation using GANs
Authors:
Aparna Balagopalan,
Satya Gorti,
Mathieu Ravaut,
Raeid Saqur
Abstract:
Generative Adversarial Networks (GANs) have seen steep ascension to the peak of ML research zeitgeist in recent years. Mostly catalyzed by its success in the domain of image generation, the technique has seen a wide range of adoption in a variety of other problem domains. Although GANs have had a lot of success in producing more realistic images than other approaches, they have only seen limited use for text sequences. Generation of longer sequences compounds this problem. Most recently, SeqGAN (Yu et al., 2017) has shown improvements in adversarial evaluation and results with human evaluation compared to an MLE-trained baseline. The main contributions of this paper are three-fold: 1. We show results for sequence generation using a GAN architecture with efficient policy gradient estimators, 2. We attain improved training stability, and 3. We perform a comparative study of recent unbiased low variance gradient estimation techniques such as REBAR (Tucker et al., 2017), RELAX (Grathwohl et al., 2018) and REINFORCE (Williams, 1992). Using a simple grammar on synthetic datasets with varying length, we indicate the quality of sequences generated by the model.
Submitted 7 May, 2018;
originally announced May 2018.