Showing 1–18 of 18 results for author: McDermott, M B A

  1. arXiv:2406.19653  [pdf, other]

    cs.LG cs.AI

    ACES: Automatic Cohort Extraction System for Event-Stream Datasets

    Authors: Justin Xu, Jack Gallifant, Alistair E. W. Johnson, Matthew B. A. McDermott

    Abstract: Reproducibility remains a significant challenge in machine learning (ML) for healthcare. In this field, datasets, model pipelines, and even task/cohort definitions are often private, leading to a significant barrier in sharing, iterating, and understanding ML results on electronic health record (EHR) datasets. In this paper, we address a significant part of this problem by introducing the Automati…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: For ACES Online Documentation, see https://eventstreamaces.readthedocs.io/en/latest/

  2. arXiv:2401.06091  [pdf, other]

    cs.LG stat.ME

    A Closer Look at AUROC and AUPRC under Class Imbalance

    Authors: Matthew B. A. McDermott, Lasse Hyldig Hansen, Haoran Zhang, Giovanni Angelotti, Jack Gallifant

    Abstract: In machine learning (ML), a widespread adage is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for binary classification tasks with class imbalance. This paper challenges this notion through novel mathematical analysis, illustrating that AUROC and AUPRC can be concisely related in prob…

    Submitted 18 April, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

  3. arXiv:2306.11547  [pdf, other]

    cs.LG

    Event Stream GPT: A Data Pre-processing and Modeling Library for Generative, Pre-trained Transformers over Continuous-time Sequences of Complex Events

    Authors: Matthew B. A. McDermott, Bret Nestor, Peniel Argaw, Isaac Kohane

    Abstract: Generative, pre-trained transformers (GPTs, a.k.a. "Foundation Models") have reshaped natural language processing (NLP) through their versatility in diverse downstream tasks. However, their potential extends far beyond NLP. This paper provides a software utility to help realize this potential, extending the applicability of GPTs to continuous-time sequences of complex events with internal dependen…

    Submitted 21 June, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

  4. arXiv:2112.00179   

    cs.LG

    A collection of the accepted abstracts for the Machine Learning for Health (ML4H) symposium 2021

    Authors: Fabian Falck, Yuyin Zhou, Emma Rocheteau, Liyue Shen, Luis Oala, Girmaw Abebe, Subhrajit Roy, Stephen Pfohl, Emily Alsentzer, Matthew B. A. McDermott

    Abstract: A collection of the accepted abstracts for the Machine Learning for Health (ML4H) symposium 2021. This index is not complete, as some accepted abstracts chose to opt-out of inclusion.

    Submitted 30 November, 2021; originally announced December 2021.

  5. arXiv:2103.10334  [pdf, other]

    cs.LG

    Structure Inducing Pre-Training

    Authors: Matthew B. A. McDermott, Brendan Yap, Peter Szolovits, Marinka Zitnik

    Abstract: Language model pre-training and derived methods are incredibly impactful in machine learning. However, there remains considerable uncertainty on exactly why pre-training helps improve performance for fine-tuning tasks. This is especially true when attempting to adapt language-model pre-training to domains outside of natural language. Here, we analyze this problem by exploring how existing pre-trai…

    Submitted 4 August, 2022; v1 submitted 18 March, 2021; originally announced March 2021.

  6. arXiv:2102.00466  [pdf, other]

    cs.CL cs.AI

    Adversarial Contrastive Pre-training for Protein Sequences

    Authors: Matthew B. A. McDermott, Brendan Yap, Harry Hsu, Di Jin, Peter Szolovits

    Abstract: Recent developments in Natural Language Processing (NLP) demonstrate that large-scale, self-supervised pre-training can be extremely beneficial for downstream tasks. These ideas have been adapted to other domains, including the analysis of the amino acid sequences of proteins. However, to date most attempts on protein sequences rely on direct masked language model style pre-training. In this work,…

    Submitted 31 January, 2021; originally announced February 2021.

  7. arXiv:2011.11554   

    cs.LG

    ML4H Abstract Track 2020

    Authors: Emily Alsentzer, Matthew B. A. McDermott, Fabian Falck, Suproteem K. Sarkar, Subhrajit Roy, Stephanie L. Hyland

    Abstract: A collection of the accepted abstracts for the Machine Learning for Health (ML4H) workshop at NeurIPS 2020. This index is not complete, as some accepted abstracts chose to opt-out of inclusion.

    Submitted 19 November, 2020; originally announced November 2020.

  8. arXiv:2007.10185  [pdf, other]

    cs.LG stat.ML

    A Comprehensive Evaluation of Multi-task Learning and Multi-task Pre-training on EHR Time-series Data

    Authors: Matthew B. A. McDermott, Bret Nestor, Evan Kim, Wancong Zhang, Anna Goldenberg, Peter Szolovits, Marzyeh Ghassemi

    Abstract: Multi-task learning (MTL) is a machine learning technique aiming to improve model performance by leveraging information across many tasks. It has been used extensively on various data modalities, including electronic health record (EHR) data. However, despite significant use on EHR data, there has been little systematic investigation of the utility of MTL across the diverse set of possible tasks a…

    Submitted 20 July, 2020; originally announced July 2020.

  9. arXiv:2006.15229  [pdf, other]

    cs.LG stat.ML

    CheXpert++: Approximating the CheXpert labeler for Speed, Differentiability, and Probabilistic Output

    Authors: Matthew B. A. McDermott, Tzu Ming Harry Hsu, Wei-Hung Weng, Marzyeh Ghassemi, Peter Szolovits

    Abstract: It is often infeasible or impossible to obtain ground truth labels for medical data. To circumvent this, one may build rule-based or other expert-knowledge driven labelers to ingest data and yield silver labels absent any ground-truth training data. One popular such labeler is CheXpert, a labeler that produces diagnostic labels for chest X-ray radiology reports. CheXpert is very useful, but is rel…

    Submitted 26 June, 2020; originally announced June 2020.

    Comments: To appear at MLHC 2020

  10. arXiv:2002.01584   

    cs.LG stat.ML

    ML4H Abstract Track 2019

    Authors: Matthew B. A. McDermott, Emily Alsentzer, Sam Finlayson, Michael Oberst, Fabian Falck, Tristan Naumann, Brett K. Beaulieu-Jones, Adrian V. Dalca

    Abstract: A collection of the accepted abstracts for the Machine Learning for Health (ML4H) workshop at NeurIPS 2019. This index is not complete, as some accepted abstracts chose to opt-out of inclusion.

    Submitted 4 February, 2020; originally announced February 2020.

  11. arXiv:1912.04370  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD stat.ML

    Cross-Language Aphasia Detection using Optimal Transport Domain Adaptation

    Authors: Aparna Balagopalan, Jekaterina Novikova, Matthew B. A. McDermott, Bret Nestor, Tristan Naumann, Marzyeh Ghassemi

    Abstract: Multi-language speech datasets are scarce and often have small sample sizes in the medical domain. Robust transfer of linguistic features across languages could improve rates of early diagnosis and therapy for speakers of low-resource languages when detecting health conditions from speech. We utilize out-of-domain, unpaired, single-speaker, healthy speech data for training multiple Optimal Transpo…

    Submitted 4 December, 2019; originally announced December 2019.

    Comments: Accepted to ML4H at NeurIPS 2019

  12. arXiv:1911.10241  [pdf, other]

    q-bio.QM cs.LG stat.ML

    Cross-modal representation alignment of molecular structure and perturbation-induced transcriptional profiles

    Authors: Samuel G. Finlayson, Matthew B. A. McDermott, Alex V. Pickering, Scott L. Lipnick, Isaac S. Kohane

    Abstract: Modeling the relationship between chemical structure and molecular activity is a key goal in drug development. Many benchmark tasks have been proposed for molecular property prediction, but these tasks are generally aimed at specific, isolated biomedical properties. In this work, we propose a new cross-modal small molecule retrieval task, designed to force a model to learn to associate the structu…

    Submitted 1 October, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

    Comments: Accepted for oral presentation at the Pacific Symposium on Biocomputing, 2021

  13. arXiv:1908.00690  [pdf, other]

    cs.LG stat.ML

    Feature Robustness in Non-stationary Health Records: Caveats to Deployable Model Performance in Common Clinical Machine Learning Tasks

    Authors: Bret Nestor, Matthew B. A. McDermott, Willie Boag, Gabriela Berner, Tristan Naumann, Michael C. Hughes, Anna Goldenberg, Marzyeh Ghassemi

    Abstract: When training clinical prediction models from electronic health records (EHRs), a key concern should be a model's ability to sustain performance over time when deployed, even as care practices, database systems, and population demographics evolve. Due to de-identification requirements, however, current experimental practices for public EHR benchmarks (such as the MIMIC-III critical care dataset) a…

    Submitted 1 August, 2019; originally announced August 2019.

  14. MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III

    Authors: Shirly Wang, Matthew B. A. McDermott, Geeticka Chauhan, Michael C. Hughes, Tristan Naumann, Marzyeh Ghassemi

    Abstract: Robust machine learning relies on access to data that can be used with standardized frameworks in important tasks and the ability to develop models whose performance can be reasonably reproduced. In machine learning for healthcare, the community faces reproducibility challenges due to a lack of publicly accessible data and a lack of standardized data processing frameworks. We present MIMIC-Extract…

    Submitted 19 August, 2020; v1 submitted 18 July, 2019; originally announced July 2019.

  15. arXiv:1907.01463  [pdf, other]

    cs.LG cs.CY stat.ML

    Reproducibility in Machine Learning for Health

    Authors: Matthew B. A. McDermott, Shirly Wang, Nikki Marinsek, Rajesh Ranganath, Marzyeh Ghassemi, Luca Foschini

    Abstract: Machine learning algorithms designed to characterize, monitor, and intervene on human health (ML4H) are expected to perform safely and reliably when operating at scale, potentially outside strict human supervision. This requirement warrants a stricter attention to issues of reproducibility than other fields of machine learning. In this work, we conduct a systematic evaluation of over 100 recentl…

    Submitted 2 July, 2019; originally announced July 2019.

    Comments: Presented at the ICLR 2019 Reproducibility in Machine Learning Workshop

  16. REflex: Flexible Framework for Relation Extraction in Multiple Domains

    Authors: Geeticka Chauhan, Matthew B. A. McDermott, Peter Szolovits

    Abstract: Systematic comparison of methods for relation extraction (RE) is difficult because many experiments in the field are not described precisely enough to be completely reproducible and many papers fail to report ablation studies that would highlight the relative contributions of their various combined techniques. In this work, we build a unifying framework for RE, applying this on three highly used d…

    Submitted 20 July, 2019; v1 submitted 19 June, 2019; originally announced June 2019.

    Comments: Accepted by BioNLP 2019 at the Association for Computational Linguistics 2019

  17. arXiv:1904.03323  [pdf, other]

    cs.CL

    Publicly Available Clinical BERT Embeddings

    Authors: Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, Matthew B. A. McDermott

    Abstract: Contextual word embedding models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly-available pre-trained BERT models yet exist. In this…

    Submitted 20 June, 2019; v1 submitted 5 April, 2019; originally announced April 2019.

    Comments: Clinical Natural Language Processing (ClinicalNLP) Workshop at NAACL 2019

  18. arXiv:1811.12583  [pdf, other]

    cs.LG stat.ML

    Rethinking clinical prediction: Why machine learning must consider year of care and feature aggregation

    Authors: Bret Nestor, Matthew B. A. McDermott, Geeticka Chauhan, Tristan Naumann, Michael C. Hughes, Anna Goldenberg, Marzyeh Ghassemi

    Abstract: Machine learning for healthcare often trains models on de-identified datasets with randomly-shifted calendar dates, ignoring the fact that data were generated under hospital operation practices that change over time. These changing practices induce definitive changes in observed data which confound evaluations which do not account for dates and limit the generalisability of date-agnostic models. I…

    Submitted 29 November, 2018; originally announced November 2018.

    Comments: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:1811.07216

    Report number: ML4H/2018/189