-
Ask Your Distribution Shift if Pre-Training is Right for You
Authors:
Benjamin Cohen-Wang,
Joshua Vendrow,
Aleksander Madry
Abstract:
Pre-training is a widely used approach to develop models that are robust to distribution shifts. However, in practice, its effectiveness varies: fine-tuning a pre-trained model improves robustness significantly in some cases but not at all in others (compared to training from scratch). In this work, we seek to characterize the failure modes that pre-training can and cannot address. In particular, we focus on two possible failure modes of models under distribution shift: poor extrapolation (e.g., they cannot generalize to a different domain) and biases in the training data (e.g., they rely on spurious features). Our study suggests that, as a rule of thumb, pre-training can help mitigate poor extrapolation but not dataset biases. After providing theoretical motivation and empirical evidence for this finding, we explore two of its implications for developing robust models: (1) pre-training and interventions designed to prevent exploiting biases have complementary robustness benefits, and (2) fine-tuning on a (very) small, non-diverse but de-biased dataset can result in significantly more robust models than fine-tuning on a large and diverse but biased dataset. Code is available at https://github.com/MadryLab/pretraining-distribution-shift-robustness.
Submitted 29 February, 2024;
originally announced March 2024.
-
The Journey, Not the Destination: How Data Guides Diffusion Models
Authors:
Kristian Georgiev,
Joshua Vendrow,
Hadi Salman,
Sung Min Park,
Aleksander Madry
Abstract:
Diffusion models trained on large datasets can synthesize photo-realistic images of remarkable quality and diversity. However, attributing these images back to the training data (that is, identifying specific training examples which caused an image to be generated) remains a challenge. In this paper, we propose a framework that: (i) provides a formal notion of data attribution in the context of diffusion models, and (ii) allows us to counterfactually validate such attributions. Then, we provide a method for computing these attributions efficiently. Finally, we apply our method to find (and evaluate) such attributions for denoising diffusion probabilistic models trained on CIFAR-10 and latent diffusion models trained on MS COCO. We provide code at https://github.com/MadryLab/journey-TRAK.
Submitted 11 December, 2023;
originally announced December 2023.
-
Neural Nonnegative Matrix Factorization for Hierarchical Multilayer Topic Modeling
Authors:
Tyler Will,
Runyu Zhang,
Eli Sadovnik,
Mengdi Gao,
Joshua Vendrow,
Jamie Haddock,
Denali Molitor,
Deanna Needell
Abstract:
We introduce a new method based on nonnegative matrix factorization, Neural NMF, for detecting latent hierarchical structure in data. Datasets with hierarchical structure arise in a wide variety of fields, such as document classification, image processing, and bioinformatics. Neural NMF recursively applies NMF in layers to discover overarching topics encompassing the lower-level features. We derive a backpropagation optimization scheme that allows us to frame hierarchical NMF as a neural network. We test Neural NMF on a synthetic hierarchical dataset, the 20 Newsgroups dataset, and the MyLymeData symptoms dataset. Numerical results demonstrate that Neural NMF outperforms other hierarchical NMF methods on these data sets and offers better learned hierarchical structure and interpretability of topics.
Submitted 28 February, 2023;
originally announced March 2023.
-
Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation
Authors:
Joshua Vendrow,
Saachi Jain,
Logan Engstrom,
Aleksander Madry
Abstract:
Distribution shift is a major source of failure for machine learning models. However, evaluating model reliability under distribution shift can be challenging, especially since it may be difficult to acquire counterfactual examples that exhibit a specified shift. In this work, we introduce the notion of a dataset interface: a framework that, given an input dataset and a user-specified shift, returns instances from that input distribution that exhibit the desired shift. We study a number of natural implementations for such an interface, and find that they often introduce confounding shifts that complicate model evaluation. Motivated by this, we propose a dataset interface implementation that leverages Textual Inversion to tailor generation to the input distribution. We then demonstrate how applying this dataset interface to the ImageNet dataset enables studying model behavior across a diverse array of distribution shifts, including variations in background, lighting, and attributes of the objects. Code available at https://github.com/MadryLab/dataset-interfaces.
Submitted 19 June, 2023; v1 submitted 15 February, 2023;
originally announced February 2023.
-
Automatic Infectious Disease Classification Analysis with Concept Discovery
Authors:
Elena Sizikova,
Joshua Vendrow,
Xu Cao,
Rachel Grotheer,
Jamie Haddock,
Lara Kassab,
Alona Kryshchenko,
Thomas Merkh,
R. W. M. A. Madushani,
Kenny Moise,
Annie Ulichney,
Huy V. Vo,
Chuntian Wang,
Megan Coffee,
Kathryn Leonard,
Deanna Needell
Abstract:
Automatic infectious disease classification from images can facilitate needed medical diagnoses. Such an approach can identify diseases, like tuberculosis, which remain under-diagnosed due to resource constraints, and also novel and emerging diseases, like monkeypox, which clinicians have little experience or acumen in diagnosing. Avoiding missed or delayed diagnoses would prevent further transmission and improve clinical outcomes. In order to understand and trust neural network predictions, analysis of learned representations is necessary. In this work, we argue that automatic discovery of concepts, i.e., human interpretable attributes, allows for a deep understanding of learned information in medical image analysis tasks, generalizing beyond the training labels or protocols. We provide an overview of existing concept discovery approaches in the medical imaging and computer vision communities, and evaluate representative methods on tuberculosis (TB) prediction and monkeypox prediction tasks. Finally, we propose NMFx, a general NMF formulation of interpretability by concept discovery that works in a unified way in unsupervised, weakly supervised, and supervised scenarios.
Submitted 14 November, 2022; v1 submitted 28 August, 2022;
originally announced September 2022.
-
A Generalized Hierarchical Nonnegative Tensor Decomposition
Authors:
Joshua Vendrow,
Jamie Haddock,
Deanna Needell
Abstract:
Nonnegative matrix factorization (NMF) has found many applications including topic modeling and document analysis. Hierarchical NMF (HNMF) variants are able to learn topics at various levels of granularity and illustrate their hierarchical relationship. Recently, nonnegative tensor factorization (NTF) methods have been applied in a similar fashion in order to handle data sets with complex, multi-modal structure. Hierarchical NTF (HNTF) methods have been proposed; however, these methods do not naturally generalize their matrix-based counterparts. Here, we propose a new HNTF model which directly generalizes a special case of an HNMF model, and provide a supervised extension. We also provide a multiplicative updates training method for this model. Our experimental results show that this model more naturally illuminates the topic hierarchy than previous HNMF and HNTF methods.
Submitted 15 February, 2022; v1 submitted 29 September, 2021;
originally announced September 2021.
-
Analysis of Legal Documents via Non-negative Matrix Factorization Methods
Authors:
Ryan Budahazy,
Lu Cheng,
Yihuan Huang,
Andrew Johnson,
Pengyu Li,
Joshua Vendrow,
Zhoutong Wu,
Denali Molitor,
Elizaveta Rebrova,
Deanna Needell
Abstract:
The California Innocence Project (CIP), a clinical law school program aiming to free wrongfully convicted prisoners, evaluates thousands of pieces of mail containing new requests for assistance and corresponding case files. Processing and interpreting this large amount of information presents a significant challenge for CIP officials, which can be successfully aided by topic modeling techniques. In this paper, we apply the Non-negative Matrix Factorization (NMF) method and various offshoots of it to the important and previously unstudied data set compiled by CIP. We identify underlying topics of existing case files and classify request files by crime type and case status (decision type). The results uncover the semantic structure of current case files and can provide CIP officials with a general understanding of newly received case files before further examination. We also provide an exposition of popular variants of NMF with their experimental results and discuss the benefits and drawbacks of each variant through the real-world application.
Submitted 6 November, 2021; v1 submitted 28 April, 2021;
originally announced April 2021.
-
Learning low-rank latent mesoscale structures in networks
Authors:
Hanbaek Lyu,
Yacoub H. Kureh,
Joshua Vendrow,
Mason A. Porter
Abstract:
It is common to use networks to encode the architecture of interactions between entities in complex systems in the physical, biological, social, and information sciences. To study the large-scale behavior of complex systems, it is useful to examine mesoscale structures in networks as building blocks that influence such behavior. We present a new approach for describing low-rank mesoscale structures in networks, and we illustrate our approach using several synthetic network models and empirical friendship, collaboration, and protein-protein interaction (PPI) networks. We find that these networks possess a relatively small number of 'latent motifs' that together can successfully approximate most subgraphs of a network at a fixed mesoscale. We use an algorithm for 'network dictionary learning' (NDL), which combines a network-sampling method and nonnegative matrix factorization, to learn the latent motifs of a given network. The ability to encode a network using a set of latent motifs has a wide variety of applications to network-analysis tasks, such as comparison, denoising, and edge inference. Additionally, using a new network denoising and reconstruction (NDR) algorithm, we demonstrate how to denoise a corrupted network by using only the latent motifs that one learns directly from the corrupted network.
Submitted 13 July, 2023; v1 submitted 13 February, 2021;
originally announced February 2021.
-
Learning to predict synchronization of coupled oscillators on randomly generated graphs
Authors:
Hardeep Bassi,
Richard Yim,
Rohith Kodukula,
Joshua Vendrow,
Cherlin Zhu,
Hanbaek Lyu
Abstract:
Suppose we are given a system of coupled oscillators on an unknown graph along with the trajectory of the system during some period. Can we predict whether the system will eventually synchronize? Even with a known underlying graph structure, this is an important yet analytically intractable question in general. In this work, we take an alternative approach to the synchronization prediction problem by viewing it as a classification problem based on the fact that any given system will eventually synchronize or converge to a non-synchronizing limit cycle. By only using some basic statistics of the underlying graphs such as edge density and diameter, our method can achieve perfect accuracy when there is a significant difference in the topology of the underlying graphs between the synchronizing and the non-synchronizing examples. However, in the problem setting where these graph statistics cannot distinguish the two classes very well (e.g., when the graphs are generated from the same random graph model), we find that pairing a few iterations of the initial dynamics along with the graph statistics as the input to our classification algorithms can lead to significant improvement in accuracy, far exceeding what is known by the classical oscillator theory. More surprisingly, we find that in almost all such settings, dropping out the basic graph statistics and training our algorithms with only initial dynamics achieves nearly the same accuracy. We demonstrate our method on three models of continuous and discrete coupled oscillators: the Kuramoto model, Firefly Cellular Automata, and Greenberg-Hastings model. Finally, we also propose an "ensemble prediction" algorithm that successfully scales our method to large graphs by training on dynamics observed from multiple random subgraphs.
Submitted 23 August, 2022; v1 submitted 27 December, 2020;
originally announced December 2020.
-
On a Guided Nonnegative Matrix Factorization
Authors:
Joshua Vendrow,
Jamie Haddock,
Elizaveta Rebrova,
Deanna Needell
Abstract:
Fully unsupervised topic models have found fantastic success in document clustering and classification. However, these models often suffer from the tendency to learn less-than-meaningful or even redundant topics when the data is biased towards a set of features. For this reason, we propose an approach based upon the nonnegative matrix factorization (NMF) model, deemed Guided NMF, that incorporates user-designed seed word supervision. Our experimental results demonstrate the promise of this model and illustrate that it is competitive with other methods of this ilk with only very little supervision information.
Submitted 5 February, 2021; v1 submitted 21 October, 2020;
originally announced October 2020.
-
Feature Selection on Lyme Disease Patient Survey Data
Authors:
Joshua Vendrow,
Jamie Haddock,
Deanna Needell,
Lorraine Johnson
Abstract:
Lyme disease is a rapidly growing illness that remains poorly understood within the medical community. Critical questions about when and why patients respond to treatment or stay ill, what kinds of treatments are effective, and even how to properly diagnose the disease remain largely unanswered. We investigate these questions by applying machine learning techniques to a large-scale Lyme disease patient registry, MyLymeData, developed by the nonprofit LymeDisease.org. We apply various machine learning methods in order to measure the effect of individual features in predicting participants' answers to the Global Rating of Change (GROC) survey questions that assess the self-reported degree to which their condition improved, worsened, or remained unchanged following antibiotic treatment. We use basic linear regression, support vector machines, neural networks, entropy-based decision tree models, and k-nearest neighbors approaches. We first analyze the general performance of the model and then identify the most important features for predicting participant answers to GROC. After we identify the "key" features, we separate them from the dataset and demonstrate the effectiveness of these features at identifying GROC. In doing so, we highlight possible directions for future study both mathematically and clinically.
Submitted 24 August, 2020;
originally announced September 2020.