-
What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability
Authors:
John Kirchenbauer,
Jacob Oaks,
Eric Heim
Abstract:
Classifier calibration has received recent attention from the machine learning community due both to its practical utility in facilitating decision making, as well as the observation that modern neural network classifiers are poorly calibrated. Much of this focus has been towards the goal of learning classifiers such that their output with largest magnitude (the "predicted class") is calibrated. H…
▽ More
Classifier calibration has received recent attention from the machine learning community due both to its practical utility in facilitating decision making, as well as the observation that modern neural network classifiers are poorly calibrated. Much of this focus has been towards the goal of learning classifiers such that their output with largest magnitude (the "predicted class") is calibrated. However, this narrow interpretation of classifier outputs does not adequately capture the variety of practical use cases in which classifiers can aid in decision making. In this work, we argue that more expressive metrics must be developed that accurately measure calibration error for the specific context in which a classifier will be deployed. To this end, we derive a number of different metrics using a generalization of Expected Calibration Error (ECE) that measure calibration error under different definitions of reliability. We then provide an extensive empirical evaluation of commonly used neural network architectures and calibration techniques with respect to these metrics. We find that: 1) definitions of ECE that focus solely on the predicted class fail to accurately measure calibration error under a selection of practically useful definitions of reliability and 2) many common calibration techniques fail to improve calibration performance uniformly across ECE metrics derived from these diverse definitions of reliability.
△ Less
Submitted 23 May, 2022;
originally announced May 2022.
-
Measuring AI Systems Beyond Accuracy
Authors:
Violet Turri,
Rachel Dzombak,
Eric Heim,
Nathan VanHoudnos,
Jay Palat,
Anusha Sinha
Abstract:
Current test and evaluation (T&E) methods for assessing machine learning (ML) system performance often rely on incomplete metrics. Testing is additionally often siloed from the other phases of the ML system lifecycle. Research investigating cross-domain approaches to ML T&E is needed to drive the state of the art forward and to build an Artificial Intelligence (AI) engineering discipline. This pap…
▽ More
Current test and evaluation (T&E) methods for assessing machine learning (ML) system performance often rely on incomplete metrics. Testing is additionally often siloed from the other phases of the ML system lifecycle. Research investigating cross-domain approaches to ML T&E is needed to drive the state of the art forward and to build an Artificial Intelligence (AI) engineering discipline. This paper advocates for a robust, integrated approach to testing by outlining six key questions for guiding a holistic T&E strategy.
△ Less
Submitted 7 April, 2022;
originally announced April 2022.
-
Proceedings of NeurIPS 2020 Workshop on Artificial Intelligence for Humanitarian Assistance and Disaster Response
Authors:
Ritwik Gupta,
Eric T. Heim,
Edoardo Nemni
Abstract:
These are the "proceedings" of the 2nd AI + HADR workshop which was held virtually on December 12, 2020 as part of the Neural Information Processing Systems conference. These are non-archival and merely serve as a way to collate all the papers accepted to the workshop.
These are the "proceedings" of the 2nd AI + HADR workshop which was held virtually on December 12, 2020 as part of the Neural Information Processing Systems conference. These are non-archival and merely serve as a way to collate all the papers accepted to the workshop.
△ Less
Submitted 7 December, 2020; v1 submitted 3 December, 2020;
originally announced December 2020.
-
Proceedings of NeurIPS 2019 Workshop on Artificial Intelligence for Humanitarian Assistance and Disaster Response
Authors:
Ritwik Gupta,
Eric T. Heim
Abstract:
These are the "proceedings" of the 1st AI + HADR workshop which was held in Vancouver, Canada on December 13, 2019 as part of the Neural Information Processing Systems conference. These are non-archival and serve solely as a collation of all the papers accepted to the workshop.
These are the "proceedings" of the 1st AI + HADR workshop which was held in Vancouver, Canada on December 13, 2019 as part of the Neural Information Processing Systems conference. These are non-archival and serve solely as a collation of all the papers accepted to the workshop.
△ Less
Submitted 3 December, 2020; v1 submitted 2 December, 2020;
originally announced December 2020.
-
Factor Analysis on Citation, Using a Combined Latent and Logistic Regression Model
Authors:
Namjoon Suh,
Xiaoming Huo,
Eric Heim,
Lee Seversky
Abstract:
We propose a combined model, which integrates the latent factor model and the logistic regression model, for the citation network. It is noticed that neither a latent factor model nor a logistic regression model alone is sufficient to capture the structure of the data. The proposed model has a latent (i.e., factor analysis) model to represents the main technological trends (a.k.a., factors), and a…
▽ More
We propose a combined model, which integrates the latent factor model and the logistic regression model, for the citation network. It is noticed that neither a latent factor model nor a logistic regression model alone is sufficient to capture the structure of the data. The proposed model has a latent (i.e., factor analysis) model to represents the main technological trends (a.k.a., factors), and adds a sparse component that captures the remaining ad-hoc dependence. Parameter estimation is carried out through the construction of a joint-likelihood function of edges and properly chosen penalty terms. The convexity of the objective function allows us to develop an efficient algorithm, while the penalty terms push towards a low-dimensional latent component and a sparse graphical structure. Simulation results show that the proposed method works well in practical situations. The proposed method has been applied to a real application, which contains a citation network of statisticians (Ji and **, 2016). Some interesting findings are reported.
△ Less
Submitted 1 December, 2019;
originally announced December 2019.
-
xBD: A Dataset for Assessing Building Damage from Satellite Imagery
Authors:
Ritwik Gupta,
Richard Hosfelt,
Sandra Sajeev,
Nirav Patel,
Bryce Goodman,
Jigar Doshi,
Eric Heim,
Howie Choset,
Matthew Gaston
Abstract:
We present xBD, a new, large-scale dataset for the advancement of change detection and building damage assessment for humanitarian assistance and disaster recovery research. Natural disaster response requires an accurate understanding of damaged buildings in an affected region. Current response strategies require in-person damage assessments within 24-48 hours of a disaster. Massive potential exis…
▽ More
We present xBD, a new, large-scale dataset for the advancement of change detection and building damage assessment for humanitarian assistance and disaster recovery research. Natural disaster response requires an accurate understanding of damaged buildings in an affected region. Current response strategies require in-person damage assessments within 24-48 hours of a disaster. Massive potential exists for using aerial imagery combined with computer vision algorithms to assess damage and reduce the potential danger to human life. In collaboration with multiple disaster response agencies, xBD provides pre- and post-event satellite imagery across a variety of disaster events with building polygons, ordinal labels of damage level, and corresponding satellite metadata. Furthermore, the dataset contains bounding boxes and labels for environmental factors such as fire, water, and smoke. xBD is the largest building damage assessment dataset to date, containing 850,736 building annotations across 45,362 km\textsuperscript{2} of imagery.
△ Less
Submitted 21 November, 2019;
originally announced November 2019.
-
Constrained Generative Adversarial Networks for Interactive Image Generation
Authors:
Eric Heim
Abstract:
Generative Adversarial Networks (GANs) have received a great deal of attention due in part to recent success in generating original, high-quality samples from visual domains. However, most current methods only allow for users to guide this image generation process through limited interactions. In this work we develop a novel GAN framework that allows humans to be "in-the-loop" of the image generat…
▽ More
Generative Adversarial Networks (GANs) have received a great deal of attention due in part to recent success in generating original, high-quality samples from visual domains. However, most current methods only allow for users to guide this image generation process through limited interactions. In this work we develop a novel GAN framework that allows humans to be "in-the-loop" of the image generation process. Our technique iteratively accepts relative constraints of the form "Generate an image more like image A than image B". After each constraint is given, the user is presented with new outputs from the GAN, informing the next round of feedback. This feedback is used to constrain the output of the GAN with respect to an underlying semantic space that can be designed to model a variety of different notions of similarity (e.g. classes, attributes, object relationships, color, etc.). In our experiments, we show that our GAN framework is able to generate images that are of comparable quality to equivalent unsupervised GANs while satisfying a large number of the constraints provided by users, effectively changing a GAN into one that allows users interactive control over image generation without sacrificing image quality.
△ Less
Submitted 3 April, 2019;
originally announced April 2019.
-
Exploiting Class Learnability in Noisy Data
Authors:
Matthew Klawonn,
Eric Heim,
James Hendler
Abstract:
In many domains, collecting sufficient labeled training data for supervised machine learning requires easily accessible but noisy sources, such as crowdsourcing services or tagged Web data. Noisy labels occur frequently in data sets harvested via these means, sometimes resulting in entire classes of data on which learned classifiers generalize poorly. For real world applications, we argue that it…
▽ More
In many domains, collecting sufficient labeled training data for supervised machine learning requires easily accessible but noisy sources, such as crowdsourcing services or tagged Web data. Noisy labels occur frequently in data sets harvested via these means, sometimes resulting in entire classes of data on which learned classifiers generalize poorly. For real world applications, we argue that it can be beneficial to avoid training on such classes entirely. In this work, we aim to explore the classes in a given data set, and guide supervised training to spend time on a class proportional to its learnability. By focusing the training process, we aim to improve model generalization on classes with a strong signal. To that end, we develop an online algorithm that works in conjunction with classifier and training algorithm, iteratively selecting training data for the classifier based on how well it appears to generalize on each class. Testing our approach on a variety of data sets, we show our algorithm learns to focus on classes for which the model has low generalization error relative to strong baselines, yielding a classifier with good performance on learnable classes.
△ Less
Submitted 15 November, 2018;
originally announced November 2018.
-
Generating Triples with Adversarial Networks for Scene Graph Construction
Authors:
Matthew Klawonn,
Eric Heim
Abstract:
Driven by successes in deep learning, computer vision research has begun to move beyond object detection and image classification to more sophisticated tasks like image captioning or visual question answering. Motivating such endeavors is the desire for models to capture not only objects present in an image, but more fine-grained aspects of a scene such as relationships between objects and their a…
▽ More
Driven by successes in deep learning, computer vision research has begun to move beyond object detection and image classification to more sophisticated tasks like image captioning or visual question answering. Motivating such endeavors is the desire for models to capture not only objects present in an image, but more fine-grained aspects of a scene such as relationships between objects and their attributes. Scene graphs provide a formal construct for capturing these aspects of an image. Despite this, there have been only a few recent efforts to generate scene graphs from imagery. Previous works limit themselves to settings where bounding box information is available at train time and do not attempt to generate scene graphs with attributes. In this paper we propose a method, based on recent advancements in Generative Adversarial Networks, to overcome these deficiencies. We take the approach of first generating small subgraphs, each describing a single statement about a scene from a specific region of the input image chosen using an attention mechanism. By doing so, our method is able to produce portions of the scene graphs with attribute information without the need for bounding box labels. Then, the complete scene graph is constructed from these subgraphs. We show that our model improves upon prior work in scene graph generation on state-of-the-art data sets and accepted metrics. Further, we demonstrate that our model is capable of handling a larger vocabulary size than prior work has attempted.
△ Less
Submitted 7 February, 2018;
originally announced February 2018.
-
Clickstream analysis for crowd-based object segmentation with confidence
Authors:
Eric Heim,
Alexander Seitel,
Jonas Andrulis,
Fabian Isensee,
Christian Stock,
Tobias Ross,
Lena Maier-Hein
Abstract:
With the rapidly increasing interest in machine learning based solutions for automatic image annotation, the availability of reference annotations for algorithm training is one of the major bottlenecks in the field. Crowdsourcing has evolved as a valuable option for low-cost and large-scale data annotation; however, quality control remains a major issue which needs to be addressed. To our knowledg…
▽ More
With the rapidly increasing interest in machine learning based solutions for automatic image annotation, the availability of reference annotations for algorithm training is one of the major bottlenecks in the field. Crowdsourcing has evolved as a valuable option for low-cost and large-scale data annotation; however, quality control remains a major issue which needs to be addressed. To our knowledge, we are the first to analyze the annotation process to improve crowd-sourced image segmentation. Our method involves training a regressor to estimate the quality of a segmentation from the annotator's clickstream data. The quality estimation can be used to identify spam and weight individual annotations by their (estimated) quality when merging multiple segmentations of one image. Using a total of 29,000 crowd annotations performed on publicly available data of different object classes, we show that (1) our method is highly accurate in estimating the segmentation quality based on clickstream data, (2) outperforms state-of-the-art methods for merging multiple annotations. As the regressor does not need to be trained on the object class that it is applied to it can be regarded as a low-cost option for quality control and confidence analysis in the context of crowd-based image annotation.
△ Less
Submitted 29 November, 2017; v1 submitted 25 November, 2016;
originally announced November 2016.
-
Active Perceptual Similarity Modeling with Auxiliary Information
Authors:
Eric Heim,
Matthew Berger,
Lee Seversky,
Milos Hauskrecht
Abstract:
Learning a model of perceptual similarity from a collection of objects is a fundamental task in machine learning underlying numerous applications. A common way to learn such a model is from relative comparisons in the form of triplets: responses to queries of the form "Is object a more similar to b than it is to c?". If no consideration is made in the determination of which queries to ask, existin…
▽ More
Learning a model of perceptual similarity from a collection of objects is a fundamental task in machine learning underlying numerous applications. A common way to learn such a model is from relative comparisons in the form of triplets: responses to queries of the form "Is object a more similar to b than it is to c?". If no consideration is made in the determination of which queries to ask, existing similarity learning methods can require a prohibitively large number of responses. In this work, we consider the problem of actively learning from triplets -finding which queries are most useful for learning. Different from previous active triplet learning approaches, we incorporate auxiliary information into our similarity model and introduce an active learning scheme to find queries that are informative for quickly learning both the relevant aspects of auxiliary data and the directly-learned similarity components. Compared to prior approaches, we show that we can learn just as effectively with much fewer queries. For evaluation, we introduce a new dataset of exhaustive triplet comparisons obtained from humans and demonstrate improved performance for different types of auxiliary information.
△ Less
Submitted 6 November, 2015;
originally announced November 2015.
-
Sparse Multidimensional Patient Modeling using Auxiliary Confidence Labels
Authors:
Eric Heim,
Milos Hauskrecht
Abstract:
In this work, we focus on the problem of learning a classification model that performs inference on patient Electronic Health Records (EHRs). Often, a large amount of costly expert supervision is required to learn such a model. To reduce this cost, we obtain confidence labels that indicate how sure an expert is in the class labels she provides. If meaningful confidence information can be incorpora…
▽ More
In this work, we focus on the problem of learning a classification model that performs inference on patient Electronic Health Records (EHRs). Often, a large amount of costly expert supervision is required to learn such a model. To reduce this cost, we obtain confidence labels that indicate how sure an expert is in the class labels she provides. If meaningful confidence information can be incorporated into a learning method, fewer patient instances may need to be labeled to learn an accurate model. In addition, while accuracy of predictions is important for any inference model, a model of patients must be interpretable so that clinicians can understand how the model is making decisions. To these ends, we develop a novel metric learning method called Confidence bAsed MEtric Learning (CAMEL) that supports inclusion of confidence labels, but also emphasizes interpretability in three ways. First, our method induces sparsity, thus producing simple models that use only a few features from patient EHRs. Second, CAMEL naturally produces confidence scores that can be taken into consideration when clinicians make treatment decisions. Third, the metrics learned by CAMEL induce multidimensional spaces where each dimension represents a different "factor" that clinicians can use to assess patients. In our experimental evaluation, we show on a real-world clinical data set that our CAMEL methods are able to learn models that are as or more accurate as other methods that use the same supervision. Furthermore, we show that when CAMEL uses confidence scores it is able to learn models as or more accurate as others we tested while using only 10% of the training instances. Finally, we perform qualitative assessments on the metrics learned by CAMEL and show that they identify and clearly articulate important factors in how the model performs inference.
△ Less
Submitted 28 July, 2015;
originally announced July 2015.
-
Efficient Online Relative Comparison Kernel Learning
Authors:
Eric Heim,
Matthew Berger,
Lee M. Seversky,
Milos Hauskrecht
Abstract:
Learning a kernel matrix from relative comparison human feedback is an important problem with applications in collaborative filtering, object retrieval, and search. For learning a kernel over a large number of objects, existing methods face significant scalability issues inhibiting the application of these methods to settings where a kernel is learned in an online and timely fashion. In this paper…
▽ More
Learning a kernel matrix from relative comparison human feedback is an important problem with applications in collaborative filtering, object retrieval, and search. For learning a kernel over a large number of objects, existing methods face significant scalability issues inhibiting the application of these methods to settings where a kernel is learned in an online and timely fashion. In this paper we propose a novel framework called Efficient online Relative comparison Kernel LEarning (ERKLE), for efficiently learning the similarity of a large set of objects in an online manner. We learn a kernel from relative comparisons via stochastic gradient descent, one query response at a time, by taking advantage of the sparse and low-rank properties of the gradient to efficiently restrict the kernel to lie in the space of positive semidefinite matrices. In addition, we derive a passive-aggressive online update for minimally satisfying new relative comparisons as to not disrupt the influence of previously obtained comparisons. Experimentally, we demonstrate a considerable improvement in speed while obtaining improved or comparable accuracy compared to current methods in the online learning setting.
△ Less
Submitted 12 January, 2015; v1 submitted 6 January, 2015;
originally announced January 2015.
-
Relative Comparison Kernel Learning with Auxiliary Kernels
Authors:
Eric Heim,
Hamed Valizadegan,
Milos Hauskrecht
Abstract:
In this work we consider the problem of learning a positive semidefinite kernel matrix from relative comparisons of the form: "object A is more similar to object B than it is to C", where comparisons are given by humans. Existing solutions to this problem assume many comparisons are provided to learn a high quality kernel. However, this can be considered unrealistic for many real-world tasks since…
▽ More
In this work we consider the problem of learning a positive semidefinite kernel matrix from relative comparisons of the form: "object A is more similar to object B than it is to C", where comparisons are given by humans. Existing solutions to this problem assume many comparisons are provided to learn a high quality kernel. However, this can be considered unrealistic for many real-world tasks since relative assessments require human input, which is often costly or difficult to obtain. Because of this, only a limited number of these comparisons may be provided. In this work, we explore methods for aiding the process of learning a kernel with the help of auxiliary kernels built from more easily extractable information regarding the relationships among objects. We propose a new kernel learning approach in which the target kernel is defined as a conic combination of auxiliary kernels and a kernel whose elements are learned directly. We formulate a convex optimization to solve for this target kernel that adds only minor overhead to methods that use no auxiliary information. Empirical results show that in the presence of few training relative comparisons, our method can learn kernels that generalize to more out-of-sample comparisons than methods that do not utilize auxiliary information, as well as similar methods that learn metrics over objects.
△ Less
Submitted 15 April, 2014; v1 submitted 2 September, 2013;
originally announced September 2013.