What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability

Authors: John Kirchenbauer, Jacob Oaks, Eric Heim

Abstract: Classifier calibration has received recent attention from the machine learning community due both to its practical utility in facilitating decision making and to the observation that modern neural network classifiers are poorly calibrated. Much of this focus has been towards the goal of learning classifiers such that their output with largest magnitude (the "predicted class") is calibrated. However, this narrow interpretation of classifier outputs does not adequately capture the variety of practical use cases in which classifiers can aid in decision making. In this work, we argue that more expressive metrics must be developed that accurately measure calibration error for the specific context in which a classifier will be deployed. To this end, we derive a number of different metrics using a generalization of Expected Calibration Error (ECE) that measure calibration error under different definitions of reliability. We then provide an extensive empirical evaluation of commonly used neural network architectures and calibration techniques with respect to these metrics. We find that: 1) definitions of ECE that focus solely on the predicted class fail to accurately measure calibration error under a selection of practically useful definitions of reliability and 2) many common calibration techniques fail to improve calibration performance uniformly across ECE metrics derived from these diverse definitions of reliability.

Submitted 23 May, 2022; originally announced May 2022.

Comments: Accepted in the ICLR 2022 Machine Learning Evaluation Standards Workshop

arXiv:2204.04211 [pdf]

Measuring AI Systems Beyond Accuracy

Authors: Violet Turri, Rachel Dzombak, Eric Heim, Nathan VanHoudnos, Jay Palat, Anusha Sinha

Abstract: Current test and evaluation (T&E) methods for assessing machine learning (ML) system performance often rely on incomplete metrics. Testing is additionally often siloed from the other phases of the ML system lifecycle. Research investigating cross-domain approaches to ML T&E is needed to drive the state of the art forward and to build an Artificial Intelligence (AI) engineering discipline. This paper advocates for a robust, integrated approach to testing by outlining six key questions for guiding a holistic T&E strategy.

Submitted 7 April, 2022; originally announced April 2022.

Comments: 8 pages, Presented at 2022 AAAI Spring Symposium Series Workshop on AI Engineering: Creating Scalable, Human-Centered and Robust AI Systems

arXiv:2012.02108

Proceedings of NeurIPS 2020 Workshop on Artificial Intelligence for Humanitarian Assistance and Disaster Response

Authors: Ritwik Gupta, Eric T. Heim, Edoardo Nemni

Abstract: These are the "proceedings" of the 2nd AI + HADR workshop which was held virtually on December 12, 2020 as part of the Neural Information Processing Systems conference. These are non-archival and merely serve as a way to collate all the papers accepted to the workshop.

Submitted 7 December, 2020; v1 submitted 3 December, 2020; originally announced December 2020.

arXiv:2012.01022

Proceedings of NeurIPS 2019 Workshop on Artificial Intelligence for Humanitarian Assistance and Disaster Response

Authors: Ritwik Gupta, Eric T. Heim

Abstract: These are the "proceedings" of the 1st AI + HADR workshop which was held in Vancouver, Canada on December 13, 2019 as part of the Neural Information Processing Systems conference. These are non-archival and serve solely as a collation of all the papers accepted to the workshop.

Submitted 3 December, 2020; v1 submitted 2 December, 2020; originally announced December 2020.

arXiv:1912.00524 [pdf, other]

Factor Analysis on Citation, Using a Combined Latent and Logistic Regression Model

Authors: Namjoon Suh, Xiaoming Huo, Eric Heim, Lee Seversky

Abstract: We propose a combined model, which integrates the latent factor model and the logistic regression model, for the citation network. It is noticed that neither a latent factor model nor a logistic regression model alone is sufficient to capture the structure of the data. The proposed model has a latent (i.e., factor analysis) model to represent the main technological trends (a.k.a., factors), and adds a sparse component that captures the remaining ad-hoc dependence. Parameter estimation is carried out through the construction of a joint-likelihood function of edges and properly chosen penalty terms. The convexity of the objective function allows us to develop an efficient algorithm, while the penalty terms push towards a low-dimensional latent component and a sparse graphical structure. Simulation results show that the proposed method works well in practical situations. The proposed method has been applied to a real application, which contains a citation network of statisticians (Ji and **, 2016). Some interesting findings are reported.

Submitted 1 December, 2019; originally announced December 2019.

Comments: Citation network, matrix decomposition, latent variable model, logistic regression model, convex optimization, alternating direction method of multiplier

arXiv:1911.09296 [pdf, other]

xBD: A Dataset for Assessing Building Damage from Satellite Imagery

Authors: Ritwik Gupta, Richard Hosfelt, Sandra Sajeev, Nirav Patel, Bryce Goodman, Jigar Doshi, Eric Heim, Howie Choset, Matthew Gaston

Abstract: We present xBD, a new, large-scale dataset for the advancement of change detection and building damage assessment for humanitarian assistance and disaster recovery research. Natural disaster response requires an accurate understanding of damaged buildings in an affected region. Current response strategies require in-person damage assessments within 24-48 hours of a disaster. Massive potential exists for using aerial imagery combined with computer vision algorithms to assess damage and reduce the potential danger to human life. In collaboration with multiple disaster response agencies, xBD provides pre- and post-event satellite imagery across a variety of disaster events with building polygons, ordinal labels of damage level, and corresponding satellite metadata. Furthermore, the dataset contains bounding boxes and labels for environmental factors such as fire, water, and smoke. xBD is the largest building damage assessment dataset to date, containing 850,736 building annotations across 45,362 km² of imagery.

Submitted 21 November, 2019; originally announced November 2019.

Comments: 9 pages, 10 figures

arXiv:1904.02526 [pdf, other]

Constrained Generative Adversarial Networks for Interactive Image Generation

Authors: Eric Heim

Abstract: Generative Adversarial Networks (GANs) have received a great deal of attention due in part to recent success in generating original, high-quality samples from visual domains. However, most current methods only allow for users to guide this image generation process through limited interactions. In this work we develop a novel GAN framework that allows humans to be "in-the-loop" of the image generation process. Our technique iteratively accepts relative constraints of the form "Generate an image more like image A than image B". After each constraint is given, the user is presented with new outputs from the GAN, informing the next round of feedback. This feedback is used to constrain the output of the GAN with respect to an underlying semantic space that can be designed to model a variety of different notions of similarity (e.g. classes, attributes, object relationships, color, etc.). In our experiments, we show that our GAN framework is able to generate images that are of comparable quality to equivalent unsupervised GANs while satisfying a large number of the constraints provided by users, effectively changing a GAN into one that allows users interactive control over image generation without sacrificing image quality.

Submitted 3 April, 2019; originally announced April 2019.

Comments: To Appear in the Proceedings of the 2019 Conference on Computer Vision and Pattern Recognition

arXiv:1811.06524 [pdf, other]

Exploiting Class Learnability in Noisy Data

Authors: Matthew Klawonn, Eric Heim, James Hendler

Abstract: In many domains, collecting sufficient labeled training data for supervised machine learning requires easily accessible but noisy sources, such as crowdsourcing services or tagged Web data. Noisy labels occur frequently in data sets harvested via these means, sometimes resulting in entire classes of data on which learned classifiers generalize poorly. For real world applications, we argue that it can be beneficial to avoid training on such classes entirely. In this work, we aim to explore the classes in a given data set, and guide supervised training to spend time on a class proportional to its learnability. By focusing the training process, we aim to improve model generalization on classes with a strong signal. To that end, we develop an online algorithm that works in conjunction with classifier and training algorithm, iteratively selecting training data for the classifier based on how well it appears to generalize on each class. Testing our approach on a variety of data sets, we show our algorithm learns to focus on classes for which the model has low generalization error relative to strong baselines, yielding a classifier with good performance on learnable classes.

Submitted 15 November, 2018; originally announced November 2018.

Comments: Accepted to AAAI 2019

arXiv:1802.02598 [pdf, other]

Generating Triples with Adversarial Networks for Scene Graph Construction

Authors: Matthew Klawonn, Eric Heim

Abstract: Driven by successes in deep learning, computer vision research has begun to move beyond object detection and image classification to more sophisticated tasks like image captioning or visual question answering. Motivating such endeavors is the desire for models to capture not only objects present in an image, but more fine-grained aspects of a scene such as relationships between objects and their attributes. Scene graphs provide a formal construct for capturing these aspects of an image. Despite this, there have been only a few recent efforts to generate scene graphs from imagery. Previous works limit themselves to settings where bounding box information is available at train time and do not attempt to generate scene graphs with attributes. In this paper we propose a method, based on recent advancements in Generative Adversarial Networks, to overcome these deficiencies. We take the approach of first generating small subgraphs, each describing a single statement about a scene from a specific region of the input image chosen using an attention mechanism. By doing so, our method is able to produce portions of the scene graphs with attribute information without the need for bounding box labels. Then, the complete scene graph is constructed from these subgraphs. We show that our model improves upon prior work in scene graph generation on state-of-the-art data sets and accepted metrics. Further, we demonstrate that our model is capable of handling a larger vocabulary size than prior work has attempted.

Submitted 7 February, 2018; originally announced February 2018.

Comments: Accepted to AAAI 2018

arXiv:1611.08527 [pdf, other]

doi 10.1109/TPAMI.2017.2777967

Clickstream analysis for crowd-based object segmentation with confidence

Authors: Eric Heim, Alexander Seitel, Jonas Andrulis, Fabian Isensee, Christian Stock, Tobias Ross, Lena Maier-Hein

Abstract: With the rapidly increasing interest in machine learning based solutions for automatic image annotation, the availability of reference annotations for algorithm training is one of the major bottlenecks in the field. Crowdsourcing has evolved as a valuable option for low-cost and large-scale data annotation; however, quality control remains a major issue which needs to be addressed. To our knowledge, we are the first to analyze the annotation process to improve crowd-sourced image segmentation. Our method involves training a regressor to estimate the quality of a segmentation from the annotator's clickstream data. The quality estimation can be used to identify spam and weight individual annotations by their (estimated) quality when merging multiple segmentations of one image. Using a total of 29,000 crowd annotations performed on publicly available data of different object classes, we show that our method (1) is highly accurate in estimating the segmentation quality based on clickstream data and (2) outperforms state-of-the-art methods for merging multiple annotations. As the regressor does not need to be trained on the object class that it is applied to, it can be regarded as a low-cost option for quality control and confidence analysis in the context of crowd-based image annotation.

Submitted 29 November, 2017; v1 submitted 25 November, 2016; originally announced November 2016.

Comments: to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

arXiv:1511.02254 [pdf, other]

Active Perceptual Similarity Modeling with Auxiliary Information

Authors: Eric Heim, Matthew Berger, Lee Seversky, Milos Hauskrecht

Abstract: Learning a model of perceptual similarity from a collection of objects is a fundamental task in machine learning underlying numerous applications. A common way to learn such a model is from relative comparisons in the form of triplets: responses to queries of the form "Is object a more similar to b than it is to c?". If no consideration is made in the determination of which queries to ask, existing similarity learning methods can require a prohibitively large number of responses. In this work, we consider the problem of actively learning from triplets, that is, finding which queries are most useful for learning. Different from previous active triplet learning approaches, we incorporate auxiliary information into our similarity model and introduce an active learning scheme to find queries that are informative for quickly learning both the relevant aspects of auxiliary data and the directly-learned similarity components. Compared to prior approaches, we show that we can learn just as effectively with far fewer queries. For evaluation, we introduce a new dataset of exhaustive triplet comparisons obtained from humans and demonstrate improved performance for different types of auxiliary information.

Submitted 6 November, 2015; originally announced November 2015.

arXiv:1507.07955 [pdf, ps, other]

Sparse Multidimensional Patient Modeling using Auxiliary Confidence Labels

Authors: Eric Heim, Milos Hauskrecht

Abstract: In this work, we focus on the problem of learning a classification model that performs inference on patient Electronic Health Records (EHRs). Often, a large amount of costly expert supervision is required to learn such a model. To reduce this cost, we obtain confidence labels that indicate how sure an expert is in the class labels she provides. If meaningful confidence information can be incorporated into a learning method, fewer patient instances may need to be labeled to learn an accurate model. In addition, while accuracy of predictions is important for any inference model, a model of patients must be interpretable so that clinicians can understand how the model is making decisions. To these ends, we develop a novel metric learning method called Confidence bAsed MEtric Learning (CAMEL) that supports inclusion of confidence labels, but also emphasizes interpretability in three ways. First, our method induces sparsity, thus producing simple models that use only a few features from patient EHRs. Second, CAMEL naturally produces confidence scores that can be taken into consideration when clinicians make treatment decisions. Third, the metrics learned by CAMEL induce multidimensional spaces where each dimension represents a different "factor" that clinicians can use to assess patients. In our experimental evaluation, we show on a real-world clinical data set that our CAMEL methods are able to learn models that are at least as accurate as other methods that use the same supervision. Furthermore, we show that when CAMEL uses confidence scores it is able to learn models at least as accurate as the others we tested while using only 10% of the training instances. Finally, we perform qualitative assessments on the metrics learned by CAMEL and show that they identify and clearly articulate important factors in how the model performs inference.

Submitted 28 July, 2015; originally announced July 2015.

Comments: Currently under review

arXiv:1501.01242 [pdf, other]

Efficient Online Relative Comparison Kernel Learning

Authors: Eric Heim, Matthew Berger, Lee M. Seversky, Milos Hauskrecht

Abstract: Learning a kernel matrix from relative comparison human feedback is an important problem with applications in collaborative filtering, object retrieval, and search. For learning a kernel over a large number of objects, existing methods face significant scalability issues inhibiting the application of these methods to settings where a kernel is learned in an online and timely fashion. In this paper we propose a novel framework called Efficient online Relative comparison Kernel LEarning (ERKLE), for efficiently learning the similarity of a large set of objects in an online manner. We learn a kernel from relative comparisons via stochastic gradient descent, one query response at a time, by taking advantage of the sparse and low-rank properties of the gradient to efficiently restrict the kernel to lie in the space of positive semidefinite matrices. In addition, we derive a passive-aggressive online update for minimally satisfying new relative comparisons so as not to disrupt the influence of previously obtained comparisons. Experimentally, we demonstrate a considerable improvement in speed while obtaining improved or comparable accuracy compared to current methods in the online learning setting.

Submitted 12 January, 2015; v1 submitted 6 January, 2015; originally announced January 2015.

Comments: Extended version of the paper appearing in The Proceedings of the 2015 SIAM International Conference on Data Mining (SDM15)

arXiv:1309.0489 [pdf, ps, other]

Relative Comparison Kernel Learning with Auxiliary Kernels

Authors: Eric Heim, Hamed Valizadegan, Milos Hauskrecht

Abstract: In this work we consider the problem of learning a positive semidefinite kernel matrix from relative comparisons of the form: "object A is more similar to object B than it is to C", where comparisons are given by humans. Existing solutions to this problem assume many comparisons are provided to learn a high quality kernel. However, this can be considered unrealistic for many real-world tasks since relative assessments require human input, which is often costly or difficult to obtain. Because of this, only a limited number of these comparisons may be provided. In this work, we explore methods for aiding the process of learning a kernel with the help of auxiliary kernels built from more easily extractable information regarding the relationships among objects. We propose a new kernel learning approach in which the target kernel is defined as a conic combination of auxiliary kernels and a kernel whose elements are learned directly. We formulate a convex optimization to solve for this target kernel that adds only minor overhead to methods that use no auxiliary information. Empirical results show that in the presence of few training relative comparisons, our method can learn kernels that generalize to more out-of-sample comparisons than methods that do not utilize auxiliary information, as well as similar methods that learn metrics over objects.

Submitted 15 April, 2014; v1 submitted 2 September, 2013; originally announced September 2013.

Showing 1–14 of 14 results for author: Heim, E