-
Conditional score-based diffusion models for solving inverse problems in mechanics
Authors:
Agnimitra Dasgupta,
Harisankar Ramaswamy,
Javier Murgoitio Esandi,
Ken Foo,
Runze Li,
Qifa Zhou,
Brendan Kennedy,
Assad Oberai
Abstract:
We propose a framework to perform Bayesian inference using conditional score-based diffusion models to solve a class of inverse problems in mechanics involving the inference of a specimen's spatially varying material properties from noisy measurements of its mechanical response to loading. Conditional score-based diffusion models are generative models that learn to approximate the score function of a conditional distribution using samples from the joint distribution. More specifically, the score functions corresponding to multiple realizations of the measurement are approximated using a single neural network, the so-called score network, which is subsequently used to sample the posterior distribution using an appropriate Markov chain Monte Carlo scheme based on Langevin dynamics. Training the score network only requires simulating the forward model. Hence, the proposed approach can accommodate black-box forward models and complex measurement noise. Moreover, once the score network has been trained, it can be re-used to solve the inverse problem for different realizations of the measurements. We demonstrate the efficacy of the proposed approach on a suite of high-dimensional inverse problems in mechanics that involve inferring heterogeneous material properties from noisy measurements. Some examples we consider involve synthetic data, while others include data collected from actual elastography experiments. Further, our applications demonstrate that the proposed approach can handle different measurement modalities, complex patterns in the inferred quantities, non-Gaussian and non-additive noise models, and nonlinear black-box forward models. The results show that the proposed framework can solve large-scale physics-based inverse problems efficiently.
Submitted 21 June, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
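The sampling step described in the abstract above (Langevin dynamics driven by a score function) can be illustrated on a toy problem. This is only a minimal sketch, not the authors' method: instead of a trained score network it uses the analytic posterior score of a hypothetical scalar problem with prior x ~ N(0, 1) and measurement y = x + noise, and the names `posterior_score` and `langevin_sample`, the step size, and the chain length are all assumptions chosen for illustration.

```python
import math
import random


def posterior_score(x, y, noise_var):
    """Analytic score d/dx log p(x | y) for prior x ~ N(0, 1), y = x + N(0, noise_var).

    Stand-in (assumption) for a trained score network.
    """
    post_var = noise_var / (1.0 + noise_var)
    post_mean = y / (1.0 + noise_var)
    return -(x - post_mean) / post_var


def langevin_sample(y, noise_var, step=0.01, n_steps=300, rng=random):
    """Run one unadjusted Langevin chain and return its final state."""
    x = rng.gauss(0.0, 1.0)  # initialize from the prior
    for _ in range(n_steps):
        drift = step * posterior_score(x, y, noise_var)
        diffusion = math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        x = x + drift + diffusion
    return x


rng = random.Random(0)
samples = [langevin_sample(y=1.0, noise_var=0.25, rng=rng) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)
```

For this toy problem the exact posterior is Gaussian with mean 0.8 and variance 0.2, so the printed statistics should land near those values, up to Monte Carlo error and the small bias of the unadjusted discretization.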
-
Half-Space Feature Learning in Neural Networks
Authors:
Mahesh Lorik Yadav,
Harish Guruprasad Ramaswamy,
Chandrashekar Lakshminarayanan
Abstract:
There currently exist two extreme viewpoints for neural network feature learning -- (i) Neural networks simply implement a kernel method (a la NTK) and hence no features are learned (ii) Neural networks can represent (and hence learn) intricate hierarchical features suitable for the data. We argue in this paper neither interpretation is likely to be correct based on a novel viewpoint. Neural networks can be viewed as a mixture of experts, where each expert corresponds to a (number of layers length) path through a sequence of hidden units. We use this alternate interpretation to motivate a model, called the Deep Linearly Gated Network (DLGN), which sits midway between deep linear networks and ReLU networks. Unlike deep linear networks, the DLGN is capable of learning non-linear features (which are then linearly combined), and unlike ReLU networks these features are ultimately simple -- each feature is effectively an indicator function for a region compactly described as an intersection of (number of layers) half-spaces in the input space. This viewpoint allows for a comprehensive global visualization of features, unlike the local visualizations for neurons based on saliency/activation/gradient maps. Feature learning in DLGNs is shown to happen and the mechanism with which this happens is through learning half-spaces in the input space that contain smooth regions of the target function. Due to the structure of DLGNs, the neurons in later layers are fundamentally the same as those in earlier layers -- they all represent a half-space -- however, the dynamics of gradient descent impart a distinct clustering to the later layer neurons. We hypothesize that ReLU networks also have similar feature learning behaviour.
Submitted 5 April, 2024;
originally announced April 2024.
-
On the Learning Dynamics of Attention Networks
Authors:
Rahul Vashisht,
Harish G. Ramaswamy
Abstract:
Attention models are typically learned by optimizing one of three standard loss functions, variously called soft attention, hard attention, and latent variable marginal likelihood (LVML) attention. All three paradigms are motivated by the same goal of finding two models -- a 'focus' model that 'selects' the right segment of the input and a 'classification' model that processes the selected segment into the target label. However, they differ significantly in the way the selected segments are aggregated, resulting in distinct dynamics and final results. We observe a unique signature of models learned using these paradigms and explain this as a consequence of the evolution of the classification model under gradient descent when the focus model is fixed. We also analyze these paradigms in a simple setting and derive closed-form expressions for the parameter trajectory under gradient flow. With the soft attention loss, the focus model improves quickly at initialization and splutters later on. On the other hand, hard attention loss behaves in the opposite fashion. Based on our observations, we propose a simple hybrid approach that combines the advantages of the different loss functions and demonstrate it on a collection of semi-synthetic and real-world datasets.
Submitted 12 October, 2023; v1 submitted 25 July, 2023;
originally announced July 2023.
-
On the Interpretability of Attention Networks
Authors:
Lakshmi Narayan Pandey,
Rahul Vashisht,
Harish G. Ramaswamy
Abstract:
Attention mechanisms form a core component of several successful deep learning architectures, and are based on one key idea: "The output depends only on a small (but unknown) segment of the input." In several practical applications like image captioning and language translation, this is mostly true. In trained models with an attention mechanism, the outputs of an intermediate module that encodes the segment of input responsible for the output are often used as a way to peek into the 'reasoning' of the network. We make such a notion more precise for a variant of the classification problem that we term selective dependence classification (SDC) when used with attention model architectures. Under such a setting, we demonstrate various error modes where an attention model can be accurate but fail to be interpretable, and show that such models do occur as a result of training. We illustrate various situations that can accentuate and mitigate this behaviour. Finally, we use our objective definition of interpretability for SDC tasks to evaluate a few attention model learning algorithms designed to encourage sparsity and demonstrate that these algorithms help improve interpretability.
Submitted 14 May, 2023; v1 submitted 30 December, 2022;
originally announced December 2022.
-
Consistent Multiclass Algorithms for Complex Metrics and Constraints
Authors:
Harikrishna Narasimhan,
Harish G. Ramaswamy,
Shiv Kumar Tavker,
Drona Khurana,
Praneeth Netrapalli,
Shivani Agarwal
Abstract:
We present consistent algorithms for multiclass learning with complex performance metrics and constraints, where the objective and constraints are defined by arbitrary functions of the confusion matrix. This setting includes many common performance metrics such as the multiclass G-mean and micro F1-measure, and constraints such as those on the classifier's precision and recall and more recent measures of fairness discrepancy. We give a general framework for designing consistent algorithms for such complex design goals by viewing the learning problem as an optimization problem over the set of feasible confusion matrices. We provide multiple instantiations of our framework under different assumptions on the performance metrics and constraints, and in each case show rates of convergence to the optimal (feasible) classifier (and thus asymptotic consistency). Experiments on a variety of multiclass classification tasks and fairness-constrained problems show that our algorithms compare favorably to the state-of-the-art baselines.
Submitted 18 October, 2022; v1 submitted 18 October, 2022;
originally announced October 2022.
-
QTBIPOC PD: Exploring the Intersections of Race, Gender, and Sexual Orientation in Participatory Design
Authors:
Naba Rizvi,
Reggie Casanova-Perez,
Harshini Ramaswamy,
Emily Bascom,
Lisa Dirks,
Nadir Weibel
Abstract:
As Human-Computer Interaction (HCI) research aims to be inclusive and representative of many marginalized identities, there is still a lack of available literature and research on intersectional considerations of race, gender, and sexual orientation, especially when it comes to participatory design. We aim to create a space to generate community recommendations for effectively and appropriately engaging Queer, Transgender, Black, Indigenous, People of Color (QTBIPOC) populations in participatory design, and discuss methods of dissemination for recommendations. Workshop participants will engage with critical race theory, queer theory, and feminist theory to reflect on current exclusionary HCI and participatory design methods and practices.
Submitted 16 April, 2022;
originally announced April 2022.
-
Making Hidden Bias Visible: Designing a Feedback Ecosystem for Primary Care Providers
Authors:
Naba Rizvi,
Harshini Ramaswamy,
Reggie Casanova-Perez,
Andrea Hartzler,
Nadir Weibel
Abstract:
Implicit bias may perpetuate healthcare disparities for marginalized patient populations. Such bias is expressed in communication between patients and their providers. We design an ecosystem with guidance from providers to make this bias explicit in patient-provider communication. Our end users are providers seeking to improve their quality of care for patients who are Black, Indigenous, People of Color (BIPOC) and/or Lesbian, Gay, Bisexual, Transgender, and Queer (LGBTQ). We present wireframes displaying communication metrics that negatively impact patient-centered care divided into the following categories: digital nudge, dashboard, and guided reflection. Our wireframes provide quantitative, real-time, and conversational feedback promoting provider reflection on their interactions with patients. This is the first design iteration toward the development of a tool to raise providers' awareness of their own implicit biases.
Submitted 16 April, 2022;
originally announced April 2022.
-
The efficacy and generalizability of conditional GANs for posterior inference in physics-based inverse problems
Authors:
Deep Ray,
Harisankar Ramaswamy,
Dhruv V. Patel,
Assad A. Oberai
Abstract:
In this work, we train conditional Wasserstein generative adversarial networks to effectively sample from the posterior of physics-based Bayesian inference problems. The generator is constructed using a U-Net architecture, with the latent information injected using conditional instance normalization. The former facilitates a multiscale inverse map, while the latter enables the decoupling of the latent space dimension from the dimension of the measurement, and introduces stochasticity at all scales of the U-Net. We solve PDE-based inverse problems to demonstrate the performance of our approach in quantifying the uncertainty in the inferred field. Further, we show the generator can learn inverse maps which are local in nature, which in turn promotes generalizability when testing with out-of-distribution samples.
Submitted 17 November, 2022; v1 submitted 15 February, 2022;
originally announced February 2022.
-
Predicting the success of Gradient Descent for a particular Dataset-Architecture-Initialization (DAI)
Authors:
Umangi Jain,
Harish G. Ramaswamy
Abstract:
Despite their massive success, training successful deep neural networks still largely relies on experimentally choosing an architecture, hyper-parameters, initialization, and training mechanism. In this work, we focus on determining the success of the standard gradient descent method for training deep neural networks on a specified dataset, architecture, and initialization (DAI) combination. Through extensive systematic experiments, we show that the evolution of singular values of the matrix obtained from the hidden layers of a DNN can aid in determining the success of the gradient descent technique to train a DAI, even in the absence of validation labels in the supervised learning paradigm. This phenomenon can facilitate early give-up, stopping the training of neural networks which are predicted to not generalize well, early in the training process. Our experimentation across multiple datasets, architectures, and initializations reveals that the proposed scores can more accurately predict the success of a DAI than simply relying on the validation accuracy at earlier epochs to make a judgment.
Submitted 25 November, 2021;
originally announced November 2021.
-
The Effect of Super-spreader Events in Epidemics
Authors:
Harisankar Ramaswamy,
Assad A Oberai,
Mitul Luhar,
Yannis C Yortsos
Abstract:
The spread of infectious epidemics is often accelerated by super-spreader events. Understanding their effect is important, particularly in the context of standard epidemiological models, which require estimates for parameters such as $R_0$. In this letter, we show that the effective value of $R_0$ in super-spreader situations is significantly large, of the order of hundreds, suggesting a delta-function-like behavior during the event. Use of a well-mixed room model supports these findings. They elucidate infection kinetic modeling in enclosed environments, which differ from the standard SIR model, and provide expressions for $R_0$ in terms of physical and operational parameters. The overall impact of super-spreader events can be significant, depending on the state of the epidemic and how the infections generated by the event subsequently spread in the community.
Submitted 26 March, 2021; v1 submitted 3 March, 2021;
originally announced March 2021.
-
Using noise resilience for ranking generalization of deep neural networks
Authors:
Depen Morwani,
Rahul Vashisht,
Harish G. Ramaswamy
Abstract:
Recent papers have shown that sufficiently overparameterized neural networks can perfectly fit even random labels. Thus, it is crucial to understand the underlying reason behind the generalization performance of a network on real-world data. In this work, we propose several measures to predict the generalization error of a network given the training data and its parameters. Using one of these measures, based on noise resilience of the network, we secured 5th position in the predicting generalization in deep learning (PGDL) competition at NeurIPS 2020.
Submitted 16 December, 2020;
originally announced December 2020.
-
Inductive Bias of Gradient Descent for Weight Normalized Smooth Homogeneous Neural Nets
Authors:
Depen Morwani,
Harish G. Ramaswamy
Abstract:
We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss. We analyze both standard weight normalization (SWN) and exponential weight normalization (EWN), and show that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate. We extend these results to gradient descent, and establish asymptotic relations between weights and gradients for both SWN and EWN. We also show that EWN causes weights to be updated in a way that prefers asymptotic relative sparsity. For EWN, we provide a finite-time convergence rate of the loss with gradient flow and a tight asymptotic convergence rate with gradient descent. We demonstrate our results for SWN and EWN on synthetic data sets. Experimental results on simple datasets support our claim on sparse EWN solutions, even with SGD. This demonstrates its potential applications in learning neural networks amenable to pruning.
Submitted 31 January, 2023; v1 submitted 24 October, 2020;
originally announced October 2020.
-
Convex Calibrated Surrogates for the Multi-Label F-Measure
Authors:
Mingyuan Zhang,
Harish G. Ramaswamy,
Shivani Agarwal
Abstract:
The F-measure is a widely used performance measure for multi-label classification, where multiple labels can be active in an instance simultaneously (e.g. in image tagging, multiple tags can be active in any image). In particular, the F-measure explicitly balances recall (fraction of active labels predicted to be active) and precision (fraction of labels predicted to be active that are actually so), both of which are important in evaluating the overall performance of a multi-label classifier. As with most discrete prediction problems, however, directly optimizing the F-measure is computationally hard. In this paper, we explore the question of designing convex surrogate losses that are calibrated for the F-measure -- specifically, that have the property that minimizing the surrogate loss yields (in the limit of sufficient data) a Bayes optimal multi-label classifier for the F-measure. We show that the F-measure for an $s$-label problem, when viewed as a $2^s \times 2^s$ loss matrix, has rank at most $s^2+1$, and apply a result of Ramaswamy et al. (2014) to design a family of convex calibrated surrogates for the F-measure. The resulting surrogate risk minimization algorithms can be viewed as decomposing the multi-label F-measure learning problem into $s^2+1$ binary class probability estimation problems. We also provide a quantitative regret transfer bound for our surrogates, which allows any regret guarantees for the binary problems to be transferred to regret guarantees for the overall F-measure problem, and discuss a connection with the algorithm of Dembczynski et al. (2013). Our experiments confirm our theoretical findings.
Submitted 16 September, 2020;
originally announced September 2020.
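The definitions of recall, precision, and F-measure quoted in the abstract above can be made concrete for a single instance with binary label vectors. This is a minimal sketch of the metric only, not of the paper's surrogate losses; the function name `multilabel_f_measure` and the convention of returning 1.0 when a denominator is zero are assumptions.

```python
def multilabel_f_measure(y_true, y_pred):
    """Return (precision, recall, F-measure) for one instance.

    y_true and y_pred are equal-length lists of 0/1 label indicators.
    Empty denominators are scored as 1.0 (an assumed convention).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    n_true = sum(y_true)  # number of active labels
    n_pred = sum(y_pred)  # number of labels predicted active
    precision = tp / n_pred if n_pred else 1.0
    recall = tp / n_true if n_true else 1.0
    # Harmonic mean of precision and recall, written in its count form.
    f = 2 * tp / (n_true + n_pred) if (n_true + n_pred) else 1.0
    return precision, recall, f


print(multilabel_f_measure([1, 1, 0, 0], [1, 0, 1, 0]))
```

Because the F-measure depends on the whole predicted label vector at once, it cannot be written as a sum of independent per-label losses, which is what makes direct optimization hard.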
-
A comprehensive spatial-temporal infection model
Authors:
Harisankar Ramaswamy,
Assad A Oberai,
Yannis C Yortsos
Abstract:
Motivated by analogies between the spreading of human-to-human infections and of chemical processes, we develop a comprehensive model that accounts both for infection and for transport. In this analogy, the three different populations of infection models correspond to three chemical species. Areal densities emerge as the key variables, thus capturing the effect of spatial density. We derive expressions for the kinetics of the infection rates and for the important parameter R0, that include areal density and its spatial distribution. Coupled with mobility the model allows the study of various effects. We first present results for a batch reactor, the chemical process equivalent of the SIR model. Because density makes R0 a decreasing function of the process extent, the infection curves are different and smaller than for the standard SIR model. We show that the effect of the initial conditions is limited to the onset of the epidemic. We derive effective infection curves for a number of cases, including a back-and-forth commute between regions of low and high R0 environments. We then consider spatially distributed systems. We show that diffusion leads to traveling waves, which in 1-D geometries propagate at a constant speed and with a constant shape, both of which are sole functions of R0. The infection curves are slightly different than for the batch problem, as diffusion mitigates the infection intensity, thus leading to an effective lower R0. The dimensional wave speed is found to be proportional to the product of the square root of the diffusivity and of an increasing function of R0, confirming the importance of restricting mobility in arresting the propagation of infection. We examine the interaction of infection waves under various conditions and scenarios, and extend the wave propagation analysis to 2-D heterogeneous systems.
Submitted 4 December, 2020; v1 submitted 28 August, 2020;
originally announced August 2020.
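For reference, the standard SIR model that the abstract above uses as its baseline can be integrated in a few lines. This sketch covers only the classical well-mixed model with constant R0 = beta / gamma, not the density-dependent or spatial model of the paper; the function name `simulate_sir`, the forward-Euler scheme, and all parameter values are assumptions chosen for illustration.

```python
def simulate_sir(r0, gamma=0.1, i0=1e-3, dt=0.1, n_steps=3000):
    """Forward-Euler integration of the classical SIR model.

    Population fractions: s (susceptible), i (infected), r (recovered).
    Returns the final (s, i, r) and the peak infected fraction.
    """
    beta = r0 * gamma  # infection rate implied by R0 = beta / gamma
    s, i, r = 1.0 - i0, i0, 0.0
    peak_i = i
    for _ in range(n_steps):
        new_infections = beta * s * i * dt
        new_recoveries = gamma * i * dt
        s = s - new_infections
        i = i + new_infections - new_recoveries
        r = r + new_recoveries
        peak_i = max(peak_i, i)
    return s, i, r, peak_i


print(simulate_sir(r0=2.5))  # epidemic takes off
print(simulate_sir(r0=0.8))  # infection dies out without a peak
```

With R0 above 1 the infected fraction rises to a peak before decaying; with R0 below 1 it decays monotonically from its initial value.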
-
On Controllable Sparse Alternatives to Softmax
Authors:
Anirban Laha,
Saneem A. Chemmengath,
Priyanka Agrawal,
Mitesh M. Khapra,
Karthik Sankaranarayanan,
Harish G. Ramaswamy
Abstract:
Converting an n-dimensional vector to a probability distribution over n objects is a commonly used component in many machine learning tasks like multiclass classification, multilabel classification, attention mechanisms etc. For this, several probability mapping functions have been proposed and employed in literature such as softmax, sum-normalization, spherical softmax, and sparsemax, but there is very little understanding in terms of how they relate with each other. Further, none of the above formulations offer an explicit control over the degree of sparsity. To address this, we develop a unified framework that encompasses all these formulations as special cases. This framework ensures simple closed-form solutions and existence of sub-gradients suitable for learning via backpropagation. Within this framework, we propose two novel sparse formulations, sparsegen-lin and sparsehourglass, that seek to provide a control over the degree of desired sparsity. We further develop novel convex loss functions that help induce the behavior of aforementioned formulations in the multilabel classification setting, showing improved performance. We also demonstrate empirically that the proposed formulations, when used to compute attention weights, achieve better or comparable performance on standard seq2seq tasks like neural machine translation and abstractive summarization.
Submitted 30 October, 2018; v1 submitted 29 October, 2018;
originally announced October 2018.
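Two of the existing mapping functions named in the abstract above, softmax and sparsemax, can be compared directly. This is a minimal sketch of those two baseline maps only, using the well-known sort-and-threshold form of sparsemax (Euclidean projection onto the probability simplex); it does not implement the paper's sparsegen-lin or sparsehourglass, and the function names are assumptions.

```python
import math


def softmax(z):
    """Dense map: every output is strictly positive."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(z):
    """Sparse map: Euclidean projection of z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, v in enumerate(z_sorted, start=1):
        cumsum += v
        # Coordinate j stays in the support while this inequality holds.
        if 1.0 + j * v > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(v - tau, 0.0) for v in z]


scores = [1.0, 0.8, 0.1]
print(softmax(scores))
print(sparsemax(scores))
```

On these scores softmax assigns nonzero mass to all three entries, while sparsemax zeroes out the smallest one; neither exposes a knob for how sparse the output should be, which is the gap the abstract describes.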
-
Mixture Proportion Estimation via Kernel Embedding of Distributions
Authors:
Harish G. Ramaswamy,
Clayton Scott,
Ambuj Tewari
Abstract:
Mixture proportion estimation (MPE) is the problem of estimating the weight of a component distribution in a mixture, given samples from the mixture and component. This problem constitutes a key part in many "weakly supervised learning" problems like learning with positive and unlabelled samples, learning with label noise, anomaly detection and crowdsourcing. While there have been several methods proposed to solve this problem, to the best of our knowledge no efficient algorithm with a proven convergence rate towards the true proportion exists for this problem. We fill this gap by constructing a provably correct algorithm for MPE, and derive convergence rates under certain assumptions on the distribution. Our method is based on embedding distributions onto an RKHS, and implementing it only requires solving a simple convex quadratic programming problem a few times. We run our algorithm on several standard classification datasets, and demonstrate that it performs comparably to or better than other algorithms on most datasets.
Submitted 31 May, 2016; v1 submitted 8 March, 2016;
originally announced March 2016.
-
Consistent Algorithms for Multiclass Classification with a Reject Option
Authors:
Harish G. Ramaswamy,
Ambuj Tewari,
Shivani Agarwal
Abstract:
We consider the problem of $n$-class classification ($n\geq 2$), where the classifier can choose to abstain from making predictions at a given cost, say, a factor $\alpha$ of the cost of misclassification. Designing consistent algorithms for such $n$-class classification problems with a `reject option' is the main goal of this paper, thereby extending and generalizing previously known results for $n=2$. We show that the Crammer-Singer surrogate and the one vs all hinge loss, albeit with a different predictor than the standard argmax, yield consistent algorithms for this problem when $\alpha=\frac{1}{2}$. More interestingly, we design a new convex surrogate that is also consistent for this problem when $\alpha=\frac{1}{2}$ and operates on a much lower dimensional space ($\log(n)$ as opposed to $n$). We also generalize all three surrogates to be consistent for any $\alpha\in[0, \frac{1}{2}]$.
Submitted 15 May, 2015;
originally announced May 2015.
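The reject-option setting in the abstract above has a simple optimal rule when the true class probabilities are known: with misclassification cost 1 and abstention cost equal to the factor alpha, predicting the most likely class has expected cost 1 - max_y p(y), so abstaining is better whenever that exceeds alpha. The sketch below shows only this plug-in rule, not the paper's surrogate-based algorithms; the names `REJECT` and `predict_with_reject` are assumptions.

```python
REJECT = -1  # sentinel for abstaining (hypothetical convention)


def predict_with_reject(probs, alpha):
    """Return the most likely class index, or REJECT if abstaining is cheaper.

    probs: class probabilities p(y | x) summing to 1.
    alpha: abstention cost as a fraction of the misclassification cost.
    """
    best = max(range(len(probs)), key=lambda k: probs[k])
    expected_error_cost = 1.0 - probs[best]
    if expected_error_cost <= alpha:
        return best
    return REJECT


print(predict_with_reject([0.7, 0.2, 0.1], alpha=0.5))
print(predict_with_reject([0.4, 0.35, 0.25], alpha=0.5))
```

The first call predicts class 0 because its expected error cost (0.3) is below the abstention cost; the second abstains because no class is likely enough.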
-
Consistent Classification Algorithms for Multi-class Non-Decomposable Performance Metrics
Authors:
Harish G. Ramaswamy,
Harikrishna Narasimhan,
Shivani Agarwal
Abstract:
We study consistency of learning algorithms for a multi-class performance metric that is a non-decomposable function of the confusion matrix of a classifier and cannot be expressed as a sum of losses on individual data points; examples of such performance metrics include the macro F-measure popular in information retrieval and the G-mean metric used in class-imbalanced problems. While there has been much work in recent years in understanding the consistency properties of learning algorithms for `binary' non-decomposable metrics, little is known either about the form of the optimal classifier for a general multi-class non-decomposable metric, or about how these learning algorithms generalize to the multi-class case. In this paper, we provide a unified framework for analysing a multi-class non-decomposable performance metric, where the problem of finding the optimal classifier for the performance metric is viewed as an optimization problem over the space of all confusion matrices achievable under the given distribution. Using this framework, we show that (under a continuous distribution) the optimal classifier for a multi-class performance metric can be obtained as the solution of a cost-sensitive classification problem, thus generalizing several previous results on specific binary non-decomposable metrics. We then design a consistent learning algorithm for concave multi-class performance metrics that proceeds via a sequence of cost-sensitive classification problems, and can be seen as applying the conditional gradient (CG) optimization method over the space of feasible confusion matrices. To our knowledge, this is the first efficient learning algorithm (whose running time is polynomial in the number of classes) that is consistent for a large family of multi-class non-decomposable metrics. Our consistency proof uses a novel technique based on the convergence analysis of the CG method.
Submitted 1 January, 2015;
originally announced January 2015.
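As an illustration of a metric that is a function of the whole confusion matrix rather than a sum of per-example losses, the G-mean mentioned in the abstract above can be computed as the geometric mean of per-class recalls. This sketch shows only the metric, not the paper's learning algorithm; the function name `g_mean` and the row-is-true-class layout of the matrix are assumptions.

```python
import math


def g_mean(confusion):
    """Geometric mean of per-class recalls.

    confusion[i][j] counts examples of true class i predicted as class j.
    Every class is assumed to have at least one example.
    """
    n = len(confusion)
    recalls = [confusion[i][i] / sum(confusion[i]) for i in range(n)]
    return math.prod(recalls) ** (1.0 / n)


print(g_mean([[8, 2], [1, 9]]))
```

A classifier that ignores one class entirely gets a G-mean of zero regardless of its overall accuracy, which is why the metric is used in class-imbalanced problems.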
-
Convex Calibration Dimension for Multiclass Loss Matrices
Authors:
Harish G. Ramaswamy,
Shivani Agarwal
Abstract:
We study consistency properties of surrogate loss functions for general multiclass learning problems, defined by a general multiclass loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be calibrated with respect to a loss matrix in this setting. We then introduce the notion of convex calibration dimension of a multiclass loss matrix, which measures the smallest `size' of a prediction space in which it is possible to design a convex surrogate that is calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, we apply our framework to study various subset ranking losses, and use the convex calibration dimension as a tool to show both the existence and non-existence of various types of convex calibrated surrogates for these losses. Our results strengthen recent results of Duchi et al. (2010) and Calauzenes et al. (2012) on the non-existence of certain types of convex calibrated surrogates in subset ranking. We anticipate the convex calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems.
Submitted 23 August, 2015; v1 submitted 12 August, 2014;
originally announced August 2014.