-
Transformer In-Context Learning for Categorical Data
Authors:
Aaron T. Wang,
Ricardo Henao,
Lawrence Carin
Abstract:
Recent research has sought to understand Transformers through the lens of in-context learning with functional data. We extend that line of work with the goal of moving closer to language models, considering categorical outcomes, nonlinear underlying models, and nonlinear attention. The contextual data are of the form $\textsf{C}=(x_1,c_1,\dots,x_N,c_{N})$ where each $c_i\in\{0,\dots,C-1\}$ is draw…
▽ More
Recent research has sought to understand Transformers through the lens of in-context learning with functional data. We extend that line of work with the goal of moving closer to language models, considering categorical outcomes, nonlinear underlying models, and nonlinear attention. The contextual data are of the form $\textsf{C}=(x_1,c_1,\dots,x_N,c_{N})$ where each $c_i\in\{0,\dots,C-1\}$ is drawn from a categorical distribution that depends on covariates $x_i\in\mathbb{R}^d$. Contextual outcomes in the $m$th set of contextual data, $\textsf{C}_m$, are modeled in terms of latent function $f_m(x)\in\textsf{F}$, where $\textsf{F}$ is a functional class with $(C-1)$-dimensional vector output. The probability of observing class $c\in\{0,\dots,C-1\}$ is modeled in terms of the output components of $f_m(x)$ via the softmax. The Transformer parameters may be trained with $M$ contextual examples, $\{\textsf{C}_m\}_{m=1,M}$, and the trained model is then applied to new contextual data $\textsf{C}_{M+1}$ for new $f_{M+1}(x)\in\textsf{F}$. The goal is for the Transformer to constitute the probability of each category $c\in\{0,\dots,C-1\}$ for a new query $x_{N_{M+1}+1}$. We assume each component of $f_m(x)$ resides in a reproducing kernel Hilbert space (RKHS), specifying $\textsf{F}$. Analysis and an extensive set of experiments suggest that on its forward pass the Transformer (with attention defined by the RKHS kernel) implements a form of gradient descent of the underlying function, connected to the latent vector function associated with the softmax. We present what is believed to be the first real-world demonstration of this few-shot-learning methodology, using the ImageNet dataset.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
Meta-Learned Attribute Self-Interaction Network for Continual and Generalized Zero-Shot Learning
Authors:
Vinay K Verma,
Nikhil Mehta,
Kevin J Liang,
Aakansha Mishra,
Lawrence Carin
Abstract:
Zero-shot learning (ZSL) is a promising approach to generalizing a model to categories unseen during training by leveraging class attributes, but challenges remain. Recently, methods using generative models to combat bias towards classes seen during training have pushed state of the art, but these generative models can be slow or computationally expensive to train. Also, these generative models as…
▽ More
Zero-shot learning (ZSL) is a promising approach to generalizing a model to categories unseen during training by leveraging class attributes, but challenges remain. Recently, methods using generative models to combat bias towards classes seen during training have pushed state of the art, but these generative models can be slow or computationally expensive to train. Also, these generative models assume that the attribute vector of each unseen class is available a priori at training, which is not always practical. Additionally, while many previous ZSL methods assume a one-time adaptation to unseen classes, in reality, the world is always changing, necessitating a constant adjustment of deployed models. Models unprepared to handle a sequential stream of data are likely to experience catastrophic forgetting. We propose a Meta-learned Attribute self-Interaction Network (MAIN) for continual ZSL. By pairing attribute self-interaction trained using meta-learning with inverse regularization of the attribute encoder, we are able to outperform state-of-the-art results without leveraging the unseen class attributes while also being able to train our models substantially faster (>100x) than expensive generative-based approaches. We demonstrate this with experiments on five standard ZSL datasets (CUB, aPY, AWA1, AWA2, and SUN) in the generalized zero-shot learning and continual (fixed/dynamic) zero-shot learning settings. Extensive ablations and analyses demonstrate the efficacy of various components proposed.
△ Less
Submitted 2 December, 2023;
originally announced December 2023.
-
Open World Classification with Adaptive Negative Samples
Authors:
Ke Bai,
Guoyin Wang,
Jiwei Li,
Sunghyun Park,
Sung** Lee,
Puyang Xu,
Ricardo Henao,
Lawrence Carin
Abstract:
Open world classification is a task in natural language processing with key practical relevance and impact. Since the open or {\em unknown} category data only manifests in the inference phase, finding a model with a suitable decision boundary accommodating for the identification of known classes and discrimination of the open category is challenging. The performance of existing models is limited b…
▽ More
Open world classification is a task in natural language processing with key practical relevance and impact. Since the open or {\em unknown} category data only manifests in the inference phase, finding a model with a suitable decision boundary accommodating for the identification of known classes and discrimination of the open category is challenging. The performance of existing models is limited by the lack of effective open category data during the training stage or the lack of a good mechanism to learn appropriate decision boundaries. We propose an approach based on \underline{a}daptive \underline{n}egative \underline{s}amples (ANS) designed to generate effective synthetic open category samples in the training stage and without requiring any prior knowledge or external datasets. Empirically, we find a significant advantage in using auxiliary one-versus-rest binary classifiers, which effectively utilize the generated negative samples and avoid the complex threshold-seeking stage in previous works. Extensive experiments on three benchmark datasets show that ANS achieves significant improvements over state-of-the-art methods.
△ Less
Submitted 9 March, 2023;
originally announced March 2023.
-
Pushing the Efficiency Limit Using Structured Sparse Convolutions
Authors:
Vinay Kumar Verma,
Nikhil Mehta,
Shi**g Si,
Ricardo Henao,
Lawrence Carin
Abstract:
Weight pruning is among the most popular approaches for compressing deep convolutional neural networks. Recent work suggests that in a randomly initialized deep neural network, there exist sparse subnetworks that achieve performance comparable to the original network. Unfortunately, finding these subnetworks involves iterative stages of training and pruning, which can be computationally expensive.…
▽ More
Weight pruning is among the most popular approaches for compressing deep convolutional neural networks. Recent work suggests that in a randomly initialized deep neural network, there exist sparse subnetworks that achieve performance comparable to the original network. Unfortunately, finding these subnetworks involves iterative stages of training and pruning, which can be computationally expensive. We propose Structured Sparse Convolution (SSC), which leverages the inherent structure in images to reduce the parameters in the convolutional filter. This leads to improved efficiency of convolutional architectures compared to existing methods that perform pruning at initialization. We show that SSC is a generalization of commonly used layers (depthwise, groupwise and pointwise convolution) in ``efficient architectures.'' Extensive experiments on well-known CNN models and datasets show the effectiveness of the proposed method. Architectures based on SSC achieve state-of-the-art performance compared to baselines on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet classification benchmarks.
△ Less
Submitted 23 October, 2022;
originally announced October 2022.
-
Pseudo-OOD training for robust language models
Authors:
Dhanasekar Sundararaman,
Nikhil Mehta,
Lawrence Carin
Abstract:
While pre-trained large-scale deep models have garnered attention as an important topic for many downstream natural language processing (NLP) tasks, such models often make unreliable predictions on out-of-distribution (OOD) inputs. As such, OOD detection is a key component of a reliable machine-learning model for any industry-scale application. Common approaches often assume access to additional O…
▽ More
While pre-trained large-scale deep models have garnered attention as an important topic for many downstream natural language processing (NLP) tasks, such models often make unreliable predictions on out-of-distribution (OOD) inputs. As such, OOD detection is a key component of a reliable machine-learning model for any industry-scale application. Common approaches often assume access to additional OOD samples during the training stage, however, outlier distribution is often unknown in advance. Instead, we propose a post hoc framework called POORE - POsthoc pseudo-Ood REgularization, that generates pseudo-OOD samples using in-distribution (IND) data. The model is fine-tuned by introducing a new regularization loss that separates the embeddings of IND and OOD data, which leads to significant gains on the OOD prediction task during testing. We extensively evaluate our framework on three real-world dialogue systems, achieving new state-of-the-art in OOD detection.
△ Less
Submitted 17 October, 2022;
originally announced October 2022.
-
Collaborative Anomaly Detection
Authors:
Ke Bai,
Aonan Zhang,
Zhizhong Li,
Ricardo Heano,
Chong Wang,
Lawrence Carin
Abstract:
In recommendation systems, items are likely to be exposed to various users and we would like to learn about the familiarity of a new user with an existing item. This can be formulated as an anomaly detection (AD) problem distinguishing between "common users" (nominal) and "fresh users" (anomalous). Considering the sheer volume of items and the sparsity of user-item paired data, independently apply…
▽ More
In recommendation systems, items are likely to be exposed to various users and we would like to learn about the familiarity of a new user with an existing item. This can be formulated as an anomaly detection (AD) problem distinguishing between "common users" (nominal) and "fresh users" (anomalous). Considering the sheer volume of items and the sparsity of user-item paired data, independently applying conventional single-task detection methods on each item quickly becomes difficult, while correlations between items are ignored. To address this multi-task anomaly detection problem, we propose collaborative anomaly detection (CAD) to jointly learn all tasks with an embedding encoding correlations among tasks. We explore CAD with conditional density estimation and conditional likelihood ratio estimation. We found that: $i$) estimating a likelihood ratio enjoys more efficient learning and yields better results than density estimation. $ii$) It is beneficial to select a small number of tasks in advance to learn a task embedding model, and then use it to warm-start all task embeddings. Consequently, these embeddings can capture correlations between tasks and generalize to new correlated tasks.
△ Less
Submitted 20 September, 2022;
originally announced September 2022.
-
Improving Downstream Task Performance by Treating Numbers as Entities
Authors:
Dhanasekar Sundararaman,
Vivek Subramanian,
Guoyin Wang,
Liyan Xu,
Lawrence Carin
Abstract:
Numbers are essential components of text, like any other word tokens, from which natural language processing (NLP) models are built and deployed. Though numbers are typically not accounted for distinctly in most NLP tasks, there is still an underlying amount of numeracy already exhibited by NLP models. In this work, we attempt to tap this potential of state-of-the-art NLP models and transfer their…
▽ More
Numbers are essential components of text, like any other word tokens, from which natural language processing (NLP) models are built and deployed. Though numbers are typically not accounted for distinctly in most NLP tasks, there is still an underlying amount of numeracy already exhibited by NLP models. In this work, we attempt to tap this potential of state-of-the-art NLP models and transfer their ability to boost performance in related tasks. Our proposed classification of numbers into entities helps NLP models perform well on several tasks, including a handcrafted Fill-In-The-Blank (FITB) task and on question answering using joint embeddings, outperforming the BERT and RoBERTa baseline classification.
△ Less
Submitted 18 September, 2022; v1 submitted 7 May, 2022;
originally announced May 2022.
-
elBERto: Self-supervised Commonsense Learning for Question Answering
Authors:
Xunlin Zhan,
Yuan Li,
Xiao Dong,
Xiaodan Liang,
Zhiting Hu,
Lawrence Carin
Abstract:
Commonsense question answering requires reasoning about everyday situations and causes and effects implicit in context. Typically, existing approaches first retrieve external evidence and then perform commonsense reasoning using these evidence. In this paper, we propose a Self-supervised Bidirectional Encoder Representation Learning of Commonsense (elBERto) framework, which is compatible with off-…
▽ More
Commonsense question answering requires reasoning about everyday situations and causes and effects implicit in context. Typically, existing approaches first retrieve external evidence and then perform commonsense reasoning using these evidence. In this paper, we propose a Self-supervised Bidirectional Encoder Representation Learning of Commonsense (elBERto) framework, which is compatible with off-the-shelf QA model architectures. The framework comprises five self-supervised tasks to force the model to fully exploit the additional training signals from contexts containing rich commonsense. The tasks include a novel Contrastive Relation Learning task to encourage the model to distinguish between logically contrastive contexts, a new Jigsaw Puzzle task that requires the model to infer logical chains in long contexts, and three classic SSL tasks to maintain pre-trained models language encoding ability. On the representative WIQA, CosmosQA, and ReClor datasets, elBERto outperforms all other methods, including those utilizing explicit graph reasoning and external knowledge retrieval. Moreover, elBERto achieves substantial improvements on out-of-paragraph and no-effect questions where simple lexical similarity comparison does not help, indicating that it successfully learns commonsense and is able to leverage it when given dynamic context.
△ Less
Submitted 4 October, 2022; v1 submitted 17 March, 2022;
originally announced March 2022.
-
Capturing Actionable Dynamics with Structured Latent Ordinary Differential Equations
Authors:
Paidamoyo Chapfuwa,
Sherri Rose,
Lawrence Carin,
Edward Meeds,
Ricardo Henao
Abstract:
End-to-end learning of dynamical systems with black-box models, such as neural ordinary differential equations (ODEs), provides a flexible framework for learning dynamics from data without prescribing a mathematical model for the dynamics. Unfortunately, this flexibility comes at the cost of understanding the dynamical system, for which ODEs are used ubiquitously. Further, experimental data are co…
▽ More
End-to-end learning of dynamical systems with black-box models, such as neural ordinary differential equations (ODEs), provides a flexible framework for learning dynamics from data without prescribing a mathematical model for the dynamics. Unfortunately, this flexibility comes at the cost of understanding the dynamical system, for which ODEs are used ubiquitously. Further, experimental data are collected under various conditions (inputs), such as treatments, or grouped in some way, such as part of sub-populations. Understanding the effects of these system inputs on system outputs is crucial to have any meaningful model of a dynamical system. To that end, we propose a structured latent ODE model that explicitly captures system input variations within its latent representation. Building on a static latent variable specification, our model learns (independent) stochastic factors of variation for each input to the system, thus separating the effects of the system inputs in the latent space. This approach provides actionable modeling through the controlled generation of time-series data for novel input combinations (or perturbations). Additionally, we propose a flexible approach for quantifying uncertainties, leveraging a quantile regression formulation. Results on challenging biological datasets show consistent improvements over competitive baselines in the controlled generation of observational data and inference of biologically meaningful system inputs.
△ Less
Submitted 16 June, 2022; v1 submitted 25 February, 2022;
originally announced February 2022.
-
Explainable multiple abnormality classification of chest CT volumes
Authors:
Rachel Lea Draelos,
Lawrence Carin
Abstract:
Understanding model predictions is critical in healthcare, to facilitate rapid verification of model correctness and to guard against use of models that exploit confounding variables. We introduce the challenging new task of explainable multiple abnormality classification in volumetric medical images, in which a model must indicate the regions used to predict each abnormality. To solve this task,…
▽ More
Understanding model predictions is critical in healthcare, to facilitate rapid verification of model correctness and to guard against use of models that exploit confounding variables. We introduce the challenging new task of explainable multiple abnormality classification in volumetric medical images, in which a model must indicate the regions used to predict each abnormality. To solve this task, we propose a multiple instance learning convolutional neural network, AxialNet, that allows identification of top slices for each abnormality. Next we incorporate HiResCAM, an attention mechanism, to identify sub-slice regions. We prove that for AxialNet, HiResCAM explanations are guaranteed to reflect the locations the model used, unlike Grad-CAM which sometimes highlights irrelevant locations. Armed with a model that produces faithful explanations, we then aim to improve the model's learning through a novel mask loss that leverages HiResCAM and 3D allowed regions to encourage the model to predict abnormalities based only on the organs in which those abnormalities appear. The 3D allowed regions are obtained automatically through a new approach, PARTITION, that combines location information extracted from radiology reports with organ segmentation maps obtained through morphological image processing. Overall, we propose the first model for explainable multi-abnormality prediction in volumetric medical images, and then use the mask loss to achieve a 33% improvement in organ localization of multiple abnormalities in the RAD-ChestCT data set of 36,316 scans, representing the state of the art. This work advances the clinical applicability of multiple abnormality modeling in chest CT volumes.
△ Less
Submitted 20 August, 2022; v1 submitted 23 November, 2021;
originally announced November 2021.
-
Finite-Time Consensus Learning for Decentralized Optimization with Nonlinear Gossi**
Authors:
Junya Chen,
Sijia Wang,
Lawrence Carin,
Chenyang Tao
Abstract:
Distributed learning has become an integral tool for scaling up machine learning and addressing the growing need for data privacy. Although more robust to the network topology, decentralized learning schemes have not gained the same level of popularity as their centralized counterparts for being less competitive performance-wise. In this work, we attribute this issue to the lack of synchronization…
▽ More
Distributed learning has become an integral tool for scaling up machine learning and addressing the growing need for data privacy. Although more robust to the network topology, decentralized learning schemes have not gained the same level of popularity as their centralized counterparts for being less competitive performance-wise. In this work, we attribute this issue to the lack of synchronization among decentralized learning workers, showing both empirically and theoretically that the convergence rate is tied to the synchronization level among the workers. Such motivated, we present a novel decentralized learning framework based on nonlinear gossi** (NGO), that enjoys an appealing finite-time consensus property to achieve better synchronization. We provide a careful analysis of its convergence and discuss its merits for modern distributed optimization applications, such as deep neural networks. Our analysis on how communication delay and randomized chats affect learning further enables the derivation of practical variants that accommodate asynchronous and randomized communications. To validate the effectiveness of our proposal, we benchmark NGO against competing solutions through an extensive set of tests, with encouraging results reported.
△ Less
Submitted 13 November, 2021; v1 submitted 4 November, 2021;
originally announced November 2021.
-
Variational Inference with Holder Bounds
Authors:
Junya Chen,
Danni Lu,
Zidi Xiu,
Ke Bai,
Lawrence Carin,
Chenyang Tao
Abstract:
The recent introduction of thermodynamic integration techniques has provided a new framework for understanding and improving variational inference (VI). In this work, we present a careful analysis of the thermodynamic variational objective (TVO), bridging the gap between existing variational objectives and shedding new insights to advance the field. In particular, we elucidate how the TVO naturall…
▽ More
The recent introduction of thermodynamic integration techniques has provided a new framework for understanding and improving variational inference (VI). In this work, we present a careful analysis of the thermodynamic variational objective (TVO), bridging the gap between existing variational objectives and shedding new insights to advance the field. In particular, we elucidate how the TVO naturally connects the three key variational schemes, namely the importance-weighted VI, Renyi-VI, and MCMC-VI, which subsumes most VI objectives employed in practice. To explain the performance gap between theory and practice, we reveal how the pathological geometry of thermodynamic curves negatively affects TVO. By generalizing the integration path from the geometric mean to the weighted Holder mean, we extend the theory of TVO and identify new opportunities for improving VI. This motivates our new VI objectives, named the Holder bounds, which flatten the thermodynamic curves and promise to achieve a one-step approximation of the exact marginal log-likelihood. A comprehensive discussion on the choices of numerical estimators is provided. We present strong empirical evidence on both synthetic and real-world datasets to support our claims.
△ Less
Submitted 13 November, 2021; v1 submitted 4 November, 2021;
originally announced November 2021.
-
Hölder Bounds for Sensitivity Analysis in Causal Reasoning
Authors:
Serge Assaad,
Shuxi Zeng,
Henry Pfister,
Fan Li,
Lawrence Carin
Abstract:
We examine interval estimation of the effect of a treatment T on an outcome Y given the existence of an unobserved confounder U. Using Hölder's inequality, we derive a set of bounds on the confounding bias |E[Y|T=t]-E[Y|do(T=t)]| based on the degree of unmeasured confounding (i.e., the strength of the connection U->T, and the strength of U->Y). These bounds are tight either when U is independent o…
▽ More
We examine interval estimation of the effect of a treatment T on an outcome Y given the existence of an unobserved confounder U. Using Hölder's inequality, we derive a set of bounds on the confounding bias |E[Y|T=t]-E[Y|do(T=t)]| based on the degree of unmeasured confounding (i.e., the strength of the connection U->T, and the strength of U->Y). These bounds are tight either when U is independent of T or when U is independent of Y given T (when there is no unobserved confounding). We focus on a special case of this bound depending on the total variation distance between the distributions p(U) and p(U|T=t), as well as the maximum (over all possible values of U) deviation of the conditional expected outcome E[Y|U=u,T=t] from the average expected outcome E[Y|T=t]. We discuss possible calibration strategies for this bound to get interval estimates for treatment effects, and experimentally validate the bound using synthetic and semi-synthetic datasets.
△ Less
Submitted 9 July, 2021;
originally announced July 2021.
-
Gradient Importance Learning for Incomplete Observations
Authors:
Qitong Gao,
Dong Wang,
Joshua D. Amason,
Siyang Yuan,
Chenyang Tao,
Ricardo Henao,
Majda Hadziahmetovic,
Lawrence Carin,
Miroslav Pajic
Abstract:
Though recent works have developed methods that can generate estimates (or imputations) of the missing entries in a dataset to facilitate downstream analysis, most depend on assumptions that may not align with real-world applications and could suffer from poor performance in subsequent tasks such as classification. This is particularly true if the data have large missingness rates or a small sampl…
▽ More
Though recent works have developed methods that can generate estimates (or imputations) of the missing entries in a dataset to facilitate downstream analysis, most depend on assumptions that may not align with real-world applications and could suffer from poor performance in subsequent tasks such as classification. This is particularly true if the data have large missingness rates or a small sample size. More importantly, the imputation error could be propagated into the prediction step that follows, which may constrain the capabilities of the prediction model. In this work, we introduce the gradient importance learning (GIL) method to train multilayer perceptrons (MLPs) and long short-term memories (LSTMs) to directly perform inference from inputs containing missing values without imputation. Specifically, we employ reinforcement learning (RL) to adjust the gradients used to train these models via back-propagation. This allows the model to exploit the underlying information behind missingness patterns. We test the approach on real-world time-series (i.e., MIMIC-III), tabular data obtained from an eye clinic, and a standard dataset (i.e., MNIST), where our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
△ Less
Submitted 1 March, 2022; v1 submitted 5 July, 2021;
originally announced July 2021.
-
Simpler, Faster, Stronger: Breaking The log-K Curse On Contrastive Learners With FlatNCE
Authors:
Junya Chen,
Zhe Gan,
Xuan Li,
Qing Guo,
Liqun Chen,
Shuyang Gao,
Tagyoung Chung,
Yi Xu,
Belinda Zeng,
Wenlian Lu,
Fan Li,
Lawrence Carin,
Chenyang Tao
Abstract:
InfoNCE-based contrastive representation learners, such as SimCLR, have been tremendously successful in recent years. However, these contrastive schemes are notoriously resource demanding, as their effectiveness breaks down with small-batch training (i.e., the log-K curse, whereas K is the batch-size). In this work, we reveal mathematically why contrastive learners fail in the small-batch-size reg…
▽ More
InfoNCE-based contrastive representation learners, such as SimCLR, have been tremendously successful in recent years. However, these contrastive schemes are notoriously resource demanding, as their effectiveness breaks down with small-batch training (i.e., the log-K curse, whereas K is the batch-size). In this work, we reveal mathematically why contrastive learners fail in the small-batch-size regime, and present a novel simple, non-trivial contrastive objective named FlatNCE, which fixes this issue. Unlike InfoNCE, our FlatNCE no longer explicitly appeals to a discriminative classification goal for contrastive learning. Theoretically, we show FlatNCE is the mathematical dual formulation of InfoNCE, thus bridging the classical literature on energy modeling; and empirically, we demonstrate that, with minimal modification of code, FlatNCE enables immediate performance boost independent of the subject-matter engineering efforts. The significance of this work is furthered by the powerful generalization of contrastive learning techniques, and the introduction of new tools to monitor and diagnose contrastive training. We substantiate our claims with empirical evidence on CIFAR10, ImageNet, and other datasets, where FlatNCE consistently outperforms InfoNCE.
△ Less
Submitted 2 July, 2021;
originally announced July 2021.
-
Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization
Authors:
Qing Guo,
Junya Chen,
Dong Wang,
Yuewei Yang,
Xinwei Deng,
Lawrence Carin,
Fan Li,
**g Huang,
Chenyang Tao
Abstract:
Successful applications of InfoNCE and its variants have popularized the use of contrastive variational mutual information (MI) estimators in machine learning. While featuring superior stability, these estimators crucially depend on costly large-batch training, and they sacrifice bound tightness for variance reduction. To overcome these limitations, we revisit the mathematics of popular variationa…
▽ More
Successful applications of InfoNCE and its variants have popularized the use of contrastive variational mutual information (MI) estimators in machine learning. While featuring superior stability, these estimators crucially depend on costly large-batch training, and they sacrifice bound tightness for variance reduction. To overcome these limitations, we revisit the mathematics of popular variational MI bounds from the lens of unnormalized statistical modeling and convex optimization. Our investigation not only yields a new unified theoretical framework encompassing popular variational MI bounds but also leads to a novel, simple, and powerful contrastive MI estimator named as FLO. Theoretically, we show that the FLO estimator is tight, and it provably converges under stochastic gradient descent. Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently. The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.
△ Less
Submitted 24 October, 2022; v1 submitted 2 July, 2021;
originally announced July 2021.
-
Towards Fair Federated Learning with Zero-Shot Data Augmentation
Authors:
Weituo Hao,
Mostafa El-Khamy,
Jungwon Lee,
Jianyi Zhang,
Kevin J Liang,
Changyou Chen,
Lawrence Carin
Abstract:
Federated learning has emerged as an important distributed learning paradigm, where a server aggregates a global model from many client-trained models while having no access to the client data. Although it is recognized that statistical heterogeneity of the client local data yields slower global model convergence, it is less commonly recognized that it also yields a biased federated global model w…
▽ More
Federated learning has emerged as an important distributed learning paradigm, where a server aggregates a global model from many client-trained models while having no access to the client data. Although it is recognized that statistical heterogeneity of the client local data yields slower global model convergence, it is less commonly recognized that it also yields a biased federated global model with a high variance of accuracy across clients. In this work, we aim to provide federated learning schemes with improved fairness. To tackle this challenge, we propose a novel federated learning system that employs zero-shot data augmentation on under-represented data to mitigate statistical heterogeneity and encourage more uniform accuracy performance across clients in federated networks. We study two variants of this scheme, Fed-ZDAC (federated learning with zero-shot data augmentation at the clients) and Fed-ZDAS (federated learning with zero-shot data augmentation at the server). Empirical results on a suite of datasets demonstrate the effectiveness of our methods on simultaneously improving the test accuracy and fairness.
△ Less
Submitted 27 April, 2021;
originally announced April 2021.
-
Malignancy Prediction and Lesion Identification from Clinical Dermatological Images
Authors:
Meng Xia,
Meenal K. Kheterpal,
Samantha C. Wong,
Christine Park,
William Ratliff,
Lawrence Carin,
Ricardo Henao
Abstract:
We consider machine-learning-based malignancy prediction and lesion identification from clinical dermatological images, which can be indistinctly acquired via smartphone or dermoscopy capture. Additionally, we do not assume that images contain single lesions, thus the framework supports both focal or wide-field images. Specifically, we propose a two-stage approach in which we first identify all le…
▽ More
We consider machine-learning-based malignancy prediction and lesion identification from clinical dermatological images, which can be indistinctly acquired via smartphone or dermoscopy capture. Additionally, we do not assume that images contain single lesions, thus the framework supports both focal or wide-field images. Specifically, we propose a two-stage approach in which we first identify all lesions present in the image regardless of sub-type or likelihood of malignancy, then it estimates their likelihood of malignancy, and through aggregation, it also generates an image-level likelihood of malignancy that can be used for high-level screening processes. Further, we consider augmenting the proposed approach with clinical covariates (from electronic health records) and publicly available data (the ISIC dataset). Comprehensive experiments validated on an independent test dataset demonstrate that i) the proposed approach outperforms alternative model architectures; ii) the model based on images outperforms a pure clinical model by a large margin, and the combination of images and clinical data does not significantly improves over the image-only model; and iii) the proposed framework offers comparable performance in terms of malignancy classification relative to three board certified dermatologists with different levels of experience.
△ Less
Submitted 2 April, 2021;
originally announced April 2021.
-
Efficient Feature Transformations for Discriminative and Generative Continual Learning
Authors:
Vinay Kumar Verma,
Kevin J Liang,
Nikhil Mehta,
Piyush Rai,
Lawrence Carin
Abstract:
As neural networks are increasingly being applied to real-world applications, mechanisms to address distributional shift and sequential task learning without forgetting are critical. Methods incorporating network expansion have shown promise by naturally adding model capacity for learning new tasks while simultaneously avoiding catastrophic forgetting. However, the growth in the number of addition…
▽ More
As neural networks are increasingly being applied to real-world applications, mechanisms to address distributional shift and sequential task learning without forgetting are critical. Methods incorporating network expansion have shown promise by naturally adding model capacity for learning new tasks while simultaneously avoiding catastrophic forgetting. However, the growth in the number of additional parameters of many of these types of methods can be computationally expensive at larger scales, at times prohibitively so. Instead, we propose a simple task-specific feature map transformation strategy for continual learning, which we call Efficient Feature Transformations (EFTs). These EFTs provide powerful flexibility for learning new tasks, achieved with minimal parameters added to the base architecture. We further propose a feature distance maximization strategy, which significantly improves task prediction in class incremental settings, without needing expensive generative models. We demonstrate the efficacy and efficiency of our method with an extensive set of experiments in discriminative (CIFAR-100 and ImageNet-1K) and generative (LSUN, CUB-200, Cats) sequences of tasks. Even with low single-digit parameter growth rates, EFTs can outperform many other continual learning methods in a wide range of settings.
△ Less
Submitted 24 March, 2021;
originally announced March 2021.
-
Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning
Authors:
Siyang Yuan,
Pengyu Cheng,
Ruiyi Zhang,
Weituo Hao,
Zhe Gan,
Lawrence Carin
Abstract:
Voice style transfer, also called voice conversion, seeks to modify one speaker's voice to generate speech as if it came from another (target) speaker. Previous works have made progress on voice conversion with parallel training data and pre-known speakers. However, zero-shot voice style transfer, which learns from non-parallel data and generates voices for previously unseen speakers, remains a ch…
▽ More
Voice style transfer, also called voice conversion, seeks to modify one speaker's voice to generate speech as if it came from another (target) speaker. Previous works have made progress on voice conversion with parallel training data and pre-known speakers. However, zero-shot voice style transfer, which learns from non-parallel data and generates voices for previously unseen speakers, remains a challenging problem. We propose a novel zero-shot voice transfer method via disentangled representation learning. The proposed method first encodes speaker-related style and voice content of each input voice into separated low-dimensional embedding spaces, and then transfers to a new voice by combining the source content embedding and target style embedding through a decoder. With information-theoretic guidance, the style and content embedding spaces are representative and (ideally) independent of each other. On real-world VCTK datasets, our method outperforms other baselines and obtains state-of-the-art results in terms of transfer accuracy and voice naturalness for voice style transfer experiments under both many-to-many and zero-shot setups.
△ Less
Submitted 16 March, 2021;
originally announced March 2021.
-
FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders
Authors:
Pengyu Cheng,
Weituo Hao,
Siyang Yuan,
Shi**g Si,
Lawrence Carin
Abstract:
Pretrained text encoders, such as BERT, have been applied increasingly in various natural language processing (NLP) tasks, and have recently demonstrated significant performance gains. However, recent studies have demonstrated the existence of social bias in these pretrained NLP models. Although prior works have made progress on word-level debiasing, improved sentence-level fairness of pretrained…
▽ More
Pretrained text encoders, such as BERT, have been applied increasingly in various natural language processing (NLP) tasks, and have recently demonstrated significant performance gains. However, recent studies have demonstrated the existence of social bias in these pretrained NLP models. Although prior works have made progress on word-level debiasing, improved sentence-level fairness of pretrained encoders still lacks exploration. In this paper, we proposed the first neural debiasing method for a pretrained sentence encoder, which transforms the pretrained encoder outputs into debiased representations via a fair filter (FairFil) network. To learn the FairFil, we introduce a contrastive learning framework that not only minimizes the correlation between filtered embeddings and bias words but also preserves rich semantic information of the original sentences. On real-world datasets, our FairFil effectively reduces the bias degree of pretrained text encoders, while continuously showing desirable performance on downstream tasks. Moreover, our post-hoc method does not require any retraining of the text encoders, further enlarging FairFil's application space.
△ Less
Submitted 10 March, 2021;
originally announced March 2021.
-
CAM-GAN: Continual Adaptation Modules for Generative Adversarial Networks
Authors:
Sakshi Varshney,
Vinay Kumar Verma,
Srijith P K,
Lawrence Carin,
Piyush Rai
Abstract:
We present a continual learning approach for generative adversarial networks (GANs), by designing and leveraging parameter-efficient feature map transformations. Our approach is based on learning a set of global and task-specific parameters. The global parameters are fixed across tasks whereas the task-specific parameters act as local adapters for each task, and help in efficiently obtaining task-…
▽ More
We present a continual learning approach for generative adversarial networks (GANs), by designing and leveraging parameter-efficient feature map transformations. Our approach is based on learning a set of global and task-specific parameters. The global parameters are fixed across tasks whereas the task-specific parameters act as local adapters for each task, and help in efficiently obtaining task-specific feature maps. Moreover, we propose an element-wise addition of residual bias in the transformed feature space, which further helps stabilize GAN training in such settings. Our approach also leverages task similarity information based on the Fisher information matrix. Leveraging this knowledge from previous tasks significantly improves the model performance. In addition, the similarity measure also helps reduce the parameter growth in continual adaptation and helps to learn a compact model. In contrast to the recent approaches for continually-learned GANs, the proposed approach provides a memory-efficient way to perform effective continual data generation. Through extensive experiments on challenging and diverse datasets, we show that the feature-map-transformation approach outperforms state-of-the-art methods for continually-learned GANs, with substantially fewer parameters. The proposed method generates high-quality samples that can also improve the generative-replay-based continual learning for discriminative tasks.
△ Less
Submitted 30 July, 2021; v1 submitted 6 March, 2021;
originally announced March 2021.
-
Meta-Learned Attribute Self-Gating for Continual Generalized Zero-Shot Learning
Authors:
Vinay Kumar Verma,
Kevin Liang,
Nikhil Mehta,
Lawrence Carin
Abstract:
Zero-shot learning (ZSL) has been shown to be a promising approach to generalizing a model to categories unseen during training by leveraging class attributes, but challenges still remain. Recently, methods using generative models to combat bias towards classes seen during training have pushed the state of the art of ZSL, but these generative models can be slow or computationally expensive to trai…
▽ More
Zero-shot learning (ZSL) has been shown to be a promising approach to generalizing a model to categories unseen during training by leveraging class attributes, but challenges still remain. Recently, methods using generative models to combat bias towards classes seen during training have pushed the state of the art of ZSL, but these generative models can be slow or computationally expensive to train. Additionally, while many previous ZSL methods assume a one-time adaptation to unseen classes, in reality, the world is always changing, necessitating a constant adjustment for deployed models. Models unprepared to handle a sequential stream of data are likely to experience catastrophic forgetting. We propose a meta-continual zero-shot learning (MCZSL) approach to address both these issues. In particular, by pairing self-gating of attributes and scaled class normalization with meta-learning based training, we are able to outperform state-of-the-art results while being able to train our models substantially faster ($>100\times$) than expensive generative-based approaches. We demonstrate this by performing experiments on five standard ZSL datasets (CUB, aPY, AWA1, AWA2 and SUN) in both generalized zero-shot learning and generalized continual zero-shot learning settings.
△ Less
Submitted 23 February, 2021;
originally announced February 2021.
-
FLOP: Federated Learning on Medical Datasets using Partial Networks
Authors:
Qian Yang,
Jianyi Zhang,
Weituo Hao,
Gregory Spell,
Lawrence Carin
Abstract:
The outbreak of COVID-19 Disease due to the novel coronavirus has caused a shortage of medical resources. To aid and accelerate the diagnosis process, automatic diagnosis of COVID-19 via deep learning models has recently been explored by researchers across the world. While different data-driven deep learning models have been developed to mitigate the diagnosis of COVID-19, the data itself is still…
▽ More
The outbreak of COVID-19 Disease due to the novel coronavirus has caused a shortage of medical resources. To aid and accelerate the diagnosis process, automatic diagnosis of COVID-19 via deep learning models has recently been explored by researchers across the world. While different data-driven deep learning models have been developed to mitigate the diagnosis of COVID-19, the data itself is still scarce due to patient privacy concerns. Federated Learning (FL) is a natural solution because it allows different organizations to cooperatively learn an effective deep learning model without sharing raw data. However, recent studies show that FL still lacks privacy protection and may cause data leakage. We investigate this challenging problem by proposing a simple yet effective algorithm, named \textbf{F}ederated \textbf{L}earning \textbf{o}n Medical Datasets using \textbf{P}artial Networks (FLOP), that shares only a partial model between the server and clients. Extensive experiments on benchmark data and real-world healthcare tasks show that our approach achieves comparable or better performance while reducing the privacy and security risks. Of particular interest, we conduct experiments on the COVID-19 dataset and find that our FLOP algorithm can allow different hospitals to collaboratively and effectively train a partially shared model without sharing local patients' data.
△ Less
Submitted 22 June, 2021; v1 submitted 9 February, 2021;
originally announced February 2021.
-
What Makes Good In-Context Examples for GPT-$3$?
Authors:
Jiachang Liu,
Dinghan Shen,
Yizhe Zhang,
Bill Dolan,
Lawrence Carin,
Weizhu Chen
Abstract:
GPT-$3$ has attracted lots of attention due to its superior performance across a wide range of NLP tasks, especially with its powerful and versatile in-context few-shot learning ability. Despite its success, we found that the empirical results of GPT-$3$ depend heavily on the choice of in-context examples. In this work, we investigate whether there are more effective strategies for judiciously sel…
▽ More
GPT-$3$ has attracted lots of attention due to its superior performance across a wide range of NLP tasks, especially with its powerful and versatile in-context few-shot learning ability. Despite its success, we found that the empirical results of GPT-$3$ depend heavily on the choice of in-context examples. In this work, we investigate whether there are more effective strategies for judiciously selecting in-context examples (relative to random sampling) that better leverage GPT-$3$'s few-shot capabilities. Inspired by the recent success of leveraging a retrieval module to augment large-scale neural network models, we propose to retrieve examples that are semantically-similar to a test sample to formulate its corresponding prompt. Intuitively, the in-context examples selected with such a strategy may serve as more informative inputs to unleash GPT-$3$'s extensive knowledge. We evaluate the proposed approach on several natural language understanding and generation benchmarks, where the retrieval-based prompt selection approach consistently outperforms the random baseline. Moreover, it is observed that the sentence encoders fine-tuned on task-related datasets yield even more helpful retrieval results. Notably, significant gains are observed on tasks such as table-to-text generation (41.9% on the ToTTo dataset) and open-domain question answering (45.5% on the NQ dataset). We hope our investigation could help understand the behaviors of GPT-$3$ and large-scale pre-trained LMs in general and enhance their few-shot capabilities.
△ Less
Submitted 17 January, 2021;
originally announced January 2021.
-
Reinforcement Learning for Flexibility Design Problems
Authors:
Yehua Wei,
Lei Zhang,
Ruiyi Zhang,
Shi**g Si,
Hao Zhang,
Lawrence Carin
Abstract:
Flexibility design problems are a class of problems that appear in strategic decision-making across industries, where the objective is to design a ($e.g.$, manufacturing) network that affords flexibility and adaptivity. The underlying combinatorial nature and stochastic objectives make flexibility design problems challenging for standard optimization methods. In this paper, we develop a reinforcem…
▽ More
Flexibility design problems are a class of problems that appear in strategic decision-making across industries, where the objective is to design a ($e.g.$, manufacturing) network that affords flexibility and adaptivity. The underlying combinatorial nature and stochastic objectives make flexibility design problems challenging for standard optimization methods. In this paper, we develop a reinforcement learning (RL) framework for flexibility design problems. Specifically, we carefully design mechanisms with noisy exploration and variance reduction to ensure empirical success and show the unique advantage of RL in terms of fast-adaptation. Empirical results show that the RL-based method consistently finds better solutions compared to classical heuristics.
△ Less
Submitted 18 January, 2021; v1 submitted 1 January, 2021;
originally announced January 2021.
-
Wasserstein Contrastive Representation Distillation
Authors:
Liqun Chen,
Dong Wang,
Zhe Gan,
**g**g Liu,
Ricardo Henao,
Lawrence Carin
Abstract:
The primary goal of knowledge distillation (KD) is to encapsulate the information of a model learned from a teacher network into a student network, with the latter being more compact than the former. Existing work, e.g., using Kullback-Leibler divergence for distillation, may fail to capture important structural knowledge in the teacher network and often lacks the ability for feature generalizatio…
▽ More
The primary goal of knowledge distillation (KD) is to encapsulate the information of a model learned from a teacher network into a student network, with the latter being more compact than the former. Existing work, e.g., using Kullback-Leibler divergence for distillation, may fail to capture important structural knowledge in the teacher network and often lacks the ability for feature generalization, particularly in situations when teacher and student are built to address different classification tasks. We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for KD. The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks. The primal form is used for local contrastive knowledge transfer within a mini-batch, effectively matching the distributions of features between the teacher and the student networks. Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
△ Less
Submitted 28 March, 2021; v1 submitted 15 December, 2020;
originally announced December 2020.
-
Learning Graphons via Structured Gromov-Wasserstein Barycenters
Authors:
Hongteng Xu,
Dixin Luo,
Lawrence Carin,
Hongyuan Zha
Abstract:
We propose a novel and principled method to learn a nonparametric graph model called graphon, which is defined in an infinite-dimensional space and represents arbitrary-size graphs. Based on the weak regularity lemma from the theory of graphons, we leverage a step function to approximate a graphon. We show that the cut distance of graphons can be relaxed to the Gromov-Wasserstein distance of their…
▽ More
We propose a novel and principled method to learn a nonparametric graph model called graphon, which is defined in an infinite-dimensional space and represents arbitrary-size graphs. Based on the weak regularity lemma from the theory of graphons, we leverage a step function to approximate a graphon. We show that the cut distance of graphons can be relaxed to the Gromov-Wasserstein distance of their step functions. Accordingly, given a set of graphs generated by an underlying graphon, we learn the corresponding step function as the Gromov-Wasserstein barycenter of the given graphs. Furthermore, we develop several enhancements and extensions of the basic algorithm, $e.g.$, the smoothed Gromov-Wasserstein barycenter for guaranteeing the continuity of the learned graphons and the mixed Gromov-Wasserstein barycenters for learning multiple structured graphons. The proposed approach overcomes drawbacks of prior state-of-the-art methods, and outperforms them on both synthetic and real-world data. The code is available at https://github.com/HongtengXu/SGWB-Graphon.
△ Less
Submitted 17 December, 2020; v1 submitted 10 December, 2020;
originally announced December 2020.
-
Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models
Authors:
Dong Wang,
Yuewei Yang,
Chenyang Tao,
Zhe Gan,
Liqun Chen,
Fanjie Kong,
Ricardo Henao,
Lawrence Carin
Abstract:
Deep neural networks excel at comprehending complex visual signals, delivering on par or even superior performance to that of human experts. However, ad-hoc visual explanations of model decisions often reveal an alarming level of reliance on exploiting non-causal visual cues that strongly correlate with the target label in training data. As such, deep neural nets suffer compromised generalization…
▽ More
Deep neural networks excel at comprehending complex visual signals, delivering on par or even superior performance to that of human experts. However, ad-hoc visual explanations of model decisions often reveal an alarming level of reliance on exploiting non-causal visual cues that strongly correlate with the target label in training data. As such, deep neural nets suffer compromised generalization to novel inputs collected from different sources, and the reverse engineering of their decision rules offers limited interpretability. To overcome these limitations, we present a novel contrastive learning strategy called {\it Proactive Pseudo-Intervention} (PPI) that leverages proactive interventions to guard against image features with no causal relevance. We also devise a novel causally informed salience map** module to identify key image pixels to intervene, and show it greatly facilitates model interpretability. To demonstrate the utility of our proposals, we benchmark on both standard natural images and challenging medical image datasets. PPI-enhanced models consistently deliver superior performance relative to competing solutions, especially on out-of-domain predictions and data integration from heterogeneous sources. Further, our causally trained saliency maps are more succinct and meaningful relative to their non-causal counterparts.
△ Less
Submitted 29 April, 2021; v1 submitted 6 December, 2020;
originally announced December 2020.
-
Supercharging Imbalanced Data Learning With Energy-based Contrastive Representation Transfer
Authors:
Zidi Xiu,
Junya Chen,
Ricardo Henao,
Benjamin Goldstein,
Lawrence Carin,
Chenyang Tao
Abstract:
Dealing with severe class imbalance poses a major challenge for real-world applications, especially when the accurate classification and generalization of minority classes is of primary interest. In computer vision, learning from long tailed datasets is a recurring theme, especially for natural image datasets. While existing solutions mostly appeal to sampling or weighting adjustments to alleviate…
▽ More
Dealing with severe class imbalance poses a major challenge for real-world applications, especially when the accurate classification and generalization of minority classes is of primary interest. In computer vision, learning from long tailed datasets is a recurring theme, especially for natural image datasets. While existing solutions mostly appeal to sampling or weighting adjustments to alleviate the pathological imbalance, or imposing inductive bias to prioritize non-spurious associations, we take novel perspectives to promote sample efficiency and model generalization based on the invariance principles of causality. Our proposal posits a meta-distributional scenario, where the data generating mechanism is invariant across the label-conditional feature distributions. Such causal assumption enables efficient knowledge transfer from the dominant classes to their under-represented counterparts, even if the respective feature distributions show apparent disparities. This allows us to leverage a causal data inflation procedure to enlarge the representation of minority classes. Our development is orthogonal to the existing extreme classification techniques thus can be seamlessly integrated. The utility of our proposal is validated with an extensive set of synthetic and real-world computer vision tasks against SOTA solutions.
△ Less
Submitted 19 November, 2021; v1 submitted 24 November, 2020;
originally announced November 2020.
-
Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks
Authors:
Rachel Lea Draelos,
Lawrence Carin
Abstract:
Explanation methods facilitate the development of models that learn meaningful concepts and avoid exploiting spurious correlations. We illustrate a previously unrecognized limitation of the popular neural network explanation method Grad-CAM: as a side effect of the gradient averaging step, Grad-CAM sometimes highlights locations the model did not actually use. To solve this problem, we propose HiR…
▽ More
Explanation methods facilitate the development of models that learn meaningful concepts and avoid exploiting spurious correlations. We illustrate a previously unrecognized limitation of the popular neural network explanation method Grad-CAM: as a side effect of the gradient averaging step, Grad-CAM sometimes highlights locations the model did not actually use. To solve this problem, we propose HiResCAM, a novel class-specific explanation method that is guaranteed to highlight only the locations the model used to make each prediction. We prove that HiResCAM is a generalization of CAM and explore the relationships between HiResCAM and other gradient-based explanation methods. Experiments on PASCAL VOC 2012, including crowd-sourced evaluations, illustrate that while HiResCAM's explanations faithfully reflect the model, Grad-CAM often expands the attention to create bigger and smoother visualizations. Overall, this work advances convolutional neural network explanation approaches and may aid in the development of trustworthy models for sensitive applications.
△ Less
Submitted 20 November, 2021; v1 submitted 17 November, 2020;
originally announced November 2020.
-
Estimating Total Correlation with Mutual Information Estimators
Authors:
Ke Bai,
Pengyu Cheng,
Weituo Hao,
Ricardo Henao,
Lawrence Carin
Abstract:
Total correlation (TC) is a fundamental concept in information theory that measures statistical dependency among multiple random variables. Recently, TC has shown noticeable effectiveness as a regularizer in many learning tasks, where the correlation among multiple latent embeddings requires to be jointly minimized or maximized. However, calculating precise TC values is challenging, especially whe…
▽ More
Total correlation (TC) is a fundamental concept in information theory that measures statistical dependency among multiple random variables. Recently, TC has shown noticeable effectiveness as a regularizer in many learning tasks, where the correlation among multiple latent embeddings requires to be jointly minimized or maximized. However, calculating precise TC values is challenging, especially when the closed-form distributions of embedding variables are unknown. In this paper, we introduce a unified framework to estimate total correlation values with sample-based mutual information (MI) estimators. More specifically, we discover a relation between TC and MI and propose two types of calculation paths (tree-like and line-like) to decompose TC into MI terms. With each MI term being bounded, the TC values can be successfully estimated. Further, we provide theoretical analyses concerning the statistical consistency of the proposed TC estimators. Experiments are presented on both synthetic and real-world scenarios, where our estimators demonstrate effectiveness in all TC estimation, minimization, and maximization tasks. The code is available at https://github.com/Linear95/TC-estimation.
△ Less
Submitted 22 February, 2023; v1 submitted 9 November, 2020;
originally announced November 2020.
-
MixKD: Towards Efficient Distillation of Large-scale Language Models
Authors:
Kevin J Liang,
Weituo Hao,
Dinghan Shen,
Yufan Zhou,
Weizhu Chen,
Changyou Chen,
Lawrence Carin
Abstract:
Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their applicability to low-resource (both memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such…
▽ More
Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their applicability to low-resource (both memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such big models. However, large-scale neural network systems are prone to memorize training instances, and thus tend to make inconsistent predictions when the data distribution is altered slightly. Moreover, the student model has few opportunities to request useful information from the teacher model when there is limited task-specific data available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability. Concretely, in addition to the original training examples, the student model is encouraged to mimic the teacher's behavior on the linear interpolation of example pairs as well. We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error. To verify its effectiveness, we conduct experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over the standard KD training, and outperforms several competitive baselines. Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
△ Less
Submitted 17 March, 2021; v1 submitted 1 November, 2020;
originally announced November 2020.
-
Counterfactual Representation Learning with Balancing Weights
Authors:
Serge Assaad,
Shuxi Zeng,
Chenyang Tao,
Shounak Datta,
Nikhil Mehta,
Ricardo Henao,
Fan Li,
Lawrence Carin
Abstract:
A key to causal inference with observational data is achieving balance in predictive features associated with each treatment type. Recent literature has explored representation learning to achieve this goal. In this work, we discuss the pitfalls of these strategies - such as a steep trade-off between achieving balance and predictive power - and present a remedy via the integration of balancing wei…
▽ More
A key to causal inference with observational data is achieving balance in predictive features associated with each treatment type. Recent literature has explored representation learning to achieve this goal. In this work, we discuss the pitfalls of these strategies - such as a steep trade-off between achieving balance and predictive power - and present a remedy via the integration of balancing weights in causal learning. Specifically, we theoretically link balance to the quality of propensity estimation, emphasize the importance of identifying a proper target population, and elaborate on the complementary roles of feature balancing and weight adjustments. Using these concepts, we then develop an algorithm for flexible, scalable and accurate estimation of causal effects. Finally, we show how the learned weighted representations may serve to facilitate alternative causal learning procedures with appealing statistical features. We conduct an extensive set of experiments on both synthetic examples and standard benchmarks, and report encouraging results relative to state-of-the-art baselines.
△ Less
Submitted 23 February, 2021; v1 submitted 23 October, 2020;
originally announced October 2020.
-
Double Robust Representation Learning for Counterfactual Prediction
Authors:
Shuxi Zeng,
Serge Assaad,
Chenyang Tao,
Shounak Datta,
Lawrence Carin,
Fan Li
Abstract:
Causal inference, or counterfactual prediction, is central to decision making in healthcare, policy and social sciences. To de-bias causal estimators with high-dimensional data in observational studies, recent advances suggest the importance of combining machine learning models for both the propensity score and the outcome function. We propose a novel scalable method to learn double-robust represe…
▽ More
Causal inference, or counterfactual prediction, is central to decision making in healthcare, policy and social sciences. To de-bias causal estimators with high-dimensional data in observational studies, recent advances suggest the importance of combining machine learning models for both the propensity score and the outcome function. We propose a novel scalable method to learn double-robust representations for counterfactual predictions, leading to consistent causal estimation if the model for either the propensity score or the outcome, but not necessarily both, is correctly specified. Specifically, we use the entropy balancing method to learn the weights that minimize the Jensen-Shannon divergence of the representation between the treated and control groups, based on which we make robust and efficient counterfactual predictions for both individual and average treatment effects. We provide theoretical justifications for the proposed method. The algorithm shows competitive performance with the state-of-the-art on real world and synthetic data.
△ Less
Submitted 16 October, 2020; v1 submitted 15 October, 2020;
originally announced October 2020.
-
RetiNerveNet: Using Recursive Deep Learning to Estimate Pointwise 24-2 Visual Field Data based on Retinal Structure
Authors:
Shounak Datta,
Eduardo B. Mariottoni,
David Dov,
Alessandro A. Jammal,
Lawrence Carin,
Felipe A. Medeiros
Abstract:
Glaucoma is the leading cause of irreversible blindness in the world, affecting over 70 million people. The cumbersome Standard Automated Perimetry (SAP) test is most frequently used to detect visual loss due to glaucoma. Due to the SAP test's innate difficulty and its high test-retest variability, we propose the RetiNerveNet, a deep convolutional recursive neural network for obtaining estimates o…
▽ More
Glaucoma is the leading cause of irreversible blindness in the world, affecting over 70 million people. The cumbersome Standard Automated Perimetry (SAP) test is most frequently used to detect visual loss due to glaucoma. Due to the SAP test's innate difficulty and its high test-retest variability, we propose the RetiNerveNet, a deep convolutional recursive neural network for obtaining estimates of the SAP visual field. RetiNerveNet uses information from the more objective Spectral-Domain Optical Coherence Tomography (SDOCT). RetiNerveNet attempts to trace-back the arcuate convergence of the retinal nerve fibers, starting from the Retinal Nerve Fiber Layer (RNFL) thickness around the optic disc, to estimate individual age-corrected 24-2 SAP values. Recursive passes through the proposed network sequentially yield estimates of the visual locations progressively farther from the optic disc. While all the methods used for our experiments exhibit lower performance for the advanced disease group, the proposed network is observed to be more accurate than all the baselines for estimating the individual visual field values. We further augment RetiNerveNet to additionally predict the SAP Mean Deviation values and also create an ensemble of RetiNerveNets that further improves the performance, by increasingly weighting-up underrepresented parts of the training data.
△ Less
Submitted 19 June, 2021; v1 submitted 14 October, 2020;
originally announced October 2020.
-
Improving Text Generation with Student-Forcing Optimal Transport
Authors:
Guoyin Wang,
Chunyuan Li,
Jianqiao Li,
Hao Fu,
Yuh-Chen Lin,
Liqun Chen,
Yizhe Zhang,
Chenyang Tao,
Ruiyi Zhang,
Wenlin Wang,
Dinghan Shen,
Qian Yang,
Lawrence Carin
Abstract:
Neural language models are often trained with maximum likelihood estimation (MLE), where the next word is generated conditioned on the ground-truth word tokens. During testing, however, the model is instead conditioned on previously generated tokens, resulting in what is termed exposure bias. To reduce this gap between training and testing, we propose using optimal transport (OT) to match the sequ…
▽ More
Neural language models are often trained with maximum likelihood estimation (MLE), where the next word is generated conditioned on the ground-truth word tokens. During testing, however, the model is instead conditioned on previously generated tokens, resulting in what is termed exposure bias. To reduce this gap between training and testing, we propose using optimal transport (OT) to match the sequences generated in these two modes. An extension is further proposed to improve the OT learning, based on the structural and contextual information of the text sequences. The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
△ Less
Submitted 12 October, 2020;
originally announced October 2020.
-
Background Adaptive Faster R-CNN for Semi-Supervised Convolutional Object Detection of Threats in X-Ray Images
Authors:
John B. Sigman,
Gregory P. Spell,
Kevin J Liang,
Lawrence Carin
Abstract:
Recently, progress has been made in the supervised training of Convolutional Object Detectors (e.g. Faster R-CNN) for threat recognition in carry-on luggage using X-ray images. This is part of the Transportation Security Administration's (TSA's) mission to protect air travelers in the United States. While more training data with threats may reliably improve performance for this class of deep algor…
▽ More
Recently, progress has been made in the supervised training of Convolutional Object Detectors (e.g. Faster R-CNN) for threat recognition in carry-on luggage using X-ray images. This is part of the Transportation Security Administration's (TSA's) mission to protect air travelers in the United States. While more training data with threats may reliably improve performance for this class of deep algorithm, it is expensive to stage in realistic contexts. By contrast, data from the real world can be collected quickly with minimal cost. In this paper, we present a semi-supervised approach for threat recognition which we call Background Adaptive Faster R-CNN. This approach is a training method for two-stage object detectors which uses Domain Adaptation methods from the field of deep learning. The data sources described earlier make two "domains": a hand-collected data domain of images with threats, and a real-world domain of images assumed without threats. Two domain discriminators, one for discriminating object proposals and one for image features, are adversarially trained to prevent encoding domain-specific information. Without this penalty a Convolutional Neural Network (CNN) can learn to identify domains based on superficial characteristics, and minimize a supervised loss function without improving its ability to recognize objects. For the hand-collected data, only object proposals and image features from backgrounds are used. The losses for these domain-adaptive discriminators are added to the Faster R-CNN losses of images from both domains. This can reduce threat detection false alarm rates by matching the statistics of extracted features from hand-collected backgrounds to real world data. Performance improvements are demonstrated on two independently-collected datasets of labeled threats.
△ Less
Submitted 2 October, 2020;
originally announced October 2020.
-
Weakly supervised cross-domain alignment with optimal transport
Authors:
Siyang Yuan,
Ke Bai,
Liqun Chen,
Yizhe Zhang,
Chenyang Tao,
Chunyuan Li,
Guoyin Wang,
Ricardo Henao,
Lawrence Carin
Abstract:
Cross-domain alignment between image objects and text sequences is key to many visual-language tasks, and it poses a fundamental challenge to both computer vision and natural language processing. This paper investigates a novel approach for the identification and optimization of fine-grained semantic similarities between image and text entities, under a weakly-supervised setup, improving performan…
▽ More
Cross-domain alignment between image objects and text sequences is key to many visual-language tasks, and it poses a fundamental challenge to both computer vision and natural language processing. This paper investigates a novel approach for the identification and optimization of fine-grained semantic similarities between image and text entities, under a weakly-supervised setup, improving performance over state-of-the-art solutions. Our method builds upon recent advances in optimal transport (OT) to resolve the cross-domain matching problem in a principled manner. Formulated as a drop-in regularizer, the proposed OT solution can be efficiently computed and used in combination with other existing approaches. We present empirical evidence to demonstrate the effectiveness of our approach, showing how it enables simpler model architectures to outperform or be comparable with more sophisticated designs on a range of vision-language tasks.
△ Less
Submitted 14 August, 2020;
originally announced August 2020.
-
WAFFLe: Weight Anonymized Factorization for Federated Learning
Authors:
Weituo Hao,
Nikhil Mehta,
Kevin J Liang,
Pengyu Cheng,
Mostafa El-Khamy,
Lawrence Carin
Abstract:
In domains where data are sensitive or private, there is great value in methods that can learn in a distributed manner without the data ever leaving the local devices. In light of this need, federated learning has emerged as a popular training paradigm. However, many federated learning approaches trade transmitting data for communicating updated weight parameters for each local device. Therefore,…
▽ More
In domains where data are sensitive or private, there is great value in methods that can learn in a distributed manner without the data ever leaving the local devices. In light of this need, federated learning has emerged as a popular training paradigm. However, many federated learning approaches trade transmitting data for communicating updated weight parameters for each local device. Therefore, a successful breach that would have otherwise directly compromised the data instead grants whitebox access to the local model, which opens the door to a number of attacks, including exposing the very data federated learning seeks to protect. Additionally, in distributed scenarios, individual client devices commonly exhibit high statistical heterogeneity. Many common federated approaches learn a single global model; while this may do well on average, performance degrades when the i.i.d. assumption is violated, underfitting individuals further from the mean, and raising questions of fairness. To address these issues, we propose Weight Anonymized Factorization for Federated Learning (WAFFLe), an approach that combines the Indian Buffet Process with a shared dictionary of weight factors for neural networks. Experiments on MNIST, FashionMNIST, and CIFAR-10 demonstrate WAFFLe's significant improvement to local test performance and fairness while simultaneously providing an extra layer of security.
△ Less
Submitted 13 August, 2020;
originally announced August 2020.
-
Bridging Maximum Likelihood and Adversarial Learning via $α$-Divergence
Authors:
Miaoyun Zhao,
Yulai Cong,
Shuyang Dai,
Lawrence Carin
Abstract:
Maximum likelihood (ML) and adversarial learning are two popular approaches for training generative models, and from many perspectives these techniques are complementary. ML learning encourages the capture of all data modes, and it is typically characterized by stable training. However, ML learning tends to distribute probability mass diffusely over the data space, $e.g.$, yielding blurry syntheti…
▽ More
Maximum likelihood (ML) and adversarial learning are two popular approaches for training generative models, and from many perspectives these techniques are complementary. ML learning encourages the capture of all data modes, and it is typically characterized by stable training. However, ML learning tends to distribute probability mass diffusely over the data space, $e.g.$, yielding blurry synthetic images. Adversarial learning is well known to synthesize highly realistic natural images, despite practical challenges like mode drop** and delicate training. We propose an $α$-Bridge to unify the advantages of ML and adversarial learning, enabling the smooth transfer from one to the other via the $α$-divergence. We reveal that generalizations of the $α$-Bridge are closely related to approaches developed recently to regularize adversarial learning, providing insights into that prior work, and further understanding of why the $α$-Bridge performs well in practice.
△ Less
Submitted 13 July, 2020;
originally announced July 2020.
-
Graph Optimal Transport for Cross-Domain Alignment
Authors:
Liqun Chen,
Zhe Gan,
Yu Cheng,
Linjie Li,
Lawrence Carin,
**g**g Liu
Abstract:
Cross-domain alignment between two sets of entities (e.g., objects in an image, words in a sentence) is fundamental to both computer vision and natural language processing. Existing methods mainly focus on designing advanced attention mechanisms to simulate soft alignment, with no training signals to explicitly encourage alignment. The learned attention matrices are also dense and lacks interpreta…
▽ More
Cross-domain alignment between two sets of entities (e.g., objects in an image, words in a sentence) is fundamental to both computer vision and natural language processing. Existing methods mainly focus on designing advanced attention mechanisms to simulate soft alignment, with no training signals to explicitly encourage alignment. The learned attention matrices are also dense and lacks interpretability. We propose Graph Optimal Transport (GOT), a principled framework that germinates from recent advances in Optimal Transport (OT). In GOT, cross-domain alignment is formulated as a graph matching problem, by representing entities into a dynamically-constructed graph. Two types of OT distances are considered: (i) Wasserstein distance (WD) for node (entity) matching; and (ii) Gromov-Wasserstein distance (GWD) for edge (structure) matching. Both WD and GWD can be incorporated into existing neural network models, effectively acting as a drop-in regularizer. The inferred transport plan also yields sparse and self-normalized alignment, enhancing the interpretability of the learned model. Experiments show consistent outperformance of GOT over baselines across a wide range of tasks, including image-text retrieval, visual question answering, image captioning, machine translation, and text summarization.
△ Less
Submitted 24 July, 2020; v1 submitted 25 June, 2020;
originally announced June 2020.
-
CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information
Authors:
Pengyu Cheng,
Weituo Hao,
Shuyang Dai,
Jiachang Liu,
Zhe Gan,
Lawrence Carin
Abstract:
Mutual information (MI) minimization has gained considerable interests in various machine learning tasks. However, estimating and minimizing MI in high-dimensional spaces remains a challenging problem, especially when only samples, rather than distribution forms, are accessible. Previous works mainly focus on MI lower bound approximation, which is not applicable to MI minimization problems. In thi…
▽ More
Mutual information (MI) minimization has gained considerable interests in various machine learning tasks. However, estimating and minimizing MI in high-dimensional spaces remains a challenging problem, especially when only samples, rather than distribution forms, are accessible. Previous works mainly focus on MI lower bound approximation, which is not applicable to MI minimization problems. In this paper, we propose a novel Contrastive Log-ratio Upper Bound (CLUB) of mutual information. We provide a theoretical analysis of the properties of CLUB and its variational approximation. Based on this upper bound, we introduce a MI minimization training scheme and further accelerate it with a negative sampling strategy. Simulation studies on Gaussian distributions show the reliable estimation ability of CLUB. Real-world MI minimization experiments, including domain adaptation and information bottleneck, demonstrate the effectiveness of the proposed method. The code is at https://github.com/Linear95/CLUB.
△ Less
Submitted 23 July, 2020; v1 submitted 22 June, 2020;
originally announced June 2020.
-
Students Need More Attention: BERT-based AttentionModel for Small Data with Application to AutomaticPatient Message Triage
Authors:
Shi**g Si,
Rui Wang,
Jedrek Wosik,
Hao Zhang,
David Dov,
Guoyin Wang,
Ricardo Henao,
Lawrence Carin
Abstract:
Small and imbalanced datasets commonly seen in healthcare represent a challenge when training classifiers based on deep learning models. So motivated, we propose a novel framework based on BioBERT (Bidirectional Encoder Representations from Transformers forBiomedical TextMining). Specifically, (i) we introduce Label Embeddings for Self-Attention in each layer of BERT, which we call LESA-BERT, and…
▽ More
Small and imbalanced datasets commonly seen in healthcare represent a challenge when training classifiers based on deep learning models. So motivated, we propose a novel framework based on BioBERT (Bidirectional Encoder Representations from Transformers forBiomedical TextMining). Specifically, (i) we introduce Label Embeddings for Self-Attention in each layer of BERT, which we call LESA-BERT, and (ii) by distilling LESA-BERT to smaller variants, we aim to reduce overfitting and model size when working on small datasets. As an application, our framework is utilized to build a model for patient portal message triage that classifies the urgency of a message into three categories: non-urgent, medium and urgent. Experiments demonstrate that our approach can outperform several strong baseline classifiers by a significant margin of 4.3% in terms of macro F1 score. The code for this project is publicly available at \url{https://github.com/shi**g001/text_classifiers}.
△ Less
Submitted 21 June, 2020;
originally announced June 2020.
-
GO Hessian for Expectation-Based Objectives
Authors:
Yulai Cong,
Miaoyun Zhao,
Jianqiao Li,
Junya Chen,
Lawrence Carin
Abstract:
An unbiased low-variance gradient estimator, termed GO gradient, was proposed recently for expectation-based objectives $\mathbb{E}_{q_{\boldsymbolγ}(\boldsymbol{y})} [f(\boldsymbol{y})]$, where the random variable (RV) $\boldsymbol{y}$ may be drawn from a stochastic computation graph with continuous (non-reparameterizable) internal nodes and continuous/discrete leaves. Upgrading the GO gradient,…
▽ More
An unbiased low-variance gradient estimator, termed GO gradient, was proposed recently for expectation-based objectives $\mathbb{E}_{q_{\boldsymbolγ}(\boldsymbol{y})} [f(\boldsymbol{y})]$, where the random variable (RV) $\boldsymbol{y}$ may be drawn from a stochastic computation graph with continuous (non-reparameterizable) internal nodes and continuous/discrete leaves. Upgrading the GO gradient, we present for $\mathbb{E}_{q_{\boldsymbol{\boldsymbolγ}}(\boldsymbol{y})} [f(\boldsymbol{y})]$ an unbiased low-variance Hessian estimator, named GO Hessian. Considering practical implementation, we reveal that GO Hessian is easy-to-use with auto-differentiation and Hessian-vector products, enabling efficient cheap exploitation of curvature information over stochastic computation graphs. As representative examples, we present the GO Hessian for non-reparameterizable gamma and negative binomial RVs/nodes. Based on the GO Hessian, we design a new second-order method for $\mathbb{E}_{q_{\boldsymbol{\boldsymbolγ}}(\boldsymbol{y})} [f(\boldsymbol{y})]$, with rigorous experiments conducted to verify its effectiveness and efficiency.
△ Less
Submitted 15 June, 2020;
originally announced June 2020.
-
Enabling Counterfactual Survival Analysis with Balanced Representations
Authors:
Paidamoyo Chapfuwa,
Serge Assaad,
Shuxi Zeng,
Michael J. Pencina,
Lawrence Carin,
Ricardo Henao
Abstract:
Balanced representation learning methods have been applied successfully to counterfactual inference from observational data. However, approaches that account for survival outcomes are relatively limited. Survival data are frequently encountered across diverse medical applications, i.e., drug development, risk profiling, and clinical trials, and such data are also relevant in fields like manufactur…
▽ More
Balanced representation learning methods have been applied successfully to counterfactual inference from observational data. However, approaches that account for survival outcomes are relatively limited. Survival data are frequently encountered across diverse medical applications, i.e., drug development, risk profiling, and clinical trials, and such data are also relevant in fields like manufacturing (e.g., for equipment monitoring). When the outcome of interest is a time-to-event, special precautions for handling censored events need to be taken, as ignoring censored outcomes may lead to biased estimates. We propose a theoretically grounded unified framework for counterfactual inference applicable to survival outcomes. Further, we formulate a nonparametric hazard ratio metric for evaluating average and individualized treatment effects. Experimental results on real-world and semi-synthetic datasets, the latter of which we introduce, demonstrate that the proposed approach significantly outperforms competitive alternatives in both survival-outcome prediction and treatment-effect estimation.
△ Less
Submitted 3 March, 2021; v1 submitted 13 June, 2020;
originally announced June 2020.
-
GAN Memory with No Forgetting
Authors:
Yulai Cong,
Miaoyun Zhao,
Jianqiao Li,
Sijia Wang,
Lawrence Carin
Abstract:
As a fundamental issue in lifelong learning, catastrophic forgetting is directly caused by inaccessible historical data; accordingly, if the data (information) were memorized perfectly, no forgetting should be expected. Motivated by that, we propose a GAN memory for lifelong learning, which is capable of remembering a stream of datasets via generative processes, with \emph{no} forgetting. Our GAN…
▽ More
As a fundamental issue in lifelong learning, catastrophic forgetting is directly caused by inaccessible historical data; accordingly, if the data (information) were memorized perfectly, no forgetting should be expected. Motivated by that, we propose a GAN memory for lifelong learning, which is capable of remembering a stream of datasets via generative processes, with \emph{no} forgetting. Our GAN memory is based on recognizing that one can modulate the "style" of a GAN model to form perceptually-distant targeted generation. Accordingly, we propose to do sequential style modulations atop a well-behaved base GAN model, to form sequential targeted generative models, while simultaneously benefiting from the transferred base knowledge. The GAN memory -- that is motivated by lifelong learning -- is therefore itself manifested by a form of lifelong learning, via forward transfer and modulation of information from prior tasks. Experiments demonstrate the superiority of our method over existing approaches and its effectiveness in alleviating catastrophic forgetting for lifelong classification problems. Code is available at https://github.com/MiaoyunZhao/GANmemory_LifelongLearning.
△ Less
Submitted 12 November, 2020; v1 submitted 12 June, 2020;
originally announced June 2020.
-
Scalable Control Variates for Monte Carlo Methods via Stochastic Optimization
Authors:
Shi**g Si,
Chris. J. Oates,
Andrew B. Duncan,
Lawrence Carin,
François-Xavier Briol
Abstract:
Control variates are a well-established tool to reduce the variance of Monte Carlo estimators. However, for large-scale problems including high-dimensional and large-sample settings, their advantages can be outweighed by a substantial computational cost. This paper considers control variates based on Stein operators, presenting a framework that encompasses and generalizes existing approaches that…
▽ More
Control variates are a well-established tool to reduce the variance of Monte Carlo estimators. However, for large-scale problems including high-dimensional and large-sample settings, their advantages can be outweighed by a substantial computational cost. This paper considers control variates based on Stein operators, presenting a framework that encompasses and generalizes existing approaches that use polynomials, kernels and neural networks. A learning strategy based on minimising a variational objective through stochastic optimization is proposed, leading to scalable and effective control variates. Novel theoretical results are presented to provide insight into the variance reduction that can be achieved, and an empirical assessment, including applications to Bayesian inference, is provided in support.
△ Less
Submitted 21 July, 2021; v1 submitted 12 June, 2020;
originally announced June 2020.
-
Hierarchical Optimal Transport for Robust Multi-View Learning
Authors:
Dixin Luo,
Hongteng Xu,
Lawrence Carin
Abstract:
Traditional multi-view learning methods often rely on two assumptions: ($i$) the samples in different views are well-aligned, and ($ii$) their representations in latent space obey the same distribution. Unfortunately, these two assumptions may be questionable in practice, which limits the application of multi-view learning. In this work, we propose a hierarchical optimal transport (HOT) method to…
▽ More
Traditional multi-view learning methods often rely on two assumptions: ($i$) the samples in different views are well-aligned, and ($ii$) their representations in latent space obey the same distribution. Unfortunately, these two assumptions may be questionable in practice, which limits the application of multi-view learning. In this work, we propose a hierarchical optimal transport (HOT) method to mitigate the dependency on these two assumptions. Given unaligned multi-view data, the HOT method penalizes the sliced Wasserstein distance between the distributions of different views. These sliced Wasserstein distances are used as the ground distance to calculate the entropic optimal transport across different views, which explicitly indicates the clustering structure of the views. The HOT method is applicable to both unsupervised and semi-supervised learning, and experimental results show that it performs robustly on both synthetic and real-world tasks.
△ Less
Submitted 8 June, 2020; v1 submitted 4 June, 2020;
originally announced June 2020.
-
Towards Understanding Fast Adversarial Training
Authors:
Bai Li,
Shiqi Wang,
Suman Jana,
Lawrence Carin
Abstract:
Current neural-network-based classifiers are susceptible to adversarial examples. The most empirically successful approach to defending against such adversarial examples is adversarial training, which incorporates a strong self-attack during training to enhance its robustness. This approach, however, is computationally expensive and hence is hard to scale up. A recent work, called fast adversarial…
▽ More
Current neural-network-based classifiers are susceptible to adversarial examples. The most empirically successful approach to defending against such adversarial examples is adversarial training, which incorporates a strong self-attack during training to enhance its robustness. This approach, however, is computationally expensive and hence is hard to scale up. A recent work, called fast adversarial training, has shown that it is possible to markedly reduce computation time without sacrificing significant performance. This approach incorporates simple self-attacks, yet it can only run for a limited number of training epochs, resulting in sub-optimal performance. In this paper, we conduct experiments to understand the behavior of fast adversarial training and show the key to its success is the ability to recover from overfitting to weak attacks. We then extend our findings to improve fast adversarial training, demonstrating superior robust accuracy to strong adversarial training, with much-reduced training time.
△ Less
Submitted 4 June, 2020;
originally announced June 2020.