-
LibAUC: A Deep Learning Library for X-Risk Optimization
Authors:
Zhuoning Yuan,
Dixian Zhu,
Zi-Hao Qiu,
Gang Li,
Xuanhui Wang,
Tianbao Yang
Abstract:
This paper introduces the award-winning deep learning (DL) library called LibAUC for implementing state-of-the-art algorithms towards optimizing a family of risk functions named X-risks. X-risks refer to a family of compositional functions in which the loss function of each data point is defined in a way that contrasts the data point with a large number of others. They have broad applications in AI for solving classical and emerging problems, including but not limited to classification for imbalanced data (CID), learning to rank (LTR), and contrastive learning of representations (CLR). The motivation of developing LibAUC is to address the convergence issues of existing libraries for solving these problems. In particular, existing libraries may not converge or require very large mini-batch sizes in order to attain good performance for these problems, due to the usage of the standard mini-batch technique in the empirical risk minimization (ERM) framework. Our library is for deep X-risk optimization (DXO) that has achieved great success in solving a variety of tasks for CID, LTR and CLR. The contributions of this paper include: (1) It introduces a new mini-batch based pipeline for implementing DXO algorithms, which differs from the existing DL pipeline in the design of controlled data samplers and dynamic mini-batch losses; (2) It provides extensive benchmarking experiments for ablation studies and comparison with existing libraries. The LibAUC library features scalable performance for millions of items to be contrasted, faster and better convergence than existing libraries for optimizing X-risks, seamless PyTorch deployment and versatile APIs for various loss optimization. Our library is available to the open source community at https://github.com/Optimization-AI/LibAUC, to facilitate further academic research and industrial applications.
Submitted 5 June, 2023;
originally announced June 2023.
-
Not All Semantics are Created Equal: Contrastive Self-supervised Learning with Automatic Temperature Individualization
Authors:
Zi-Hao Qiu,
Quanqi Hu,
Zhuoning Yuan,
Denny Zhou,
Lijun Zhang,
Tianbao Yang
Abstract:
In this paper, we aim to optimize a contrastive loss with individualized temperatures in a principled and systematic manner for self-supervised learning. The common practice of using a global temperature parameter $τ$ ignores the fact that ``not all semantics are created equal", meaning that different anchor data may have different numbers of samples with similar semantics, especially when data exhibits long-tails. First, we propose a new robust contrastive loss inspired by distributionally robust optimization (DRO), providing us an intuition about the effect of $τ$ and a mechanism for automatic temperature individualization. Then, we propose an efficient stochastic algorithm for optimizing the robust contrastive loss with a provable convergence guarantee without using large mini-batch sizes. Theoretical and experimental results show that our algorithm automatically learns a suitable $τ$ for each sample. Specifically, samples with frequent semantics use large temperatures to keep local semantic structures, while samples with rare semantics use small temperatures to induce more separable features. Our method not only outperforms prior strong baselines (e.g., SimCLR, CLIP) on unimodal and bimodal datasets with larger improvements on imbalanced data but also is less sensitive to hyper-parameters. To our best knowledge, this is the first methodical approach to optimizing a contrastive loss with individualized temperatures.
Submitted 19 May, 2023;
originally announced May 2023.
-
Robust Causal Learning for the Estimation of Average Treatment Effects
Authors:
Yiyan Huang,
Cheuk Hang Leung,
Xing Yan,
Qi Wu,
Shumin Ma,
Zhiri Yuan,
Dongdong Wang,
Zhixiang Huang
Abstract:
Many practical decision-making problems in economics and healthcare seek to estimate the average treatment effect (ATE) from observational data. The Double/Debiased Machine Learning (DML) is one of the prevalent methods to estimate ATE in the observational study. However, the DML estimators can suffer an error-compounding issue and even give an extreme estimate when the propensity scores are misspecified or very close to 0 or 1. Previous studies have overcome this issue through some empirical tricks such as propensity score trimming, yet none of the existing literature solves this problem from a theoretical standpoint. In this paper, we propose a Robust Causal Learning (RCL) method to offset the deficiencies of the DML estimators. Theoretically, the RCL estimators i) are as consistent and doubly robust as the DML estimators, and ii) can get rid of the error-compounding issue. Empirically, the comprehensive experiments show that i) the RCL estimators give more stable estimations of the causal parameters than the DML estimators, and ii) the RCL estimators outperform the traditional estimators and their variants when applying different machine learning models on both simulation and benchmark datasets.
Submitted 5 September, 2022;
originally announced September 2022.
-
An Improved Bernstein-type Inequality for C-Mixing-type Processes and Its Application to Kernel Smoothing
Authors:
Zihao Yuan,
Martin Spindler
Abstract:
There are many processes, particularly dynamic systems, that cannot be described as strong mixing processes. \citet{maume2006exponential} introduced a new mixing coefficient called C-mixing, which includes a large class of dynamic systems. Based on this, \citet{hang2017bernstein} obtained a Bernstein-type inequality for a geometric C-mixing process, which, modulo a logarithmic factor and some constants, coincides with the standard result for the iid case. In order to honor this pioneering work, we conduct follow-up research in this paper and obtain an improved result under more general preconditions. We allow for a weaker requirement on the semi-norm condition, full non-stationarity, and non-isotropic sampling behavior. Our result covers the case in which the index set of processes lies in $\mathbf{Z}^{d+}$ for any given positive integer $d$. Here $\mathbf{Z}^{d+}$ denotes the collection of all nonnegative integer-valued $d$-dimensional vectors. This setting of the index set takes both time and spatial data into consideration. For our application, we investigate the theoretical guarantee of multiple kernel-based nonparametric curve estimators for C-Mixing-type processes. More specifically, we first obtain the $L^{\infty}$-convergence rate of the kernel density estimator and then discuss the attainability of optimality, which can also be regarded as an update of the result of \citet{hang2018kernel}. Furthermore, we investigate the uniform convergence of the kernel-based estimators of the conditional mean and variance function in a heteroscedastic nonparametric regression model. Under a mild smoothing condition, these estimators are optimal. Finally, we obtain the uniform convergence rate of the conditional mode function.
Submitted 7 October, 2022; v1 submitted 24 August, 2022;
originally announced August 2022.
-
Bernstein-type Inequalities and Nonparametric Estimation under Near-Epoch Dependence
Authors:
Zihao Yuan,
Martin Spindler
Abstract:
The major contributions of this paper lie in two aspects. Firstly, we focus on deriving Bernstein-type inequalities for both geometric and algebraic irregularly-spaced NED random fields, which contain time series as a special case. Furthermore, by introducing the idea of "effective dimension" to the index set of the random field, our results reflect that the sharpness of the inequalities is associated only with this "effective dimension". To the best of our knowledge, our paper may be the first one reflecting this phenomenon. Hence, the first contribution of this paper can be more or less regarded as an update of the pioneering work from \citeA{xu2018sieve}. Additionally, as a corollary of our first contribution, a Bernstein-type inequality for geometric irregularly-spaced $α$-mixing random fields is also obtained. The second aspect of our contributions is that, based on the inequalities mentioned above, we show the $L_{\infty}$ convergence rate of many interesting kernel-based nonparametric estimators. To do this, two deviation inequalities for the supremum of the empirical process are derived under NED and $α$-mixing conditions respectively. Then, for irregularly-spaced NED random fields, we prove the attainability of the optimal rate for the local linear estimator of nonparametric regression, which refreshes another pioneering work on this topic, \citeA{jenish2012nonparametric}. Subsequently, we analyze the uniform convergence rate of uni-modal regression under the same NED conditions as well. Furthermore, by following the guide of \citeA{rigollet2009optimal}, we also prove that the kernel-based plug-in density level set estimator could be optimal up to a logarithmic factor. Meanwhile, when the data is collected from $α$-mixing random fields, we also derive the uniform convergence rate of a simple local polynomial density estimator \cite{cattaneo2020simple}.
Submitted 17 October, 2022; v1 submitted 24 August, 2022;
originally announced August 2022.
-
Provable Stochastic Optimization for Global Contrastive Learning: Small Batch Does Not Harm Performance
Authors:
Zhuoning Yuan,
Yuexin Wu,
Zi-Hao Qiu,
Xianzhi Du,
Lijun Zhang,
Denny Zhou,
Tianbao Yang
Abstract:
In this paper, we study contrastive learning from an optimization perspective, aiming to analyze and address a fundamental issue of existing contrastive learning methods that either rely on a large batch size or a large dictionary of feature vectors. We consider a global objective for contrastive learning, which contrasts each positive pair with all negative pairs for an anchor point. From the optimization perspective, we explain why existing methods such as SimCLR require a large batch size in order to achieve a satisfactory result. In order to remove such requirement, we propose a memory-efficient Stochastic Optimization algorithm for solving the Global objective of Contrastive Learning of Representations, named SogCLR. We show that its optimization error is negligible under a reasonable condition after a sufficient number of iterations or is diminishing for a slightly different global contrastive objective. Empirically, we demonstrate that SogCLR with small batch size (e.g., 256) can achieve similar performance as SimCLR with large batch size (e.g., 8192) on self-supervised learning task on ImageNet-1K. We also attempt to show that the proposed optimization technique is generic and can be applied to solving other contrastive losses, e.g., two-way contrastive losses for bimodal contrastive learning. The proposed method is implemented in our open-sourced library LibAUC (www.libauc.org).
Submitted 20 September, 2022; v1 submitted 24 February, 2022;
originally announced February 2022.
-
Federated Deep AUC Maximization for Heterogeneous Data with a Constant Communication Complexity
Authors:
Zhuoning Yuan,
Zhishuai Guo,
Yi Xu,
Yiming Ying,
Tianbao Yang
Abstract:
Deep AUC (area under the ROC curve) Maximization (DAM) has attracted much attention recently due to its great potential for imbalanced data classification. However, the research on Federated Deep AUC Maximization (FDAM) is still limited. Compared with standard federated learning (FL) approaches that focus on decomposable minimization objectives, FDAM is more complicated because its minimization objective is non-decomposable over individual examples. In this paper, we propose improved FDAM algorithms for heterogeneous data by solving the popular non-convex strongly-concave min-max formulation of DAM in a distributed fashion, which can also be applied to a class of non-convex strongly-concave min-max problems. A striking result of this paper is that the communication complexity of the proposed algorithm is a constant independent of the number of machines and also independent of the accuracy level, which improves an existing result by orders of magnitude. The experiments have demonstrated the effectiveness of our FDAM algorithm on benchmark datasets, and on medical chest X-ray images from different organizations. Our experiment shows that the performance of FDAM using data from multiple hospitals can improve the AUC score on testing data from a single hospital for detecting life-threatening diseases based on chest radiographs. The proposed method is implemented in our open-sourced library LibAUC (www.libauc.org) whose GitHub address is https://github.com/Optimization-AI/ICML2021_FedDeepAUC_CODASCA.
Submitted 13 September, 2021; v1 submitted 8 February, 2021;
originally announced February 2021.
-
Large-scale Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification
Authors:
Zhuoning Yuan,
Yan Yan,
Milan Sonka,
Tianbao Yang
Abstract:
Deep AUC Maximization (DAM) is a new paradigm for learning a deep neural network by maximizing the AUC score of the model on a dataset. Most previous works of AUC maximization focus on the perspective of optimization by designing efficient stochastic algorithms, and studies on generalization performance of large-scale DAM on difficult tasks are missing. In this work, we aim to make DAM more practical for interesting real-world applications (e.g., medical image classification). First, we propose a new margin-based min-max surrogate loss function for the AUC score (named as AUC min-max-margin loss or simply AUC margin loss for short). It is more robust than the commonly used AUC square loss, while enjoying the same advantage in terms of large-scale stochastic optimization. Second, we conduct extensive empirical studies of our DAM method on four difficult medical image classification tasks, namely (i) classification of chest x-ray images for identifying many threatening diseases, (ii) classification of images of skin lesions for identifying melanoma, (iii) classification of mammogram for breast cancer screening, and (iv) classification of microscopic images for identifying tumor tissue. Our studies demonstrate that the proposed DAM method improves the performance of optimizing cross-entropy loss by a large margin, and also achieves better performance than optimizing the existing AUC square loss on these medical image classification tasks. Specifically, our DAM method has achieved the 1st place on Stanford CheXpert competition on Aug. 31, 2020. To the best of our knowledge, this is the first work that makes DAM succeed on large-scale medical image datasets. We also conduct extensive ablation studies to demonstrate the advantages of the new AUC margin loss over the AUC square loss on benchmark datasets. The proposed method is implemented in our open-sourced library LibAUC (www.libauc.org).
Submitted 7 September, 2021; v1 submitted 5 December, 2020;
originally announced December 2020.
-
Review of Machine-Learning Methods for RNA Secondary Structure Prediction
Authors:
Qi Zhao,
Zheng Zhao,
Xiaoya Fan,
Zhengwei Yuan,
Qian Mao,
Yudong Yao
Abstract:
Secondary structure plays an important role in determining the function of non-coding RNAs. Hence, identifying RNA secondary structures is of great value to research. Computational prediction is a mainstream approach for predicting RNA secondary structure. Unfortunately, even though new methods have been proposed over the past 40 years, the performance of computational prediction methods has stagnated in the last decade. Recently, with the increasing availability of RNA structure data, new methods based on machine-learning technologies, especially deep learning, have alleviated the issue. In this review, we provide a comprehensive overview of RNA secondary structure prediction methods based on machine-learning technologies and a tabularized summary of the most important methods in this field. The current pending issues in the field of RNA secondary structure prediction and future trends are also discussed.
Submitted 31 August, 2020;
originally announced September 2020.
-
The Devil is the Classifier: Investigating Long Tail Relation Classification with Decoupling Analysis
Authors:
Haiyang Yu,
Ningyu Zhang,
Shumin Deng,
Zonggang Yuan,
Yantao Jia,
Huajun Chen
Abstract:
Long-tailed relation classification is a challenging problem as the head classes may dominate the training phase, thereby leading to the deterioration of the tail performance. Existing solutions usually address this issue via class-balancing strategies, e.g., data re-sampling and loss re-weighting, but all these methods adhere to the schema of entangling learning of the representation and classifier. In this study, we conduct an in-depth empirical investigation into the long-tailed problem and find that pre-trained models with instance-balanced sampling already capture the well-learned representations for all classes; moreover, it is possible to achieve better long-tailed classification ability at low cost by only adjusting the classifier. Inspired by this observation, we propose a robust classifier with attentive relation routing, which assigns soft weights by automatically aggregating the relations. Extensive experiments on two datasets demonstrate the effectiveness of our proposed approach. Code and datasets are available at https://github.com/zjunlp/deepke.
Submitted 15 September, 2020;
originally announced September 2020.
-
Fast Objective & Duality Gap Convergence for Non-Convex Strongly-Concave Min-Max Problems with PL Condition
Authors:
Zhishuai Guo,
Yan Yan,
Zhuoning Yuan,
Tianbao Yang
Abstract:
This paper focuses on stochastic methods for solving smooth non-convex strongly-concave min-max problems, which have received increasing attention due to their potential applications in deep learning (e.g., deep AUC maximization, distributionally robust optimization). However, most of the existing algorithms are slow in practice, and their analysis revolves around the convergence to a nearly stationary point. We consider leveraging the Polyak-Lojasiewicz (PL) condition to design faster stochastic algorithms with stronger convergence guarantees. Although the PL condition has been utilized for designing many stochastic minimization algorithms, its applications to non-convex min-max optimization remain rare. In this paper, we propose and analyze a generic framework of proximal stage-based method with many well-known stochastic updates embeddable. Fast convergence is established in terms of both the primal objective gap and the duality gap. Compared with existing studies, (i) our analysis is based on a novel Lyapunov function consisting of the primal objective gap and the duality gap of a regularized function, and (ii) the results are more comprehensive with improved rates that have better dependence on the condition number under different assumptions. We also conduct deep and non-deep learning experiments to verify the effectiveness of our methods.
Submitted 17 April, 2023; v1 submitted 11 June, 2020;
originally announced June 2020.
-
Hybrid-DNNs: Hybrid Deep Neural Networks for Mixed Inputs
Authors:
Zhenyu Yuan,
Yuxin Jiang,
Jingjing Li,
Handong Huang
Abstract:
Rapid development of big data and high-performance computing has encouraged explosive studies of deep learning in geoscience. However, most studies only take single-type data as input, frittering away invaluable multisource, multi-scale information. We develop a general architecture of hybrid deep neural networks (HDNNs) to support mixed inputs. Regarded as a combination of feature learning and target learning, the newly proposed networks provide great capacity in high-hierarchy feature extraction and in-depth data mining. Furthermore, the hybrid architecture is an aggregation of multiple networks, demonstrating good flexibility and wide applicability. The configuration of multiple networks depends on application tasks and varies with inputs and targets. Concentrating on reservoir production prediction, a specific HDNN model is configured and applied to an oil development block. Considering their contributions to hydrocarbon production, core photos, logging images and curves, geologic and engineering parameters can all be taken as inputs. After preprocessing, the mixed inputs are prepared as regular-sampled structural and numerical data. For feature learning, convolutional neural networks (CNN) and a multilayer perceptron (MLP) network are configured to separately process structural and numerical inputs. Learned features are then concatenated and fed to subsequent networks for target learning. Comparison with a typical MLP model and CNN model highlights the superiority of the proposed HDNN model with high accuracy and good generalization.
Submitted 17 May, 2020;
originally announced May 2020.
-
Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks
Authors:
Zhishuai Guo,
Mingrui Liu,
Zhuoning Yuan,
Li Shen,
Wei Liu,
Tianbao Yang
Abstract:
In this paper, we study distributed algorithms for large-scale AUC maximization with a deep neural network as a predictive model. Although distributed learning techniques have been investigated extensively in deep learning, they are not directly applicable to stochastic AUC maximization with deep neural networks due to its striking differences from standard loss minimization problems (e.g., cross-entropy). Towards addressing this challenge, we propose and analyze a communication-efficient distributed optimization algorithm based on a {\it non-convex concave} reformulation of the AUC maximization, in which the communication of both the primal variable and the dual variable between each worker and the parameter server only occurs after multiple steps of gradient-based updates in each worker. Compared with the naive parallel version of an existing algorithm that computes stochastic gradients at individual machines and averages them for updating the model parameters, our algorithm requires far fewer communication rounds and still achieves a linear speedup in theory. To the best of our knowledge, this is the \textbf{first} work that solves the {\it non-convex concave min-max} problem for AUC maximization with deep neural networks in a communication-efficient distributed manner while still maintaining the linear speedup property in theory. Our experiments on several benchmark datasets show the effectiveness of our algorithm and also confirm our theory.
Submitted 8 October, 2020; v1 submitted 5 May, 2020;
originally announced May 2020.
-
A flexible method for estimating luminosity functions via Kernel Density Estimation
Authors:
Zunli Yuan,
Matt J. Jarvis,
Jiancheng Wang
Abstract:
We propose a flexible method for estimating luminosity functions (LFs) based on kernel density estimation (KDE), the most popular nonparametric density estimation approach developed in modern statistics, to overcome issues surrounding binning of LFs. One challenge in applying KDE to LFs is how to treat the boundary bias problem, since astronomical surveys usually obtain truncated samples predominantly due to the flux-density limits of surveys. We use two solutions, the transformation KDE method ($\hatφ_{\mathrm{t}}$), and the transformation-reflection KDE method ($\hatφ_{\mathrm{tr}}$) to reduce the boundary bias. We develop a new likelihood cross-validation criterion for selecting optimal bandwidths, based on which, the posterior probability distribution of bandwidth and transformation parameters for $\hatφ_{\mathrm{t}}$ and $\hatφ_{\mathrm{tr}}$ are derived within a Markov chain Monte Carlo (MCMC) sampling procedure. The simulation result shows that $\hatφ_{\mathrm{t}}$ and $\hatφ_{\mathrm{tr}}$ perform better than the traditional binned method, especially in the sparse data regime around the flux-limit of a survey or at the bright-end of the LF. To further improve the performance of our KDE methods, we develop the transformation-reflection adaptive KDE approach ($\hatφ_{\mathrm{tra}}$). Monte Carlo simulations suggest that it has a good stability and reliability in performance, and is around an order of magnitude more accurate than using the binned method. By applying our adaptive KDE method to a quasar sample, we find that it achieves estimates comparable to the rigorous determination by a previous work, while making far fewer assumptions about the LF. The KDE method we develop has the advantages of both parametric and non-parametric methods.
Submitted 30 April, 2020; v1 submitted 30 March, 2020;
originally announced March 2020.
-
Modular Deep Reinforcement Learning with Temporal Logic Specifications
Authors:
Lim Zun Yuan,
Mohammadhosein Hasanbeig,
Alessandro Abate,
Daniel Kroening
Abstract:
We propose an actor-critic, model-free, and online Reinforcement Learning (RL) framework for continuous-state continuous-action Markov Decision Processes (MDPs) when the reward is highly sparse but encompasses a high-level temporal structure. We represent this temporal structure by a finite-state machine and construct an on-the-fly synchronised product with the MDP and the finite machine. The temporal structure acts as a guide for the RL agent within the product, where a modular Deep Deterministic Policy Gradient (DDPG) architecture is proposed to generate a low-level control policy. We evaluate our framework in a Mars rover experiment and we present the success rate of the synthesised policy.
Submitted 22 November, 2019; v1 submitted 23 September, 2019;
originally announced September 2019.
-
Stochastic AUC Maximization with Deep Neural Networks
Authors:
Mingrui Liu,
Zhuoning Yuan,
Yiming Ying,
Tianbao Yang
Abstract:
Stochastic AUC maximization has garnered increasing interest due to its better fit to imbalanced data classification. However, existing works are limited to stochastic AUC maximization with a linear predictive model, which restricts its predictive power when dealing with extremely complex data. In this paper, we consider the stochastic AUC maximization problem with a deep neural network as the predictive model. Building on the saddle point reformulation of a surrogate loss of AUC, the problem can be cast into a non-convex concave min-max problem. The main contribution of this paper is to make stochastic AUC maximization more practical for deep neural networks and big data, with theoretical insights as well. In particular, we propose to explore the Polyak-Łojasiewicz (PL) condition, which has been proved and observed in deep learning and which enables us to develop new stochastic algorithms with an even faster convergence rate and a more practical step size scheme. An AdaGrad-style algorithm is also analyzed under the PL condition with an adaptive convergence rate. Our experimental results demonstrate the effectiveness of the proposed algorithms.
Submitted 29 June, 2020; v1 submitted 28 August, 2019;
originally announced August 2019.
-
Multi-Kernel Correntropy for Robust Learning
Authors:
Badong Chen,
Yuqing Xie,
Xin Wang,
Zejian Yuan,
Pengju Ren,
Jing Qin
Abstract:
As a novel similarity measure that is defined as the expectation of a kernel function between two random variables, correntropy has been successfully applied in robust machine learning and signal processing to combat large outliers. The kernel function in correntropy is usually a zero-mean Gaussian kernel. In a recent work, the concept of mixture correntropy (MC) was proposed to improve the learning performance, where the kernel function is a mixture Gaussian kernel, namely a linear combination of several zero-mean Gaussian kernels with different widths. In both correntropy and mixture correntropy, the center of the kernel function is, however, always located at zero. In the present work, to further improve the learning performance, we propose the concept of multi-kernel correntropy (MKC), in which each component of the mixture Gaussian kernel can be centered at a different location. The properties of the MKC are investigated and an efficient approach is proposed to determine the free parameters in MKC. Experimental results show that the learning algorithms under the maximum multi-kernel correntropy criterion (MMKCC) can outperform those under the original maximum correntropy criterion (MCC) and the maximum mixture correntropy criterion (MMCC).
Submitted 5 September, 2021; v1 submitted 24 May, 2019;
originally announced May 2019.
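The idea in the abstract above, a mixture of Gaussian kernels whose components may be centered away from zero, can be sketched as a sample estimate. This is a minimal illustration and not the paper's implementation: the function names, the unnormalized Gaussian kernel, and the choice of applying each kernel to the error `x - y` shifted by its center are assumptions made here.

```python
import math


def gaussian(e, sigma):
    """Unnormalized Gaussian kernel (an assumption; peak value is 1)."""
    return math.exp(-(e * e) / (2.0 * sigma * sigma))


def multi_kernel_correntropy(x, y, weights, centers, sigmas):
    """Sample mean of a weighted sum of Gaussian kernels applied to the
    error e = x_i - y_i, with component k centered at centers[k].

    With a single component centered at 0 this reduces to ordinary
    correntropy; with several components all centered at 0 it reduces
    to mixture correntropy.
    """
    if not (len(weights) == len(centers) == len(sigmas)):
        raise ValueError("weights, centers and sigmas must have equal length")
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    total = 0.0
    for xi, yi in zip(x, y):
        e = xi - yi
        total += sum(
            w * gaussian(e - c, s)
            for w, c, s in zip(weights, centers, sigmas)
        )
    return total / len(x)


# Hypothetical data: errors cluster around 0 and around 2.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.0, 1.0, 2.0]
zero_centered = multi_kernel_correntropy(x, y, [1.0], [0.0], [1.0])
two_centers = multi_kernel_correntropy(x, y, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0])
```

Because half of the hypothetical errors sit near 2, a single zero-centered kernel scores them as near-outliers, whereas the two-center variant can assign them high similarity through its second component.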
-
Stagewise Training Accelerates Convergence of Testing Error Over SGD
Authors:
Zhuoning Yuan,
Yan Yan,
Rong Jin,
Tianbao Yang
Abstract:
The stagewise training strategy is widely used for learning neural networks: it runs a stochastic algorithm (e.g., SGD) starting with a relatively large step size (aka learning rate) and geometrically decreases the step size after a number of iterations. It has been observed that stagewise SGD converges much faster than vanilla SGD with a polynomially decaying step size in terms of both training error and testing error. But how to explain this phenomenon has been largely ignored by existing studies. This paper provides some theoretical evidence for explaining this faster convergence. In particular, we consider a stagewise training strategy for minimizing empirical risk that satisfies the Polyak-Łojasiewicz (PL) condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions. For convex loss functions and two classes of "nice-behaviored" non-convex objectives that are close to a convex function, we establish faster convergence of stagewise training than vanilla SGD under the PL condition on both training error and testing error. Experiments on stagewise learning of deep residual networks show that it satisfies one type of non-convexity assumption and can therefore be explained by our theory. Of independent interest, the testing error bounds for the considered non-convex loss functions are dimensionality and norm independent.
Submitted 2 February, 2019; v1 submitted 10 December, 2018;
originally announced December 2018.
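The stagewise step size scheme described in the abstract above (a large initial step size, decreased geometrically after a fixed number of iterations) can be sketched as follows. This is a toy illustration on a one-dimensional quadratic, not the paper's algorithm or experimental setup; the function names, the decay factor, the stage length, and the Gaussian gradient noise are all assumptions.

```python
import random


def stagewise_step_sizes(initial_step, decay, iters_per_stage, num_stages):
    """Yield one step size per iteration: constant within a stage and
    multiplied by `decay` at the start of each new stage."""
    for stage in range(num_stages):
        step = initial_step * (decay ** stage)
        for _ in range(iters_per_stage):
            yield step


def stagewise_sgd(grad, w0, schedule, noise_scale=0.1, seed=0):
    """Run SGD on a scalar parameter, adding Gaussian noise to the exact
    gradient to mimic stochastic gradients."""
    rng = random.Random(seed)
    w = w0
    for step in schedule:
        noisy_grad = grad(w) + rng.gauss(0.0, noise_scale)
        w -= step * noisy_grad
    return w


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
schedule = stagewise_step_sizes(
    initial_step=0.2, decay=0.5, iters_per_stage=50, num_stages=4
)
w_final = stagewise_sgd(lambda w: 2.0 * (w - 3.0), w0=0.0, schedule=schedule)
```

The large early step sizes move the iterate quickly toward the minimizer, and the smaller later step sizes shrink the noise-induced fluctuation around it.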
-
Automatic Graphics Program Generation using Attention-Based Hierarchical Decoder
Authors:
Zhihao Zhu,
Zhan Xue,
Zejian Yuan
Abstract:
Recent progress in deep learning has made it possible to automatically transform a screenshot of a Graphical User Interface (GUI) into code by using the encoder-decoder framework. While the commonly adopted image encoder (e.g., a CNN) might be capable of extracting image features to the desired level, interpreting these abstract image features into hundreds of tokens of code poses a particular challenge to the decoding power of the RNN-based code generator. Considering that the code used for describing a GUI is usually hierarchically structured, we propose a new attention-based hierarchical code generation model, which can describe GUI images at a finer level of detail, while also being able to generate hierarchically structured code consistent with the hierarchical layout of the graphic elements in the GUI. Our model follows the encoder-decoder framework, all the components of which can be trained jointly in an end-to-end manner. The experimental results show that our method outperforms other current state-of-the-art methods on both a publicly available GUI-code dataset and a dataset we established ourselves.
Submitted 26 October, 2018;
originally announced October 2018.
-
Universal Stagewise Learning for Non-Convex Problems with Convergence on Averaged Solutions
Authors:
Zaiyi Chen,
Zhuoning Yuan,
Jinfeng Yi,
Bowen Zhou,
Enhong Chen,
Tianbao Yang
Abstract:
Although the stochastic gradient descent (SGD) method and its variants (e.g., stochastic momentum methods, AdaGrad) are the algorithms of choice for solving non-convex problems (especially deep learning), there still remain big gaps between theory and practice, with many questions unresolved. For example, there is still a lack of convergence theory for SGD and its variants that use a stagewise step size and return an averaged solution in practice. In addition, theoretical insights into why the adaptive step size of AdaGrad could improve on the non-adaptive step size of SGD are still missing for non-convex optimization. This paper aims to address these questions and fill the gap between theory and practice. We propose a universal stagewise optimization framework for a broad family of non-smooth non-convex (namely weakly convex) problems with the following key features: (i) at each stage, any suitable stochastic convex optimization algorithm (e.g., SGD or AdaGrad) that returns an averaged solution can be employed for minimizing a regularized convex problem; (ii) the step size is decreased in a stagewise manner; (iii) an averaged solution is returned as the final solution, selected from all stagewise averaged solutions with sampling probabilities increasing with the stage number. Our theoretical results for stagewise AdaGrad exhibit its adaptive convergence and therefore shed light on its faster convergence than stagewise SGD for problems with sparse stochastic gradients. To the best of our knowledge, these new results are the first of their kind for addressing the unresolved issues of existing theories mentioned earlier. Besides theoretical contributions, our empirical studies show that our stagewise SGD and AdaGrad improve the generalization performance of existing variants/implementations of SGD and AdaGrad.
Submitted 5 March, 2019; v1 submitted 19 August, 2018;
originally announced August 2018.
-
Explaining Explanations: An Overview of Interpretability of Machine Learning
Authors:
Leilani H. Gilpin,
David Bau,
Ben Z. Yuan,
Ayesha Bajwa,
Michael Specter,
Lalana Kagal
Abstract:
There has recently been a surge of work in explanatory artificial intelligence (XAI). This research area tackles the important problem that complex machines and algorithms often cannot provide insights into their behavior and thought processes. XAI allows users and parts of the internal system to be more transparent, providing explanations of their decisions in some level of detail. These explanations are important to ensure algorithmic fairness, to identify potential bias/problems in the training data, and to ensure that the algorithms perform as expected. However, explanations produced by these systems are neither standardized nor systematically assessed. In an effort to create best practices and identify open challenges, we provide our definition of explainability and show how it can be used to classify existing literature. We discuss why current approaches to explanatory methods, especially for deep neural networks, are insufficient. Finally, based on our survey, we conclude with suggested future research directions for explanatory artificial intelligence.
Submitted 3 February, 2019; v1 submitted 31 May, 2018;
originally announced June 2018.