Exact phylodynamic likelihood via structured Markov genealogy processes

Authors: Aaron A. King, Qianying Lin, Edward L. Ionides

Abstract: We consider genealogies arising from a Markov population process in which individuals are categorized into a discrete collection of compartments, with the requirement that individuals within the same compartment are statistically exchangeable. When equipped with a sampling process, each such population process induces a time-evolving tree-valued process defined as the genealogy of all sampled individuals. We provide a construction of this genealogy process and derive exact expressions for the likelihood of an observed genealogy in terms of filter equations. These filter equations can be numerically solved using standard Monte Carlo integration methods. Thus, we obtain statistically efficient likelihood-based inference for essentially arbitrary compartment models based on an observed genealogy of individuals sampled from the population.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.09362 [pdf, other]

On the Saturation Effect of Kernel Ridge Regression

Authors: Yicheng Li, Haobo Zhang, Qian Lin

Abstract: The saturation effect refers to the phenomenon that kernel ridge regression (KRR) fails to achieve the information-theoretic lower bound when the smoothness of the underlying truth function exceeds a certain level. The saturation effect has been widely observed in practice, and a saturation lower bound of KRR has been conjectured for decades. In this paper, we provide a proof of this long-standing conjecture.

Submitted 28 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

Comments: ICLR 2023; Minor errors are corrected in this version

arXiv:2404.12597 [pdf, other]

The phase diagram of kernel interpolation in large dimensions

Authors: Haobo Zhang, Weihao Lu, Qian Lin

Abstract: The generalization ability of kernel interpolation in large dimensions (i.e., $n \asymp d^γ$ for some $γ>0$) might be one of the most interesting problems in the recent renaissance of kernel regression, since it may help us understand the 'benign overfitting phenomenon' reported in the neural networks literature. Focusing on the inner product kernel on the sphere, we fully characterized the exact order of both the variance and bias of large-dimensional kernel interpolation under various source conditions $s\geq 0$. Consequently, we obtained the $(s,γ)$-phase diagram of large-dimensional kernel interpolation, i.e., we determined the regions in $(s,γ)$-plane where the kernel interpolation is minimax optimal, sub-optimal and inconsistent.

Submitted 18 April, 2024; originally announced April 2024.

Comments: 18 pages, 1 figure

arXiv:2402.01148 [pdf, other]

The Optimality of Kernel Classifiers in Sobolev Space

Authors: Jianfa Lai, Zhifan Li, Dongming Huang, Qian Lin

Abstract: Kernel methods are widely used in machine learning, especially for classification problems. However, the theoretical analysis of kernel classification is still limited. This paper investigates the statistical performances of kernel classifiers. With some mild assumptions on the conditional probability $η(x)=\mathbb{P}(Y=1\mid X=x)$, we derive an upper bound on the classification excess risk of a kernel classifier using recent advances in the theory of kernel regression. We also obtain a minimax lower bound for Sobolev spaces, which shows the optimality of the proposed classifier. Our theoretical results can be extended to the generalization error of overparameterized neural network classifiers. To make our theoretical results more applicable in realistic settings, we also propose a simple method to estimate the interpolation smoothness of $2η(x)-1$ and apply the method to real datasets.

Submitted 2 February, 2024; originally announced February 2024.

Comments: 21 pages, 2 figures

MSC Class: 62G08 (Primary); 68T07; 46E22 (secondary)

ACM Class: G.3

arXiv:2309.04268 [pdf, other]

Optimal Rate of Kernel Regression in Large Dimensions

Authors: Weihao Lu, Haobo Zhang, Yicheng Li, Manyun Xu, Qian Lin

Abstract: We perform a study on kernel regression for large-dimensional data (where the sample size $n$ depends polynomially on the dimension $d$ of the samples, i.e., $n\asymp d^γ$ for some $γ>0$). We first build a general tool to characterize the upper bound and the minimax lower bound of kernel regression for large-dimensional data through the Mendelson complexity $\varepsilon_{n}^{2}$ and the metric entropy $\bar{\varepsilon}_{n}^{2}$ respectively. When the target function falls into the RKHS associated with a (general) inner product model defined on $\mathbb{S}^{d}$, we utilize the new tool to show that the minimax rate of the excess risk of kernel regression is $n^{-1/2}$ when $n\asymp d^γ$ for $γ=2, 4, 6, 8, \cdots$. We then further determine the optimal rate of the excess risk of kernel regression for all $γ>0$ and find that the curve of the optimal rate varying along $γ$ exhibits several new phenomena, including the multiple descent behavior and the periodic plateau behavior. As an application, for the neural tangent kernel (NTK), we also provide a similar explicit description of the curve of the optimal rate. As a direct corollary, we know these claims hold for wide neural networks as well.

Submitted 28 June, 2024; v1 submitted 8 September, 2023; originally announced September 2023.

MSC Class: 62G08; 46E22; 68T07

arXiv:2305.18506 [pdf, other]

Generalization Ability of Wide Residual Networks

Authors: Jianfa Lai, Zixiong Yu, Songtao Tian, Qian Lin

Abstract: In this paper, we study the generalization ability of the wide residual network on $\mathbb{S}^{d-1}$ with the ReLU activation function. We first show that as the width $m\rightarrow\infty$, the residual network kernel (RNK) uniformly converges to the residual neural tangent kernel (RNTK). This uniform convergence further guarantees that the generalization error of the residual network converges to that of the kernel regression with respect to the RNTK. As direct corollaries, we then show $i)$ the wide residual network with the early stopping strategy can achieve the minimax rate provided that the target regression function falls in the reproducing kernel Hilbert space (RKHS) associated with the RNTK; $ii)$ the wide residual network cannot generalize well if it is trained till overfitting the data. We finally illustrate some experiments to reconcile the contradiction between our theoretical result and the widely observed ``benign overfitting phenomenon''.

Submitted 29 May, 2023; originally announced May 2023.

Comments: 28 pages, 3 figures

MSC Class: 62G08 (Primary); 68T07; 46E22 (secondary)

ACM Class: G.3

arXiv:2305.02657 [pdf, other]

On the Eigenvalue Decay Rates of a Class of Neural-Network Related Kernel Functions Defined on General Domains

Authors: Yicheng Li, Zixiong Yu, Guhan Chen, Qian Lin

Abstract: In this paper, we provide a strategy to determine the eigenvalue decay rate (EDR) of a large class of kernel functions defined on a general domain rather than $\mathbb S^{d}$. This class of kernel functions includes, but is not limited to, the neural tangent kernel associated with neural networks with different depths and various activation functions. After proving that the dynamics of training the wide neural networks uniformly approximate those of the neural tangent kernel regression on general domains, we can further illustrate the minimax optimality of the wide neural network provided that the underlying truth function $f\in [\mathcal H_{\mathrm{NTK}}]^{s}$, an interpolation space associated with the RKHS $\mathcal{H}_{\mathrm{NTK}}$ of NTK. We also show that the overfitted neural network cannot generalize well. We believe our approach for determining the EDR of kernels might also be of independent interest.

Submitted 8 January, 2024; v1 submitted 4 May, 2023; originally announced May 2023.

arXiv:2302.05933 [pdf, other]

Generalization Ability of Wide Neural Networks on $\mathbb{R}$

Authors: Jianfa Lai, Manyun Xu, Rui Chen, Qian Lin

Abstract: We perform a study on the generalization ability of the wide two-layer ReLU neural network on $\mathbb{R}$. We first establish some spectral properties of the neural tangent kernel (NTK): $a)$ $K_{d}$, the NTK defined on $\mathbb{R}^{d}$, is positive definite; $b)$ $λ_{i}(K_{1})$, the $i$-th largest eigenvalue of $K_{1}$, is proportional to $i^{-2}$. We then show that: $i)$ when the width $m\rightarrow\infty$, the neural network kernel (NNK) uniformly converges to the NTK; $ii)$ the minimax rate of regression over the RKHS associated to $K_{1}$ is $n^{-2/3}$; $iii)$ if one adopts the early stopping strategy in training a wide neural network, the resulting neural network achieves the minimax rate; $iv)$ if one trains the neural network till it overfits the data, the resulting neural network cannot generalize well. Finally, we provide an explanation to reconcile our theory and the widely observed ``benign overfitting phenomenon''.

Submitted 12 February, 2023; originally announced February 2023.

Comments: 47 pages, 4 figures

MSC Class: 62G08 (Primary); 68T07 (secondary); 46E22

ACM Class: G.3

arXiv:2301.05690 [pdf, other]

doi 10.1098/rsif.2023.0310

Tunable robustness in power-law inference

Authors: Qianying Lin, Mitchell Newberry

Abstract: Power-law probability distributions arise often in the social and natural sciences. Statistics have been developed for estimating the exponent parameter as well as gauging goodness-of-fit to a power law. Yet paradoxically, many famous power laws such as the distribution of wealth and earthquake magnitudes have not found good statistical support in data by modern methods. We show that measurement errors such as quantization and noise bias both maximum-likelihood estimators and goodness-of-fit measures. We address this issue using logarithmic binning and the corresponding discrete reference distribution for maximum likelihood estimators and Kolmogorov-Smirnov statistics. Using simulated errors, we validate that binning attenuates bias in parameter estimates and recalibrates goodness of fit to a power law by removing small errors from consideration. These benefits come at modest cost in statistical power, which can be compensated with larger sample sizes. We reanalyse three empirical cases of wealth, earthquake magnitudes and wildfire area and show that binning reverses statistical conclusions and aligns the statistical results with historical and scientific expectations. We explain through these cases how routine errors lead to incorrect conclusions and the necessity for more robust methods.

Submitted 13 January, 2023; originally announced January 2023.

arXiv:2212.12603 [pdf, ps, other]

Stochastic Methods for AUC Optimization subject to AUC-based Fairness Constraints

Authors: Yao Yao, Qihang Lin, Tianbao Yang

Abstract: As machine learning is being used increasingly in making high-stakes decisions, an arising challenge is to avoid unfair AI systems that lead to discriminatory decisions for protected populations. A direct approach for obtaining a fair predictive model is to train the model through optimizing its prediction performance subject to fairness constraints, which achieves Pareto efficiency when trading off performance against fairness. Among various fairness metrics, the ones based on the area under the ROC curve (AUC) are emerging recently because they are threshold-agnostic and effective for unbalanced data. In this work, we formulate the training problem of a fairness-aware machine learning model as an AUC optimization problem subject to a class of AUC-based fairness constraints. This problem can be reformulated as a min-max optimization problem with min-max constraints, which we solve by stochastic first-order methods based on a new Bregman divergence designed for the special structure of the problem. We numerically demonstrate the effectiveness of our approach on real-world data under different fairness metrics.

Submitted 22 February, 2023; v1 submitted 23 December, 2022; originally announced December 2022.

Comments: Published in AISTATS 2023

arXiv:2203.01505 [pdf, ps, other]

Large-scale Optimization of Partial AUC in a Range of False Positive Rates

Authors: Yao Yao, Qihang Lin, Tianbao Yang

Abstract: The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning. However, it summarizes the true positive rates (TPRs) over all false positive rates (FPRs) in the ROC space, which may include the FPRs with no practical relevance in some applications. The partial AUC, as a generalization of the AUC, summarizes only the TPRs over a specific range of the FPRs and is thus a more suitable performance measure in many real-world situations. Although partial AUC optimization in a range of FPRs had been studied, existing algorithms are not scalable to big data and not applicable to deep learning. To address this challenge, we cast the problem into a non-smooth difference-of-convex (DC) program for any smooth predictive functions (e.g., deep neural networks), which allowed us to develop an efficient approximated gradient descent method based on the Moreau envelope smoothing technique, inspired by recent advances in non-smooth DC optimization. To increase the efficiency of large data processing, we used an efficient stochastic block coordinate update in our algorithm. Our proposed algorithm can also be used to minimize the sum of ranked range loss, which also lacks efficient solvers. We established a complexity of $\tilde O(1/ε^6)$ for finding a nearly $ε$-critical solution. Finally, we numerically demonstrated the effectiveness of our proposed algorithms for both partial AUC maximization and sum of ranked range loss minimization.

Submitted 27 October, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

arXiv:2202.01163 [pdf, other]

A Recommender System Based on a Double Feature Allocation Model

Authors: Qiaohui Lin, Peter Mueller

Abstract: A collaborative filtering recommender system predicts user preferences by discovering common features among users and items. We implement such inference using a Bayesian double feature allocation model, that is, a model for random pairs of subsets. We use an Indian buffet process (IBP) to link users and items to features. Here a feature is a subset of users and a matching subset of items. By training feature-specific rating effects, we predict ratings. We use MovieLens Data to demonstrate posterior inference in the model and prediction of user preferences for unseen items compared to items they have previously rated. Part of the implementation is a novel semi-consensus Monte Carlo method to accommodate large numbers of users and items, as is typical for related applications. The proposed approach implements parallel posterior sampling in multiple shards of users while sharing item-related global parameters across shards.

Submitted 2 February, 2022; originally announced February 2022.

arXiv:2112.07755 [pdf, other]

Separate Exchangeability as Modeling Principle in Bayesian Nonparametrics

Authors: Giovanni Rebaudo, Qiaohui Lin, Peter Mueller

Abstract: We argue for the use of separate exchangeability as a modeling principle in Bayesian nonparametric (BNP) inference. Separate exchangeability is \emph{de facto} widely applied in the Bayesian parametric case, e.g., it naturally arises in simple mixed models. However, while in some areas, such as random graphs, separate and (closely related) joint exchangeability are widely used, it is curiously underused for several other applications in BNP. We briefly review the definition of separate exchangeability focusing on the implications of such a definition in Bayesian modeling. We then discuss two tractable classes of models that implement separate exchangeability that are the natural counterparts of familiar partially exchangeable BNP models. The first is nested random partitions for a data matrix, defining a partition of columns and nested partitions of rows, nested within column clusters. Many recent models for nested partitions implement partially exchangeable models related to variations of the well-known nested Dirichlet process. We argue that inference under such models in some cases ignores important features of the experimental setup. We obtain the separately exchangeable counterpart of such partially exchangeable partition structures. The second class is about setting up separately exchangeable priors for a nonparametric regression model when multiple sets of experimental units are involved. We highlight how a Dirichlet process mixture of linear models known as ANOVA DDP can naturally implement separate exchangeability in such regression problems. Finally, we illustrate how to perform inference under such models in two real data examples.

Submitted 20 June, 2024; v1 submitted 14 December, 2021; originally announced December 2021.

arXiv:2106.15400 [pdf, other]

Online Interaction Detection for Click-Through Rate Prediction

Authors: Qiuqiang Lin, Chuanhou Gao

Abstract: Click-Through Rate prediction aims to predict the ratio of clicks to impressions of a specific link. This is a challenging task since (1) there are usually categorical features, and the inputs will be extremely high-dimensional if one-hot encoding is applied, (2) not only the original features but also their interactions are important, (3) an effective prediction may rely on different features and interactions in different time periods. To overcome these difficulties, we propose a new interaction detection method, named Online Random Intersection Chains. The method, which is based on the idea of frequent itemset mining, detects informative interactions by observing the intersections of randomly chosen samples. The discovered interactions enjoy high interpretability as they can be comprehended as logical expressions. ORIC can be updated every time new data is collected, without being retrained on historical data. What's more, the importance of the historical and latest data can be controlled by a tuning parameter. A framework is designed to deal with the streaming interactions, so almost all existing models for CTR prediction can be applied after interaction detection. Empirical results demonstrate the efficiency and effectiveness of ORIC on three benchmark datasets.

Submitted 27 June, 2021; originally announced June 2021.

Comments: 11 pages, 4 figures, 1 supplement

arXiv:2105.12730 [pdf, other]

doi 10.1016/j.tpb.2021.11.003

Markov Genealogy Processes

Authors: Aaron A. King, Qianying Lin, Edward L. Ionides

Abstract: We construct a family of genealogy-valued Markov processes that are induced by a continuous-time Markov population process. We derive exact expressions for the likelihood of a given genealogy conditional on the history of the underlying population process. These lead to a nonlinear filtering equation which can be used to design efficient Monte Carlo inference algorithms. We demonstrate these calculations with several examples. Existing full-information approaches for phylodynamic inference are special cases of the theory.

Submitted 24 January, 2022; v1 submitted 26 May, 2021; originally announced May 2021.

MSC Class: 60J99

Journal ref: Theoretical Population Biology 143:77-91 (2022)

arXiv:2104.04714 [pdf, other]

Random Intersection Chains

Authors: Qiuqiang Lin, Chuanhou Gao

Abstract: Interactions between several features sometimes play an important role in prediction tasks. But taking all the interactions into consideration will lead to an extremely heavy computational burden. For categorical features, the situation is more complicated since the input will be extremely high-dimensional and sparse if one-hot encoding is applied. Inspired by association rule mining, we propose a method that selects interactions of categorical features, called Random Intersection Chains. It uses random intersections to detect frequent patterns, then selects the most meaningful ones among them. At first a number of chains are generated, in which each node is the intersection of the previous node and a randomly chosen observation. The frequency of patterns in the tail nodes is estimated by maximum likelihood estimation, then the patterns with the largest estimated frequency are selected. After that, their confidence is calculated by Bayes' theorem. The most confident patterns are finally returned by Random Intersection Chains. We show that if the number and length of chains are appropriately chosen, the patterns in the tail nodes are indeed the most frequent ones in the data set. We analyze the computational complexity of the proposed algorithm and prove the convergence of the estimators. The results of a series of experiments verify the efficiency and effectiveness of the algorithm.

Submitted 10 April, 2021; originally announced April 2021.

arXiv:2009.06170 [pdf, other]

Trading off Accuracy for Speedup: Multiplier Bootstraps for Subgraph Counts

Authors: Qiaohui Lin, Robert Lunde, Purnamrita Sarkar

Abstract: We propose a new class of multiplier bootstraps for count functionals, ranging from a fast, approximate linear bootstrap tailored to sparse, massive graphs to a quadratic bootstrap procedure that offers refined accuracy for smaller, denser graphs. For the fast, approximate linear bootstrap, we show that $\sqrt{n}$-consistent inference of the count functional is attainable in certain computational regimes that depend on the sparsity level of the graph. Furthermore, even in more challenging regimes, we prove that our bootstrap procedure offers valid coverage and vanishing confidence intervals. For the quadratic bootstrap, we establish an Edgeworth expansion and show that this procedure offers higher-order accuracy under appropriate sparsity conditions. We complement our theoretical results with a simulation study and real data analysis and verify that our procedure offers state-of-the-art performance for several functionals.

Submitted 7 April, 2022; v1 submitted 13 September, 2020; originally announced September 2020.

arXiv:2004.08935 [pdf, other]

On the Theoretical Properties of the Network Jackknife

Authors: Qiaohui Lin, Robert Lunde, Purnamrita Sarkar

Abstract: We study the properties of a leave-node-out jackknife procedure for network data. Under the sparse graphon model, we prove an Efron-Stein-type inequality, showing that the network jackknife leads to conservative estimates of the variance (in expectation) for any network functional that is invariant to node permutation. For a general class of count functionals, we also establish consistency of the network jackknife. We complement our theoretical analysis with a range of simulated and real-data examples and show that the network jackknife offers competitive performance in cases where other resampling methods are known to be valid. In fact, for several network statistics, we see that the jackknife provides more accurate inferences compared to related methods such as subsampling.

Submitted 21 April, 2020; v1 submitted 19 April, 2020; originally announced April 2020.

arXiv:2002.12761 [pdf, other]

DIHARD II is Still Hard: Experimental Results and Discussions from the DKU-LENOVO Team

Authors: Qingjian Lin, Weicheng Cai, Lin Yang, Junjie Wang, Jun Zhang, Ming Li

Abstract: In this paper, we present the submitted system for the second DIHARD Speech Diarization Challenge from the DKU-LENOVO team. Our diarization system includes multiple modules, namely voice activity detection (VAD), segmentation, speaker embedding extraction, similarity scoring, clustering, resegmentation and overlap detection. For each module, we explore different techniques to enhance performance. Our final submission employs the ResNet-LSTM based VAD, the Deep ResNet based speaker embedding, the LSTM based similarity scoring and spectral clustering. Variational Bayes (VB) diarization is applied in the resegmentation stage and overlap detection also brings slight improvement. Our proposed system achieves 18.84% DER in Track1 and 27.90% DER in Track2. Although our systems have reduced the DERs by 27.5% and 31.7% relatively against the official baselines, we believe that the diarization task is still very difficult.

Submitted 4 May, 2020; v1 submitted 23 February, 2020; originally announced February 2020.

Comments: Submitted to Odyssey 2020

arXiv:2002.11184 [pdf, other]

The Sampled Moran Genealogy Process

Authors: Aaron A. King, Qianying Lin, Edward L. Ionides

Abstract: We define the Sampled Moran Genealogy Process, a continuous-time Markov process on the space of genealogies with the demography of the classical Moran process, sampled through time. To do so, we begin by defining the Moran Genealogy Process using a novel representation. We then extend this process to include sampling through time. We derive exact conditional and marginal probability distributions for the sampled process under a stationarity assumption, and an exact expression for the likelihood of any sequence of genealogies it generates. This leads to some interesting observations pertinent to existing phylodynamic methods in the literature. Throughout, our proofs are original and make use of strictly forward-in-time calculations and are exact for all population sizes and sampling processes.

Submitted 19 October, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

MSC Class: 60J99

arXiv:2002.05309 [pdf, ps, other]

Optimal Epoch Stochastic Gradient Descent Ascent Methods for Min-Max Optimization

Authors: Yan Yan, Yi Xu, Qihang Lin, Wei Liu, Tianbao Yang

Abstract: Epoch gradient descent method (a.k.a. Epoch-GD) proposed by Hazan and Kale (2011) was deemed a breakthrough for stochastic strongly convex minimization, which achieves the optimal convergence rate of $O(1/T)$ with $T$ iterative updates for the {\it objective gap}. However, its extension to solving stochastic min-max problems with strong convexity and strong concavity still remains open, and it is still unclear whether a fast rate of $O(1/T)$ for the {\it duality gap} is achievable for stochastic min-max optimization under strong convexity and strong concavity. Although some recent studies have proposed stochastic algorithms with fast convergence rates for min-max problems, they require additional assumptions about the problem, e.g., smoothness, bi-linear structure, etc. In this paper, we bridge this gap by providing a sharp analysis of epoch-wise stochastic gradient descent ascent method (referred to as Epoch-GDA) for solving strongly convex strongly concave (SCSC) min-max problems, without imposing any additional assumption about smoothness or the function's structure. To the best of our knowledge, our result is the first one that shows Epoch-GDA can achieve the optimal rate of $O(1/T)$ for the duality gap of general SCSC min-max problems. We emphasize that such generalization of Epoch-GD for strongly convex minimization problems to Epoch-GDA for SCSC min-max problems is non-trivial and requires novel technical analysis. Moreover, we notice that the key lemma can also be used for proving the convergence of Epoch-GDA for weakly-convex strongly-concave min-max problems, leading to a nearly optimal complexity without resorting to smoothness or other structural conditions.

Submitted 17 June, 2020; v1 submitted 12 February, 2020; originally announced February 2020.

arXiv:2002.04180 [pdf, other]

LoCEC: Local Community-based Edge Classification in Large Online Social Networks

Authors: Chonggang Song, Qian Lin, Guohui Ling, Zongyi Zhang, Hongzhao Chen, Jun Liao, Chuan Chen

Abstract: Relationships in online social networks often imply social connections in the real world. An accurate understanding of relationship types benefits many applications, e.g. social advertising and recommendation. Some recent attempts have been proposed to classify user relationships into predefined types with the help of pre-labeled relationships or abundant interaction features on relationships. Unfortunately, both relationship feature data and label data are very sparse in real social platforms like WeChat, rendering existing methods inapplicable. In this paper, we present an in-depth analysis of WeChat relationships to identify the major challenges for the relationship classification task. To tackle the challenges, we propose a Local Community-based Edge Classification (LoCEC) framework that classifies user relationships in a social network into real-world social connection types. LoCEC enforces a three-phase processing, namely local community detection, community classification and relationship classification, to address the sparsity issue of relationship features and relationship labels. Moreover, LoCEC is designed to handle large-scale networks by allowing parallel and distributed processing. We conduct extensive experiments on the real-world WeChat network with hundreds of billions of edges to validate the effectiveness and efficiency of LoCEC.

Submitted 20 March, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

arXiv:2001.02798 [pdf, other]

Self-guided Approximate Linear Programs

Authors: Parshan Pakiman, Selvaprabu Nadarajah, Negar Soheili, Qihang Lin

Abstract: Approximate linear programs (ALPs) are well-known models based on value function approximations (VFAs) to obtain policies and lower bounds on the optimal policy cost of discounted-cost Markov decision processes (MDPs). Formulating an ALP requires (i) basis functions, the linear combination of which defines the VFA, and (ii) a state-relevance distribution, which determines the relative importance of different states in the ALP objective for the purpose of minimizing VFA error. Both these choices are typically heuristic: basis function selection relies on domain knowledge while the state-relevance distribution is specified using the frequency of states visited by a heuristic policy. We propose a self-guided sequence of ALPs that embeds random basis functions obtained via inexpensive sampling and uses the known VFA from the previous iteration to guide VFA computation in the current iteration. Self-guided ALPs mitigate the need for domain knowledge during basis function selection as well as the impact of the initial choice of the state-relevance distribution, thus significantly reducing the ALP implementation burden. We establish high probability error bounds on the VFAs from this sequence and show that a worst-case measure of policy performance is improved. We find that these favorable implementation and theoretical properties translate to encouraging numerical results on perishable inventory control and options pricing applications, where self-guided ALP policies improve upon policies from problem-specific methods. More broadly, our research takes a meaningful step toward application-agnostic policies and bounds for MDPs.

Submitted 12 October, 2021; v1 submitted 8 January, 2020; originally announced January 2020.

Comments: 52 pages

MSC Class: 90C39; 90C40; 90C05; 90C06; 90C15; 90C22; 90C90; 46C07; 93E20; 93E35; 68T99; 65K99 ACM Class: I.2.8; G.1.2; G.3

arXiv:2001.01006 [pdf, other]

Review of Single-cell RNA-seq Data Clustering for Cell Type Identification and Characterization

Authors: Shixiong Zhang, Xiangtao Li, Qiuzhen Lin, Ka-Chun Wong

Abstract: In recent years, the advances in single-cell RNA-seq techniques have enabled us to perform large-scale transcriptomic profiling at single-cell resolution in a high-throughput manner. Unsupervised learning such as data clustering has become the central component to identify and characterize novel cell types and gene expression patterns. In this study, we review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations. In addition, we also review the upstream single-cell RNA-seq data processing techniques such as quality control, normalization, and dimension reduction. We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on two single-cell transcriptomic datasets.

Submitted 3 January, 2020; originally announced January 2020.

arXiv:1910.11537 [pdf, other]

doi 10.1109/ACCESS.2020.2987033

Unified model selection approach based on minimum description length principle in Granger causality analysis

Authors: Fei Li, Xuewei Wang, Qiang Lin, Zhenghui Hu

Abstract: Granger causality analysis (GCA) provides a powerful tool for uncovering the patterns of brain connectivity mechanism using neuroimaging techniques. Conventional GCA applies two different mathematical theories in a two-stage scheme: (1) the Bayesian information criterion (BIC) or Akaike information criterion (AIC) for the regression model orders associated with endogenous and exogenous information; (2) F-statistics for determining the causal effects of exogenous variables. While specifying endogenous and exogenous effects is essentially the same model selection problem, this could produce different benchmarks in the two stages and therefore degrade the performance of GCA. In this course, we present a unified model selection approach based on the minimum description length (MDL) principle for GCA in the context of the general regression model paradigm. Compared with conventional methods, our approach emphasizes that a single mathematical theory should be held throughout the GCA process. Under this framework, all candidate models within the model space might be compared freely in the context of the code length, without the need for an intermediate model. We illustrate its advantages over the conventional two-stage GCA approach in synthetic experiments on a 3-node network and a 5-node network. The unified model selection approach is capable of identifying the actual connectivity while avoiding the false influences of noise. More importantly, the proposed approach obtained more consistent results in a challenging fMRI dataset for causality investigation, mental calculation network under visual and auditory stimulus, respectively. The proposed approach has potential to accommodate other Granger causality representations in other function spaces. The comparison between different GC representations in different function spaces can also be naturally dealt with in the framework.

Submitted 19 March, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

arXiv:1910.07099 [pdf, other]

Entire Space Multi-Task Modeling via Post-Click Behavior Decomposition for Conversion Rate Prediction

Authors: Hong Wen, Jing Zhang, Yuan Wang, Fuyu Lv, Wentian Bao, Quan Lin, Keping Yang

Abstract: Recommender system, as an essential part of modern e-commerce, consists of two fundamental modules, namely Click-Through Rate (CTR) and Conversion Rate (CVR) prediction. While CVR has a direct impact on the purchasing volume, its prediction is well-known to be challenging due to the Sample Selection Bias (SSB) and Data Sparsity (DS) issues. Although existing methods, typically built on the user sequential behavior path ``impression$\to$click$\to$purchase'', are effective for dealing with the SSB issue, they still struggle to address the DS issue due to rare purchase training samples. Observing that users always take several purchase-related actions after clicking, we propose a novel idea of post-click behavior decomposition. Specifically, disjoint purchase-related Deterministic Action (DAction) and Other Action (OAction) are inserted between click and purchase in parallel, forming a novel user sequential behavior graph ``impression$\to$click$\to$D(O)Action$\to$purchase''. Defining the model on this graph makes it possible to leverage all the impression samples over the entire space and extra abundant supervised signals from D(O)Action, which will effectively address the SSB and DS issues together. To this end, we devise a novel deep recommendation model named Elaborated Entire Space Supervised Multi-task Model ($ESM^{2}$). According to the conditional probability rule defined on the graph, it employs multi-task learning to predict some decomposed sub-targets in parallel and compose them sequentially to formulate the final CVR. Extensive experiments on both offline and online environments demonstrate the superiority of $ESM^{2}$ over state-of-the-art models. The source code and dataset will be released.

Submitted 9 June, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

Comments: 10 pages, 7 figures. Accepted by SIGIR 2020. The source code will be released at https://github.com/chaimi2013/ESM2

arXiv:1909.10467 [pdf, other]

Model-Agnostic Linear Competitors -- When Interpretable Models Compete and Collaborate with Black-Box Models

Authors: Hassan Rafique, Tong Wang, Qihang Lin

Abstract: Driven by an increasing need for model interpretability, interpretable models have become strong competitors for black-box models in many real applications. In this paper, we propose a novel type of model where interpretable models compete and collaborate with black-box models. We present the Model-Agnostic Linear Competitors (MALC) for partially interpretable classification. MALC is a hybrid model that uses linear models to locally substitute any black-box model, capturing subspaces that are most likely to be in a class while leaving the rest of the data to the black-box. MALC brings together the interpretable power of linear models and good predictive performance of a black-box model. We formulate the training of a MALC model as a convex optimization. The predictive accuracy and transparency (defined as the percentage of data captured by the linear models) balance through a carefully designed objective function and the optimization problem is solved with the accelerated proximal gradient method. Experiments show that MALC can effectively trade prediction accuracy for transparency and provide an efficient frontier that spans the entire spectrum of transparency.

Submitted 23 September, 2019; originally announced September 2019.

arXiv:1908.03077 [pdf, ps, other]

A Data Efficient and Feasible Level Set Method for Stochastic Convex Optimization with Expectation Constraints

Authors: Qihang Lin, Selvaprabu Nadarajah, Negar Soheili, Tianbao Yang

Abstract: Stochastic convex optimization problems with expectation constraints (SOECs) are encountered in statistics and machine learning, business, and engineering. In data-rich environments, the SOEC objective and constraints contain expectations defined with respect to large datasets. Therefore, efficient algorithms for solving such SOECs need to limit the fraction of data points that they use, which we refer to as algorithmic data complexity. Recent stochastic first order methods exhibit low data complexity when handling SOECs but guarantee near-feasibility and near-optimality only at convergence. These methods may thus return highly infeasible solutions when heuristically terminated, as is often the case, due to theoretical convergence criteria being highly conservative. This issue limits the use of first order methods in several applications where the SOEC constraints encode implementation requirements. We design a stochastic feasible level set method (SFLS) for SOECs that has low data complexity and emphasizes feasibility before convergence. Specifically, our level-set method solves a root-finding problem by calling a novel first order oracle that computes a stochastic upper bound on the level-set function by extending mirror descent and online validation techniques. We establish that SFLS maintains a high-probability feasible solution at each root-finding iteration and exhibits favorable iteration complexity compared to state-of-the-art deterministic feasible level set and stochastic subgradient methods. Numerical experiments on three diverse applications validate the low data complexity of SFLS relative to the former approach and highlight how SFLS finds feasible solutions with small optimality gaps significantly faster than the latter method.

Submitted 1 January, 2020; v1 submitted 7 August, 2019; originally announced August 2019.

arXiv:1907.10393 [pdf, other]

doi 10.21437/Interspeech.2019-1388

LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization

Authors: Qingjian Lin, Ruiqing Yin, Ming Li, Hervé Bredin, Claude Barras

Abstract: More and more neural network approaches have achieved considerable improvement upon submodules of speaker diarization system, including speaker change detection and segment-wise speaker embedding extraction. Still, in the clustering stage, traditional algorithms like probabilistic linear discriminant analysis (PLDA) are widely used for scoring the similarity between two speech segments. In this paper, we propose a supervised method to measure the similarity matrix between all segments of an audio recording with sequential bidirectional long short-term memory networks (Bi-LSTM). Spectral clustering is applied on top of the similarity matrix to further improve the performance. Experimental results show that our system significantly outperforms the state-of-the-art methods and achieves a diarization error rate of 6.63% on the NIST SRE 2000 CALLHOME database.

Submitted 23 July, 2019; originally announced July 2019.

Comments: Accepted for INTERSPEECH 2019

arXiv:1905.07835 [pdf, other]

Label Mapping Neural Networks with Response Consolidation for Class Incremental Learning

Authors: Xu Zhang, Yang Yao, Baile Xu, Lekun Mao, Furao Shen, Jian Zhao, Qingwei Lin

Abstract: Class incremental learning refers to a special multi-class classification task, in which the number of classes is not fixed but is increasing with the continual arrival of new data. Existing research has mainly focused on solving the catastrophic forgetting problem in class incremental learning. To this end, however, these models still require the old classes cached in the auxiliary data structure or models, which is inefficient in space or time. This paper is the first to discuss the difficulty that arises without support of old classes in class incremental learning, which is called the softmax suppression problem. To address these challenges, we develop a new model named Label Mapping with Response Consolidation (LMRC), which need not access the old classes anymore. We propose the Label Mapping algorithm combined with the multi-head neural network for mitigating the softmax suppression problem, and propose the Response Consolidation method to overcome the catastrophic forgetting problem. Experimental results on the benchmark datasets show that our proposed method achieves much better performance compared to the related methods in different scenarios.

Submitted 19 May, 2019; originally announced May 2019.

arXiv:1905.04241 [pdf, other]

Hybrid Predictive Model: When an Interpretable Model Collaborates with a Black-box Model

Authors: Tong Wang, Qihang Lin

Abstract: Interpretable machine learning has become a strong competitor for traditional black-box models. However, the possible loss of the predictive performance for gaining interpretability is often inevitable, putting practitioners in a dilemma of choosing between high accuracy (black-box models) and interpretability (interpretable models). In this work, we propose a novel framework for building a Hybrid Predictive Model (HPM) that integrates an interpretable model with any black-box model to combine their strengths. The interpretable model substitutes the black-box model on a subset of data where the black-box is overkill or nearly overkill, gaining transparency at no or low cost of the predictive accuracy. We design a principled objective function that considers predictive accuracy, model interpretability, and model transparency (defined as the percentage of data processed by the interpretable substitute.) Under this framework, we propose two hybrid models, one substituting with association rules and the other with linear models, and we design customized training algorithms for both models. We test the hybrid models on structured data and text data where interpretable models collaborate with various state-of-the-art black-box models. Results show that hybrid models obtain an efficient trade-off between transparency and predictive performance, characterized by our proposed efficient frontiers.

Submitted 10 May, 2019; originally announced May 2019.

arXiv:1904.10112 [pdf, other]

Stochastic Primal-Dual Algorithms with Faster Convergence than $O(1/\sqrt{T})$ for Problems without Bilinear Structure

Authors: Yan Yan, Yi Xu, Qihang Lin, Lijun Zhang, Tianbao Yang

Abstract: Previous studies on stochastic primal-dual algorithms for solving min-max problems with faster convergence heavily rely on the bilinear structure of the problem, which restricts their applicability to a narrowed range of problems. The main contribution of this paper is the design and analysis of new stochastic primal-dual algorithms that use a mixture of stochastic gradient updates and a logarithmic number of deterministic dual updates for solving a family of convex-concave problems with no bilinear structure assumed. Faster convergence rates than $O(1/\sqrt{T})$ with $T$ being the number of stochastic gradient updates are established under some mild conditions of involved functions on the primal and the dual variable. For example, for a family of problems that enjoy a weak strong convexity in terms of the primal variable and has a strongly concave function of the dual variable, the convergence rate of the proposed algorithm is $O(1/T)$. We also investigate the effectiveness of the proposed algorithms for learning robust models and empirical AUC maximization.

Submitted 18 December, 2019; v1 submitted 22 April, 2019; originally announced April 2019.

arXiv:1903.00070 [pdf, other]

Learning to Plan in High Dimensions via Neural Exploration-Exploitation Trees

Authors: Binghong Chen, Bo Dai, Qinjie Lin, Guo Ye, Han Liu, Le Song

Abstract: We propose a meta path planning algorithm named \emph{Neural Exploration-Exploitation Trees~(NEXT)} for learning from prior experience for solving new path planning problems in high dimensional continuous state and action spaces. Compared to more classical sampling-based methods like RRT, our approach achieves much better sample efficiency in high-dimensions and can benefit from prior experience of planning in similar environments. More specifically, NEXT exploits a novel neural architecture which can learn promising search directions from problem structures. The learned prior is then integrated into a UCB-type algorithm to achieve an online balance between \emph{exploration} and \emph{exploitation} when solving a new problem. We conduct thorough experiments to show that NEXT accomplishes new planning problems with more compact search trees and significantly outperforms state-of-the-art methods on several benchmarks.

Submitted 23 February, 2020; v1 submitted 28 February, 2019; originally announced March 2019.

Comments: 26 pages, 74 figures, ICLR 2020 spotlight

arXiv:1811.11829 [pdf, other]

Stochastic Optimization for DC Functions and Non-smooth Non-convex Regularizers with Non-asymptotic Convergence

Authors: Yi Xu, Qi Qi, Qihang Lin, Rong Jin, Tianbao Yang

Abstract: Difference of convex (DC) functions cover a broad family of non-convex and possibly non-smooth and non-differentiable functions, and have wide applications in machine learning and statistics. Although deterministic algorithms for DC functions have been extensively studied, stochastic optimization that is more suitable for learning with big data remains under-explored. In this paper, we propose new stochastic optimization algorithms and study their first-order convergence theories for solving a broad family of DC functions. We improve the existing algorithms and theories of stochastic optimization for DC functions from both practical and theoretical perspectives. On the practical side, our algorithm is more user-friendly without requiring a large mini-batch size and more efficient by saving unnecessary computations. On the theoretical side, our convergence analysis does not necessarily require the involved functions to be smooth with Lipschitz continuous gradient. Instead, the convergence rate of the proposed stochastic algorithm is automatically adaptive to the Hölder continuity of the gradient of one component function. Moreover, we extend the proposed stochastic algorithms for DC functions to solve problems with a general non-convex non-differentiable regularizer, which does not necessarily have a DC decomposition but enjoys an efficient proximal mapping. To the best of our knowledge, this is the first work that gives the first non-asymptotic convergence for solving non-convex optimization whose objective has a general non-convex non-differentiable regularizer.

Submitted 4 February, 2019; v1 submitted 28 November, 2018; originally announced November 2018.

Comments: In the revised version, we present some improved complexity results for non-smooth and non-convex regularizers and for functions with known Hölder continuity parameter $ν\in(0,1]$ by a simple change of an algorithmic parameter

arXiv:1810.10207 [pdf, other]

First-order Convergence Theory for Weakly-Convex-Weakly-Concave Min-max Problems

Authors: Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang

Abstract: In this paper, we consider first-order convergence theory and algorithms for solving a class of non-convex non-concave min-max saddle-point problems, whose objective function is weakly convex in the variables of minimization and weakly concave in the variables of maximization. It has many important applications in machine learning including training Generative Adversarial Nets (GANs). We propose an algorithmic framework motivated by the inexact proximal point method, where the weakly monotone variational inequality (VI) corresponding to the original min-max problem is solved through approximately solving a sequence of strongly monotone VIs constructed by adding a strongly monotone mapping to the original gradient mapping. We prove first-order convergence to a nearly stationary solution of the original min-max problem of the generic algorithmic framework and establish different rates by employing different algorithms for solving each strongly monotone VI. Experiments verify the convergence theory and also demonstrate the effectiveness of the proposed methods on training GANs.

Submitted 7 July, 2021; v1 submitted 24 October, 2018; originally announced October 2018.

Comments: Accepted by Journal of Machine Learning Research (JMLR)

arXiv:1810.08559 [pdf, other]

EdgeSpeechNets: Highly Efficient Deep Neural Networks for Speech Recognition on the Edge

Authors: Zhong Qiu Lin, Audrey G. Chung, Alexander Wong

Abstract: Despite showing state-of-the-art performance, deep learning for speech recognition remains challenging to deploy in on-device edge scenarios such as mobile and other consumer devices. Recently, there have been greater efforts in the design of small, low-footprint deep neural networks (DNNs) that are more appropriate for edge devices, with much of the focus on design principles for hand-crafting efficient network architectures. In this study, we explore a human-machine collaborative design strategy for building low-footprint DNN architectures for speech recognition through a marriage of human-driven principled network design prototyping and machine-driven design exploration. The efficacy of this design strategy is demonstrated through the design of a family of highly-efficient DNNs (nicknamed EdgeSpeechNets) for limited-vocabulary speech recognition. Experimental results using the Google Speech Commands dataset for limited-vocabulary speech recognition showed that EdgeSpeechNets have higher accuracies than state-of-the-art DNNs (with the best EdgeSpeechNet achieving ~97% accuracy), while achieving significantly smaller network sizes (as much as 7.8x smaller) and lower computational cost (as much as 36x fewer multiply-add operations, 10x lower prediction latency, and 16x smaller memory footprint on a Motorola Moto E phone), making them very well-suited for on-device edge voice interface applications.

Submitted 13 November, 2018; v1 submitted 17 October, 2018; originally announced October 2018.

Comments: 4 pages

arXiv:1810.04472

Domain Confusion with Self Ensembling for Unsupervised Adaptation

Authors: Jiawei Wang, Zhaoshui He, Chengjian Feng, Zhouping Zhu, Qinzhuang Lin, Jun Lv, Shengli Xie

Abstract: Data collection and annotation are time-consuming in machine learning, especially for large-scale problems. A common approach to this problem is to transfer knowledge from a related labeled domain to a target one. There are two popular ways to achieve this goal: adversarial learning and self-training. In this article, we first analyze the training instability problem and the mistaken confusion issue in the adversarial learning process. Then, inspired by domain confusion and self-ensembling methods, we propose a combined model to learn a feature and class jointly invariant representation, namely Domain Confusion with Self Ensembling (DCSE). The experiments verified that our proposed approach can offer better performance than empirical art in a variety of unsupervised domain adaptation benchmarks.

Submitted 8 July, 2020; v1 submitted 10 October, 2018; originally announced October 2018.

Comments: The expression is ambiguous, which is not convenient for readers to understand, and in today's view, the conclusion of the paper is of little significance, so it is no longer open

arXiv:1808.10396 [pdf, other]

A Unified Analysis of Stochastic Momentum Methods for Deep Learning

Authors: Yan Yan, Tianbao Yang, Zhe Li, Qihang Lin, Yi Yang

Abstract: Stochastic momentum methods have been widely adopted in training deep neural networks. However, their theoretical analysis of convergence of the training objective and the generalization error for prediction is still under-explored. This paper aims to bridge the gap between practice and theory by analyzing the stochastic gradient (SG) method, and the stochastic momentum methods including two famous variants, i.e., the stochastic heavy-ball (SHB) method and the stochastic variant of Nesterov's accelerated gradient (SNAG) method. We propose a framework that unifies the three variants. We then derive the convergence rates of the norm of gradient for the non-convex optimization problem, and analyze the generalization performance through the uniform stability approach. Particularly, the convergence analysis of the training objective exhibits that SHB and SNAG have no advantage over SG. However, the stability analysis shows that the momentum term can improve the stability of the learned model and hence improve the generalization performance. These theoretical insights verify the common wisdom and are also corroborated by our empirical analysis on deep learning.

Submitted 30 August, 2018; originally announced August 2018.

Comments: Previous Technical Report: arXiv:1604.03257

Journal ref: In IJCAI, pp. 2955-2961. 2018

arXiv:1807.01635 [pdf, other]

Randomization Inference for Peer Effects

Authors: Xinran Li, Peng Ding, Qian Lin, Dawei Yang, Jun S. Liu

Abstract: Many previous causal inference studies require no interference, that is, the potential outcomes of a unit do not depend on the treatments of other units. However, this no-interference assumption becomes unreasonable when a unit interacts with other units in the same group or cluster. In a motivating application, a university in China admits students through two channels: the college entrance exam (also known as Gaokao) and recommendation (often based on Olympiads in various subjects). The university randomly assigns students to dorms, each of which hosts four students. Students within the same dorm live together and have extensive interactions. Therefore, it is likely that peer effects exist and the no-interference assumption does not hold. It is important to understand peer effects, because they give useful guidance for future roommate assignment to improve the performance of students. We define peer effects using potential outcomes. We then propose a randomization-based inference framework to study peer effects with arbitrary numbers of peers and peer types. Our inferential procedure does not assume any parametric model on the outcome distribution. Our analysis gives useful practical guidance for policy makers of the university in China.

Submitted 20 December, 2018; v1 submitted 4 July, 2018; originally announced July 2018.

arXiv:1805.09484 [pdf, other]

Multi-Level Deep Cascade Trees for Conversion Rate Prediction in Recommendation System

Authors: Hong Wen, Jing Zhang, Quan Lin, Keping Yang, Pipei Huang

Abstract: Developing effective and efficient recommendation methods is very challenging for modern e-commerce platforms. Generally speaking, two essential modules named "Click-Through Rate Prediction" (\textit{CTR}) and "Conversion Rate Prediction" (\textit{CVR}) are included, where \textit{CVR} module is a crucial factor that affects the final purchasing volume directly. However, it is indeed very challenging due to its sparseness nature. In this paper, we tackle this problem by proposing multi-Level Deep Cascade Trees (\textit{ldcTree}), which is a novel decision tree ensemble approach. It leverages deep cascade structures by stacking Gradient Boosting Decision Trees (\textit{GBDT}) to effectively learn feature representation. In addition, we propose to utilize the cross-entropy in each tree of the preceding \textit{GBDT} as the input feature representation for next level \textit{GBDT}, which has a clear explanation, i.e., a traversal from root to leaf nodes in the next level \textit{GBDT} corresponds to the combination of certain traversals in the preceding \textit{GBDT}. The deep cascade structure and the combination rule enable the proposed \textit{ldcTree} to have a stronger distributed feature representation ability. Moreover, inspired by ensemble learning, we propose an Ensemble \textit{ldcTree} (\textit{E-ldcTree}) to encourage the model's diversity and enhance the representation ability further. Finally, we propose an improved Feature learning method based on \textit{EldcTree} (\textit{F-EldcTree}) for taking adequate use of weak and strong correlation features identified by pre-trained \textit{GBDT} models. Experimental results on off-line data set and online deployment demonstrate the effectiveness of the proposed methods.

Submitted 18 November, 2018; v1 submitted 23 May, 2018; originally announced May 2018.

Comments: 8 pages, 5 figures, To appear in AAAI'2019

arXiv:1802.04918 [pdf, other]

Prophit: Causal inverse classification for multiple continuously valued treatment policies

Authors: Michael T. Lash, Qihang Lin, W. Nick Street

Abstract: Inverse classification uses an induced classifier as a queryable oracle to guide test instances towards a preferred posterior class label. The result produced from the process is a set of instance-specific feature perturbations, or recommendations, that optimally improve the probability of the class label. In this work, we adopt a causal approach to inverse classification, eliciting treatment policies (i.e., feature perturbations) for models induced with causal properties. In so doing, we solve a long-standing problem of eliciting multiple, continuously valued treatment policies, using an updated framework and corresponding set of assumptions, which we term the inverse classification potential outcomes framework (ICPOF), along with a new measure, referred to as the individual future estimated effects ($i$FEE). We also develop the approximate propensity score (APS), based on Gaussian processes, to weight treatments, much like the inverse propensity score weighting used in past works. We demonstrate the viability of our methods on student performance.

Submitted 13 February, 2018; originally announced February 2018.

arXiv:1710.05080 [pdf, other]

DSCOVR: Randomized Primal-Dual Block Coordinate Algorithms for Asynchronous Distributed Optimization

Authors: Lin Xiao, Adams Wei Yu, Qihang Lin, Weizhu Chen

Abstract: Machine learning with big data often involves large optimization models. For distributed optimization over a cluster of machines, frequent communication and synchronization of all model parameters (optimization variables) can be very costly. A promising solution is to use parameter servers to store different subsets of the model parameters, and update them asynchronously at different machines using local datasets. In this paper, we focus on distributed optimization of large linear models with convex loss functions, and propose a family of randomized primal-dual block coordinate algorithms that are especially suitable for asynchronous distributed implementation with parameter servers. In particular, we work with the saddle-point formulation of such problems which allows simultaneous data and model partitioning, and exploit its structure by doubly stochastic coordinate optimization with variance reduction (DSCOVR). Compared with other first-order distributed algorithms, we show that DSCOVR may require less amount of overall computation and communication, and less or no synchronization. We discuss the implementation details of the DSCOVR algorithms, and present numerical experiments on an industrial distributed computing system.

Submitted 13 October, 2017; originally announced October 2017.

arXiv:1612.09466 [pdf]

doi 10.1109/TSP.2018.2830317

Double Coupled Canonical Polyadic Decomposition for Joint Blind Source Separation

Authors: Xiao-Feng Gong, Qiu-Hua Lin, Feng-Yu Cong, Lieven De Lathauwer

Abstract: Joint blind source separation (J-BSS) is an emerging data-driven technique for multi-set data-fusion. In this paper, J-BSS is addressed from a tensorial perspective. We show how, by using second-order multi-set statistics in J-BSS, a specific double coupled canonical polyadic decomposition (DC-CPD) problem can be formulated. We propose an algebraic DC-CPD algorithm based on a coupled rank-1 detection mapping. This algorithm converts a possibly underdetermined DC-CPD to a set of overdetermined CPDs. The latter can be solved algebraically via a generalized eigenvalue decomposition based scheme. Therefore, this algorithm is deterministic and returns the exact solution in the noiseless case. In the noisy case, it can be used to effectively initialize optimization based DC-CPD algorithms. In addition, we obtain the deterministic and generic uniqueness conditions for DC-CPD, which are shown to be more relaxed than their CPD counterpart. Experiment results are given to illustrate the superiority of DC-CPD over standard CPD based BSS methods and several existing J-BSS methods, with regards to uniqueness and accuracy.

Submitted 27 April, 2018; v1 submitted 30 December, 2016; originally announced December 2016.

Comments: Accepted by IEEE Transactions on Signal Processing

arXiv:1612.07222 [pdf, other]

Bayesian Decision Process for Cost-Efficient Dynamic Ranking via Crowdsourcing

Authors: Xi Chen, Kevin Jiao, Qihang Lin

Abstract: Rank aggregation based on pairwise comparisons over a set of items has a wide range of applications. Although considerable research has been devoted to the development of rank aggregation algorithms, one basic question is how to efficiently collect a large amount of high-quality pairwise comparisons for the ranking purpose. Because of the advent of many crowdsourcing services, a crowd of workers are often hired to conduct pairwise comparisons with a small monetary reward for each pair they compare. Since different workers have different levels of reliability and different pairs have different levels of ambiguity, it is desirable to wisely allocate the limited budget for comparisons among the pairs of items and workers so that the global ranking can be accurately inferred from the comparison results. To this end, we model the active sampling problem in crowdsourced ranking as a Bayesian Markov decision process, which dynamically selects item pairs and workers to improve the ranking accuracy under a budget constraint. We further develop a computationally efficient sampling policy based on knowledge gradient as well as a moment matching technique for posterior approximation. Experimental evaluations on both synthetic and real data show that the proposed policy achieves high ranking accuracy with a lower labeling cost.

Submitted 21 December, 2016; originally announced December 2016.

Journal ref: Journal of Machine Learning Research 17 (2016) 1-40

arXiv:1611.07100 [pdf, other]

Interpreting Finite Automata for Sequential Data

Authors: Christian Albert Hammerschmidt, Sicco Verwer, Qin Lin, Radu State

Abstract: Automaton models are often seen as interpretable models. Interpretability itself is not well defined: it remains unclear what interpretability means without first explicitly specifying objectives or desired attributes. In this paper, we identify the key properties used to interpret automata and propose a modification of a state-merging approach to learn variants of finite state automata. We apply the approach to problems beyond typical grammar inference tasks. Additionally, we cover several use-cases for prediction, classification, and clustering on sequential data in both supervised and unsupervised scenarios to show how the identified key properties are applicable in a wide range of contexts.

Submitted 24 November, 2016; v1 submitted 21 November, 2016; originally announced November 2016.

Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems

ACM Class: I.2.6

arXiv:1611.06655 [pdf, ps, other]

Sparse Sliced Inverse Regression Via Lasso

Authors: Qian Lin, Zhigen Zhao, Jun S. Liu

Abstract: For multiple index models, it has recently been shown that the sliced inverse regression (SIR) is consistent for estimating the sufficient dimension reduction (SDR) space if and only if $ρ=\lim\frac{p}{n}=0$, where $p$ is the dimension and $n$ is the sample size. Thus, when $p$ is of the same or a higher order of $n$, additional assumptions such as sparsity must be imposed in order to ensure consistency for SIR. By constructing artificial response variables made up from top eigenvectors of the estimated conditional covariance matrix, we introduce a simple Lasso regression method to obtain an estimate of the SDR space. The resulting algorithm, Lasso-SIR, is shown to be consistent and achieve the optimal convergence rate under certain sparsity conditions when $p$ is of order $o(n^2λ^2)$, where $λ$ is the generalized signal-to-noise ratio. We also demonstrate the superior performance of Lasso-SIR compared with existing approaches via extensive numerical studies and several real data examples.

Submitted 17 June, 2018; v1 submitted 21 November, 2016; originally announced November 2016.

Comments: 41 pages, 2 figures

MSC Class: 62J02 (Primary); 62H25 (Secondary)

arXiv:1610.01675 [pdf, other]

doi 10.1137/1.9781611974973.19

Generalized Inverse Classification

Authors: Michael T. Lash, Qihang Lin, W. Nick Street, Jennifer G. Robinson, Jeffrey Ohlmann

Abstract: Inverse classification is the process of perturbing an instance in a meaningful way such that it is more likely to conform to a specific class. Historical methods that address such a problem are often framed to leverage only a single classifier, or specific set of classifiers. These works are often accompanied by naive assumptions. In this work we propose generalized inverse classification (GIC), which avoids restricting the classification model that can be used. We incorporate this formulation into a refined framework in which GIC takes place. Under this framework, GIC operates on features that are immediately actionable. Each change incurs an individual cost, either linear or non-linear. Such changes are subjected to occur within a specified level of cumulative change (budget). Furthermore, our framework incorporates the estimation of features that change as a consequence of direct actions taken (indirectly changeable features). To solve such a problem, we propose three real-valued heuristic-based methods and two sensitivity analysis-based comparison methods, each of which is evaluated on two freely available real-world datasets. Our results demonstrate the validity and benefits of our formulation, framework, and methods.

Submitted 12 January, 2017; v1 submitted 5 October, 2016; originally announced October 2016.

Comments: Accepted to SDM 2017. Full paper + supplemental material

arXiv:1608.03487 [pdf, ps, other]

A Richer Theory of Convex Constrained Optimization with Reduced Projections and Improved Rates

Authors: Tianbao Yang, Qihang Lin, Lijun Zhang

Abstract: This paper focuses on convex constrained optimization problems, where the solution is subject to a convex inequality constraint. In particular, we aim at challenging problems for which both projection into the constrained domain and a linear optimization under the inequality constraint are time-consuming, which render both projected gradient methods and conditional gradient methods (a.k.a. the Frank-Wolfe algorithm) expensive. In this paper, we develop projection reduced optimization algorithms for both smooth and non-smooth optimization with improved convergence rates under a certain regularity condition of the constraint function. We first present a general theory of optimization with only one projection. Its application to smooth optimization with only one projection yields $O(1/ε)$ iteration complexity, which improves over the $O(1/ε^2)$ iteration complexity established before for non-smooth optimization and can be further reduced under strong convexity. Then we introduce a local error bound condition and develop faster algorithms for non-strongly convex optimization at the price of a logarithmic number of projections. In particular, we achieve an iteration complexity of $\widetilde O(1/ε^{2(1-θ)})$ for non-smooth optimization and $\widetilde O(1/ε^{1-θ})$ for smooth optimization, where $θ\in(0,1]$ appearing in the local error bound condition characterizes the functional local growth rate around the optimal solutions. Novel applications in solving the constrained $\ell_1$ minimization problem and a positive semi-definite constrained distance metric learning problem demonstrate that the proposed algorithms achieve significant speed-up compared with previous algorithms.

Submitted 12 June, 2017; v1 submitted 11 August, 2016; originally announced August 2016.

Comments: This is the long version of our ICML 2017 paper

arXiv:1607.03815 [pdf, ps, other]

Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than $O(1/ε)$

Authors: Yi Xu, Yan Yan, Qihang Lin, Tianbao Yang

Abstract: In this paper, we develop a novel {\bf ho}moto{\bf p}y {\bf s}moothing (HOPS) algorithm for solving a family of non-smooth problems that is composed of a non-smooth term with an explicit max-structure and a smooth term or a simple non-smooth term whose proximal mapping is easy to compute. The best known iteration complexity for solving such non-smooth optimization problems is $O(1/ε)$ without any assumption on the strong convexity. In this work, we will show that the proposed HOPS achieved a lower iteration complexity of $\widetilde O(1/ε^{1-θ})$\footnote{$\widetilde O()$ suppresses a logarithmic factor.} with $θ\in(0,1]$ capturing the local sharpness of the objective function around the optimal solutions. To the best of our knowledge, this is the lowest iteration complexity achieved so far for the considered non-smooth optimization problems without strong convexity assumption. The HOPS algorithm employs Nesterov's smoothing technique and Nesterov's accelerated gradient method and runs in stages, which gradually decreases the smoothing parameter in a stage-wise manner until it yields a sufficiently good approximation of the original function. We show that HOPS enjoys a linear convergence for many well-known non-smooth problems (e.g., empirical risk minimization with a piece-wise linear loss function and $\ell_1$ norm regularizer, finding a point in a polyhedron, cone programming, etc). Experimental results verify the effectiveness of HOPS in comparison with Nesterov's smoothing algorithm and the primal-dual style of first-order methods.

Submitted 3 November, 2016; v1 submitted 13 July, 2016; originally announced July 2016.

Comments: This is a long version of the paper accepted by NIPS 2016

arXiv:1607.01027 [pdf, ps, other]

Accelerate Stochastic Subgradient Method by Leveraging Local Growth Condition

Authors: Yi Xu, Qihang Lin, Tianbao Yang

Abstract: In this paper, a new theory is developed for first-order stochastic convex optimization, showing that the global convergence rate is sufficiently quantified by a local growth rate of the objective function in a neighborhood of the optimal solutions. In particular, if the objective function $F(\mathbf w)$ in the $ε$-sublevel set grows as fast as $\|\mathbf w - \mathbf w_*\|_2^{1/θ}$, where $\mathbf w_*$ represents the closest optimal solution to $\mathbf w$ and $θ\in(0,1]$ quantifies the local growth rate, the iteration complexity of first-order stochastic optimization for achieving an $ε$-optimal solution can be $\widetilde O(1/ε^{2(1-θ)})$, which is optimal at most up to a logarithmic factor. To achieve the faster global convergence, we develop two different accelerated stochastic subgradient methods by iteratively solving the original problem approximately in a local region around a historical solution with the size of the local region gradually decreasing as the solution approaches the optimal set. Besides the theoretical improvements, this work also includes new contributions towards making the proposed algorithms practical: (i) we present practical variants of accelerated stochastic subgradient methods that can run without the knowledge of multiplicative growth constant and even the growth rate $θ$; (ii) we consider a broad family of problems in machine learning to demonstrate that the proposed algorithms enjoy faster convergence than traditional stochastic subgradient method. We also characterize the complexity of the proposed algorithms for ensuring the gradient is small without the smoothness assumption.

Submitted 5 May, 2020; v1 submitted 4 July, 2016; originally announced July 2016.

Showing 1–50 of 66 results for author: Lin, Q