Interpret the estimand framework from a causal inference perspective

Abstract: The estimand framework proposed by ICH in 2017 has brought fundamental changes in the pharmaceutical industry. It clearly describes how a treatment effect in a clinical question should be precisely defined and estimated, through attributes including treatments, endpoints and intercurrent events. However, ideas around the estimand framework are commonly expressed in text, and different interpretations of this framework may exist. This article aims to interpret the estimand framework through its underlying theories, the causal inference framework based on potential outcomes. The statistical origin and formula of an estimand are given through the causal inference framework, with all attributes translated into statistical terms. How five strategies proposed by ICH to analyze intercurrent events are incorporated in the statistical formula of an estimand is described, and a new strategy to analyze intercurrent events is also suggested. The roles of target populations and analysis sets in the estimand framework are compared and discussed based on the statistical formula of an estimand. This article recommends continuing study of causal inference theories behind the estimand framework and improving the estimand framework with greater methodological comprehensibility and availability.

Submitted 28 June, 2024; originally announced July 2024.

arXiv:2302.14505 [pdf]

Nonlinear regression models to forecast PM$_{2.5}$ concentration in Wuhan, China

Authors: Jinghong Zeng

Abstract: Forecasting PM$_{2.5}$ concentration is important to solving air pollution problems in Wuhan. This paper proposes a PM$_{2.5}$ concentration forecast model based on nonlinear regression, including a single-value forecast model and an interval forecast model. The single-value forecast model can precisely forecast PM$_{2.5}$ concentration for the next day, with a forecast bias of about 6 $\mu g/m^3$ in goodness-of-fit analysis. The interval forecast model can efficiently forecast high-concentration and low-concentration days, and covers 60%-80% of observed samples in model validation. Moreover, this paper combines the PM$_{2.5}$ concentration forecast model with NCEP Climate Forecast System Version 2 to realize its forecast application, then develops NCEP CFS2's PM$_{2.5}$ concentration forecast model to enhance forecast accuracy. The results indicate that the PM$_{2.5}$ concentration forecast model has good capacity for independent forecasting.

Submitted 28 February, 2023; originally announced February 2023.

Comments: In Chinese, supervised by Yurong Chen

arXiv:2302.14469 [pdf, other]

Bayesian inference on average treatment effects in the PreventS trial data in the presence of unmeasured confounding

Authors: Jinghong Zeng

Abstract: Using the PreventS trial data, our objective is to estimate average effects of a Health Wellness Coaching (HWC) intervention on improvement of cardiovascular health at 9 months post randomization and in three consecutive 3-month periods over 9 months post randomization. Conventional approaches, including instrumental variable models, are not applicable in the presence of multiple correlated multivalued exposures and unmeasured confounding. We propose a causal framework and its Bayesian modelling procedures to identify and estimate average effects of one or multiple multivalued exposures on one outcome in the presence of unmeasured confounding, noncompliance and missing data, in a two-arm randomized trial. We also propose estimation methods of unmeasured confounders, where the exposure and outcome distributions are conditional on unmeasured confounders and then unmeasured confounders are imputed as completely missing variables. Several types of model non-identifiability and possible solutions are described. There is a risk that estimation methods of unmeasured confounders can fail when multiple contradictory posterior solutions are produced. The random intercept outcome models that only adjust for unmeasured confounding in the outcome distribution are proposed as a good surrogate causal model in this case, and they need further development. There is evidence that the HWC intervention is beneficial to cardiovascular health at 9 months post randomization. On average, completing one HWC session improves the Life's Simple Seven total score by 0.16 (0.09, 0.22) and reduces systolic blood pressure by 0.54 (0.19, 0.90) mm Hg. There is also evidence that the HWC intervention has a larger beneficial effect on cardiovascular health during 3 months post randomization. There is no clear evidence that the HWC intervention benefits or harms mental health. The complete abstract is in the article.

Submitted 28 February, 2023; originally announced February 2023.

Comments: Supervised by Alain C. Vandal

arXiv:2210.11834 [pdf, other]

Optimal Contextual Bandits with Knapsacks under Realizability via Regression Oracles

Authors: Yuxuan Han, Jialin Zeng, Yang Wang, Yang Xiang, Jiheng Zhang

Abstract: We study the stochastic contextual bandit with knapsacks (CBwK) problem, where each action, taken upon a context, not only leads to a random reward but also costs a random resource consumption in a vector form. The challenge is to maximize the total reward without violating the budget for each resource. We study this problem under a general realizability setting where the expected reward and expected cost are functions of contexts and actions in some given general function classes $\mathcal{F}$ and $\mathcal{G}$, respectively. Existing works on CBwK are restricted to the linear function class since they use UCB-type algorithms, which heavily rely on the linear form and thus are difficult to extend to general function classes. Motivated by online regression oracles that have been successfully applied to contextual bandits, we propose the first universal and optimal algorithmic framework for CBwK by reducing it to online regression. We also establish the lower regret bound to show the optimality of our algorithm for a variety of function classes.

Submitted 22 February, 2023; v1 submitted 21 October, 2022; originally announced October 2022.

Comments: AISTATS2023

arXiv:2209.05742 [pdf, other]

doi 10.1109/TPAMI.2022.3190939

A Tale of HodgeRank and Spectral Method: Target Attack Against Rank Aggregation Is the Fixed Point of Adversarial Game

Authors: Ke Ma, Qianqian Xu, Jinshan Zeng, Guorong Li, Xiaochun Cao, Qingming Huang

Abstract: Rank aggregation with pairwise comparisons has shown promising results in elections, sports competitions, recommendations, and information retrieval. However, little attention has been paid to the security issue of such algorithms, in contrast to the numerous research works on their computational and statistical characteristics. Driven by huge profits, the potential adversary has strong motivation and incentives to manipulate the ranking list. Meanwhile, the intrinsic vulnerability of the rank aggregation methods is not well studied in the literature. To fully understand the possible risks, we focus in this paper on the purposeful adversary who desires to designate the aggregated results by modifying the pairwise data. From the perspective of the dynamical system, the attack behavior with a target ranking list is a fixed point belonging to the composition of the adversary and the victim. To perform the targeted attack, we formulate the interaction between the adversary and the victim as a game-theoretic framework consisting of two continuous operators while Nash equilibrium is established. Then two procedures against HodgeRank and RankCentrality are constructed to produce the modification of the original data. Furthermore, we prove that the victims will produce the target ranking list once the adversary masters the complete information. It is noteworthy that the proposed methods allow the adversary only to hold incomplete information or imperfect feedback and perform the purposeful attack. The effectiveness of the suggested target attack strategies is demonstrated by a series of toy simulations and several real-world data experiments. These experimental results show that the proposed methods could achieve the attacker's goal in the sense that the leading candidate of the perturbed ranking list is the one designated by the adversary.

Submitted 13 September, 2022; originally announced September 2022.

Comments: 33 pages, https://github.com/alphaprime/Target_Attack_Rank_Aggregation

Journal ref: Early Access by TPAMI 2022 (https://ieeexplore.ieee.org/document/9830042)

arXiv:2209.00869 [pdf, other]

A Survey of Causal Inference Frameworks

Authors: Jingying Zeng, Run Wang

Abstract: Causal inference is a science with multi-disciplinary evolution and applications. On the one hand, it measures effects of treatments in observational data based on experimental designs and rigorous statistical inference to draw causal statements. One of the most influential frameworks in quantifying causal effects is the potential outcomes framework. On the other hand, causal graphical models utilize directed edges to represent causalities and encode conditional independence relationships among variables in the graphs. A series of research has been done both in reading off conditional independencies from graphs and in reconstructing causal structures. In recent years, state-of-the-art research in causal inference has started unifying the different causal inference frameworks. This survey aims to provide a review of the past work on causal inference, focusing mainly on the potential outcomes framework and causal graphical models. We hope that this survey will help accelerate the understanding of causal inference in different domains.

Submitted 2 September, 2022; originally announced September 2022.

arXiv:2207.12630 [pdf, other]

Bayesian Causal Inference in Sequentially Randomized Experiments with Noncompliance

Authors: Jingying Zeng

Abstract: Scientific researchers utilize randomized experiments to draw causal statements. Most early studies as well as current work on experiments with sequential intervention decisions have focused on estimating the causal effects among sequential treatments, ignoring the non-compliance issue that experimental units might not comply with the treatment assignments to which they were originally allocated. A series of methodologies have been developed to address non-compliance in randomized experiments with time-fixed treatment. However, to the best of our knowledge, there is little literature on non-compliance in sequential experiment settings. In this paper, we go beyond the traditional methods using per-protocol, as-treated, or intention-to-treat analysis and propose a latent mixture Bayesian framework to estimate the sample-average treatment effect in sequential experiments with non-compliance concerns.

Submitted 25 July, 2022; originally announced July 2022.

arXiv:2207.11932 [pdf, other]

Semiparametric Estimation on Multi-treatment Causal Effects via Cross-Fitting

Authors: Jingying Zeng

Abstract: Causal inference is a critical research area with multi-disciplinary origins and applications, ranging from statistics, computer science, economics, and psychology to public health. In many areas of scientific research, randomized experiments have provided a gold standard for estimation of causal effects for decades. However, in many situations, randomized experiments are not feasible in practice, so practitioners need to rely on empirical investigation for causal reasoning. Causal inference via observational data is a challenging task since the knowledge of the treatment assignment mechanism is missing, which typically requires non-testable assumptions to make the inference possible. For several years, great effort has been devoted to the research of causal inference for binary treatments. In practice, it is also common to use observational data on multiple treatment comparisons. Within the potential outcomes framework, we propose a generalized cross-fitting estimator (GCF), which generalizes the doubly robust estimator with cross-fitting for binary treatment to multiple treatment comparisons, and we provide rigorous proofs of its statistical properties. This estimator permits the use of more flexible machine learning methods to model the nuisance parts and is based on relatively weak assumptions, while there is still a theoretical guarantee for valid statistical inference. We show the asymptotic properties of the GCF estimators, and provide the asymptotic simultaneous confidence intervals that achieve the semiparametric efficiency bound for average treatment effect. The performance of the estimator is assessed through a simulation study based on the common evaluation metrics generally considered in the causal inference literature.

Submitted 25 July, 2022; originally announced July 2022.

arXiv:2101.11190 [pdf, other]

Boost-S: Gradient Boosted Trees for Spatial Data and Its Application to FDG-PET Imaging Data

Authors: Reza Iranzad, Xiao Liu, W. Art Chaovalitwongse, Daniel S. Hippe, Shouyi Wang, Jie Han, Phawis Thammasorn, Chunyan Duan, Jing Zeng, Stephen R. Bowen

Abstract: Boosting Trees are one of the most successful statistical learning approaches that involve sequentially growing an ensemble of simple regression trees (i.e., "weak learners"). However, gradient boosted trees are not yet available for spatially correlated data. This paper proposes a new gradient Boosted Trees algorithm for Spatial Data (Boost-S) with covariate information. Boost-S integrates the spatial correlation structure into the classical framework of gradient boosted trees. Each tree is grown by solving a regularized optimization problem, where the objective function involves two penalty terms on tree complexity and takes into account the underlying spatial correlation. A computationally-efficient algorithm is proposed to obtain the ensemble trees. The proposed Boost-S is applied to the spatially-correlated FDG-PET (fluorodeoxyglucose-positron emission tomography) imaging data collected during cancer chemoradiotherapy. Our numerical investigations successfully demonstrate the advantages of the proposed Boost-S over existing approaches for this particular application.

Submitted 3 February, 2021; v1 submitted 26 January, 2021; originally announced January 2021.

arXiv:2009.05923 [pdf, other]

Contrastive Self-supervised Learning for Graph Classification

Authors: Jiaqi Zeng, Pengtao Xie

Abstract: Graph classification is a widely studied problem and has broad applications. In many real-world problems, the number of labeled graphs available for training classification models is limited, which renders these models prone to overfitting. To address this problem, we propose two approaches based on contrastive self-supervised learning (CSSL) to alleviate overfitting. In the first approach, we use CSSL to pretrain graph encoders on widely-available unlabeled graphs without relying on human-provided labels, then finetune the pretrained encoders on labeled graphs. In the second approach, we develop a regularizer based on CSSL, and solve the supervised classification task and the unsupervised CSSL task simultaneously. To perform CSSL on graphs, given a collection of original graphs, we perform data augmentation to create augmented graphs out of the original graphs. An augmented graph is created by consecutively applying a sequence of graph alteration operations. A contrastive loss is defined to learn graph encoders by judging whether two augmented graphs are from the same original graph. Experiments on various graph classification datasets demonstrate the effectiveness of our proposed methods.

Submitted 13 September, 2020; originally announced September 2020.

arXiv:2009.02528 [pdf, other]

Structured Sparsity Modeling for Improved Multivariate Statistical Analysis based Fault Isolation

Authors: Wei Chen, Jiusun Zeng, Xiaobin Xu, Shihua Luo, Chuanhou Gao

Abstract: In order to improve the fault diagnosis capability of multivariate statistical methods, this article introduces a fault isolation framework based on structured sparsity modeling. The developed method relies on reconstruction-based contribution analysis, and the process structure information can be incorporated into the reconstruction objective function in the form of structured sparsity regularization terms. The structured sparsity terms allow selection of fault variables over structures like blocks or networks of process variables, hence more accurate fault isolation can be achieved. Four structured sparsity terms corresponding to different kinds of process information are considered, namely, partially known sparse support, block sparsity, clustered sparsity and tree-structured sparsity. The optimization problems involving the structured sparsity terms can be solved using the Alternating Direction Method of Multipliers (ADMM) algorithm, which is fast and efficient. Through a simulation example and an application study to a coal-fired power plant, it is verified that the proposed method can better isolate faulty variables by incorporating process structure information.

Submitted 21 December, 2020; v1 submitted 5 September, 2020; originally announced September 2020.

Comments: 36 pages, 12 figures

arXiv:2009.02517 [pdf, other]

Uncertainty modelling and computational aspects of data association

Authors: Jeremie Houssineau, Jiajie Zeng, Ajay Jasra

Abstract: A novel solution to the smoothing problem for multi-object dynamical systems is proposed and evaluated. The systems of interest contain an unknown and varying number of dynamical objects that are partially observed under noisy and corrupted observations. An alternative representation of uncertainty is considered in order to account for the lack of information about the different aspects of this type of complex system. The corresponding statistical model can be formulated as a hierarchical model consisting of conditionally-independent hidden Markov models. This particular structure is leveraged to propose an efficient method in the context of Markov chain Monte Carlo (MCMC) by relying on an approximate solution to the corresponding filtering problem, in a similar fashion to particle MCMC. This approach is shown to outperform existing algorithms in a range of scenarios.

Submitted 5 September, 2020; originally announced September 2020.

arXiv:2008.03733 [pdf, other]

Generalized Liquid Association Analysis for Multimodal Data Integration

Authors: Lexin Li, Jing Zeng, Xin Zhang

Abstract: Multimodal data are now prevailing in scientific research. A central question in multimodal integrative analysis is to understand how two data modalities associate and interact with each other given another modality or demographic variables. The problem can be formulated as studying the associations among three sets of random variables, a question that has received relatively less attention in the literature. In this article, we propose a novel generalized liquid association analysis method, which offers a new and unique angle to this important class of problems of studying three-way associations. We extend the notion of liquid association of Li (2002) from the univariate setting to the sparse, multivariate, and high-dimensional setting. We establish a population dimension reduction model, transform the problem to sparse Tucker decomposition of a three-way tensor, and develop a higher-order orthogonal iteration algorithm for parameter estimation. We derive the non-asymptotic error bound and asymptotic consistency of the proposed estimator, while allowing the variable dimensions to be larger than and diverge with the sample size. We demonstrate the efficacy of the method through both simulations and a multimodal neuroimaging application for Alzheimer's disease research.

Submitted 24 April, 2021; v1 submitted 9 August, 2020; originally announced August 2020.

arXiv:2007.02010 [pdf, other]

DessiLBI: Exploring Structural Sparsity of Deep Networks via Differential Inclusion Paths

Authors: Yanwei Fu, Chen Liu, Donghao Li, Xinwei Sun, Jinshan Zeng, Yuan Yao

Abstract: Over-parameterization is ubiquitous nowadays in training neural networks to benefit both optimization in seeking global optima and generalization in reducing prediction error. However, compressive networks are desired in many real world applications and direct training of small networks may be trapped in local optima. In this paper, instead of pruning or distilling over-parameterized models to compressive ones, we propose a new approach based on differential inclusions of inverse scale spaces. Specifically, it generates a family of models from simple to complex ones that couples a pair of parameters to simultaneously train over-parameterized deep models and structural sparsity on weights of fully connected and convolutional layers. Such a differential inclusion scheme has a simple discretization, proposed as Deep structurally splitting Linearized Bregman Iteration (DessiLBI), whose global convergence analysis in deep learning establishes that, from any initialization, algorithmic iterations converge to a critical point of empirical risks. Experimental evidence shows that DessiLBI achieves comparable and even better performance than the competitive optimizers in exploring the structural sparsity of several widely used backbones on the benchmark datasets. Remarkably, with early stopping, DessiLBI unveils "winning tickets" in early epochs: the effective sparse structure with comparable test accuracy to fully trained over-parameterized models.

Submitted 4 July, 2020; originally announced July 2020.

Comments: conference, 23 pages, https://github.com/corwinliu9669/dS2LBI. arXiv admin note: text overlap with arXiv:1905.09449

Journal ref: ICML 2020

arXiv:2005.07916 [pdf]

doi 10.1063/5.0042868

Deep-learning of Parametric Partial Differential Equations from Sparse and Noisy Data

Authors: Hao Xu, Dongxiao Zhang, Junsheng Zeng

Abstract: Data-driven methods have recently made great progress in the discovery of partial differential equations (PDEs) from spatial-temporal data. However, several challenges remain to be solved, including sparse noisy data, incomplete candidate libraries, and spatially- or temporally-varying coefficients. In this work, a new framework, which combines a neural network, a genetic algorithm and adaptive methods, is put forward to address all of these challenges simultaneously. In the framework, a trained neural network is utilized to calculate derivatives and generate a large amount of meta-data, which solves the problem of sparse noisy data. Next, a genetic algorithm is utilized to discover the form of PDEs and corresponding coefficients with an incomplete candidate library. Finally, a two-step adaptive method is introduced to discover parametric PDEs with spatially- or temporally-varying coefficients. In this method, the structure of a parametric PDE is first discovered, and then the general form of varying coefficients is identified. The proposed algorithm is tested on the Burgers equation, the convection-diffusion equation, the wave equation, and the KdV equation. The results demonstrate that this method is robust to sparse and noisy data, and is able to discover parametric PDEs with an incomplete candidate library.

Submitted 16 May, 2020; originally announced May 2020.

Comments: 30 pages, 6 figures, and 7 tables

Journal ref: Phys. Fluids, 33, 037132, 10.1063/5.0042868, 2021

arXiv:2004.03329 [pdf, other]

MedDialog: Two Large-scale Medical Dialogue Datasets

Authors: Xuehai He, Shu Chen, Zeqian Ju, Xiangyu Dong, Hongchao Fang, Sicheng Wang, Yue Yang, Jiaqi Zeng, Ruisi Zhang, Ruoyu Zhang, Meng Zhou, Penghui Zhu, Pengtao Xie

Abstract: Medical dialogue systems are promising in assisting in telemedicine to increase access to healthcare services, improve the quality of patient care, and reduce medical costs. To facilitate the research and development of medical dialogue systems, we build two large-scale medical dialogue datasets: MedDialog-EN and MedDialog-CN. MedDialog-EN is an English dataset containing 0.3 million conversations between patients and doctors and 0.5 million utterances. MedDialog-CN is a Chinese dataset containing 1.1 million conversations and 4 million utterances. To the best of our knowledge, MedDialog-(EN,CN) are the largest medical dialogue datasets to date. The dataset is available at https://github.com/UCSD-AI4H/Medical-Dialogue-System

Submitted 7 July, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

arXiv:2004.00179 [pdf, other]

Fully-Corrective Gradient Boosting with Squared Hinge: Fast Learning Rates and Early Stopping

Authors: Jinshan Zeng, Min Zhang, Shao-Bo Lin

Abstract: Boosting is a well-known method for improving the accuracy of weak learners in machine learning. However, its theoretical generalization guarantee is missing in the literature. In this paper, we propose an efficient boosting method with theoretical generalization guarantees for binary classification. Three key ingredients of the proposed boosting method are: a) the fully-corrective greedy (FCG) update in the boosting procedure, b) a differentiable squared hinge (also called truncated quadratic) function as the loss function, and c) an efficient alternating direction method of multipliers (ADMM) algorithm for the associated FCG optimization. The squared hinge loss not only inherits the robustness of the well-known hinge loss for classification with outliers, but also brings some benefits for computational implementation and theoretical justification. Under some sparseness assumption, we derive a fast learning rate of the order $\mathcal{O}((m/\log m)^{-1/4})$ for the proposed boosting method, which can be further improved to $\mathcal{O}((m/\log m)^{-1/2})$ if a certain additional noise assumption is imposed, where $m$ is the size of the sample set. Both derived learning rates are the best ones among the existing generalization results of boosting-type methods for classification. Moreover, an efficient early stopping scheme is provided for the proposed method. A series of toy simulations and real data experiments are conducted to verify the developed theories and demonstrate the effectiveness of the proposed method.

Submitted 31 March, 2020; originally announced April 2020.

Comments: 14 pages

arXiv:2002.12135 [pdf, other]

Block Hankel Tensor ARIMA for Multiple Short Time Series Forecasting

Authors: Qiquan Shi, Jiaming Yin, Jiajun Cai, Andrzej Cichocki, Tatsuya Yokota, Lei Chen, Mingxuan Yuan, Jia Zeng

Abstract: This work proposes a novel approach for multiple time series forecasting. At first, multi-way delay embedding transform (MDT) is employed to represent time series as low-rank block Hankel tensors (BHT). Then, the higher-order tensors are projected to compressed core tensors by applying Tucker decomposition. At the same time, the generalized tensor Autoregressive Integrated Moving Average (ARIMA) is explicitly used on consecutive core tensors to predict future samples. In this manner, the proposed approach tactically incorporates the unique advantages of MDT tensorization (to exploit mutual correlations) and tensor ARIMA coupled with low-rank Tucker decomposition into a unified framework. This framework exploits the low-rank structure of block Hankel tensors in the embedded space and captures the intrinsic correlations among multiple TS, which thus can improve the forecasting results, especially for multiple short time series. Experiments conducted on three public datasets and two industrial datasets verify that the proposed BHT-ARIMA effectively improves forecasting accuracy and reduces computational cost compared with the state-of-the-art methods.

Submitted 25 February, 2020; originally announced February 2020.

Comments: Accepted by AAAI 2020

arXiv:1912.04521 [pdf, other]

Transfer Learning-Based Outdoor Position Recovery with Telco Data

Authors: Yige Zhang, Aaron Yi Ding, Jorg Ott, Mingxuan Yuan, Jia Zeng, Kun Zhang, Weixiong Rao

Abstract: Telecommunication (Telco) outdoor position recovery aims to localize outdoor mobile devices by leveraging measurement report (MR) data. Unfortunately, Telco position recovery requires sufficient amount of MR samples across different areas and suffers from high data collection cost. For an area with scarce MR samples, it is hard to achieve good accuracy. In this paper, by leveraging the recently developed transfer learning techniques, we design a novel Telco position recovery framework, called TLoc, to transfer good models in the carefully selected source domains (those fine-grained small subareas) to a target one which originally suffers from poor localization accuracy. Specifically, TLoc introduces three dedicated components: 1) a new coordinate space to divide an area of interest into smaller domains, 2) a similarity measurement to select best source domains, and 3) an adaptation of an existing transfer learning approach. To the best of our knowledge, TLoc is the first framework that demonstrates the efficacy of applying transfer learning in the Telco outdoor position recovery. To exemplify, on the 2G GSM and 4G LTE MR datasets in Shanghai, TLoc outperforms a nontransfer approach by 27.58% and 26.12% less median errors, and further leads to 47.77% and 49.22% less median errors than a recent fingerprinting approach NBL.

Submitted 10 December, 2019; originally announced December 2019.

arXiv:1912.00362 [pdf, other]

doi 10.1109/TKDE.2019.2956700

Fast Stochastic Ordinal Embedding with Variance Reduction and Adaptive Step Size

Authors: Ke Ma, Jinshan Zeng, Qianqian Xu, Xiaochun Cao, Wei Liu, Yuan Yao

Abstract: Learning representation from relative similarity comparisons, often called ordinal embedding, gains rising attention in recent years. Most of the existing methods are based on semi-definite programming (\textit{SDP}), which is generally time-consuming and degrades the scalability, especially confronting large-scale data. To overcome this challenge, we propose a stochastic algorithm called \textit{SVRG-SBB}, which has the following features: i) achieving good scalability via dropping positive semi-definite (\textit{PSD}) constraints as serving a fast algorithm, i.e., stochastic variance reduced gradient (\textit{SVRG}) method, and ii) adaptive learning via introducing a new, adaptive step size called the stabilized Barzilai-Borwein (\textit{SBB}) step size. Theoretically, under some natural assumptions, we show the $\boldsymbol{O}(\frac{1}{T})$ rate of convergence to a stationary point of the proposed algorithm, where $T$ is the number of total iterations. Under the further Polyak-Łojasiewicz assumption, we can show the global linear convergence (i.e., exponentially fast converging to a global optimum) of the proposed algorithm. Numerous simulations and real-world data experiments are conducted to show the effectiveness of the proposed algorithm by comparing with the state-of-the-art methods, notably, much lower computational cost with good prediction performance.

Submitted 1 December, 2019; originally announced December 2019.

Comments: 19 pages, 5 figures, accepted by IEEE Transaction on Knowledge and Data Engineering, Conference Version: arXiv:1711.06446

arXiv:1911.10558 [pdf, other]

Fast Polynomial Kernel Classification for Massive Data

Authors: Jinshan Zeng, Minrun Wu, Shao-Bo Lin, Ding-Xuan Zhou

Abstract: In the era of big data, it is desired to develop efficient machine learning algorithms to tackle massive data challenges such as storage bottleneck, algorithmic scalability, and interpretability. In this paper, we develop a novel efficient classification algorithm, called fast polynomial kernel classification (FPC), to conquer the scalability and storage challenges. Our main tools are a suitable selected feature mapping based on polynomial kernels and an alternating direction method of multipliers (ADMM) algorithm for a related non-smooth convex optimization problem. Fast learning rates as well as feasibility verifications including the efficiency of an ADMM solver with convergence guarantees and the selection of center points are established to justify theoretical behaviors of FPC. Our theoretical assertions are verified by a series of simulations and real data applications. Numerical results demonstrate that FPC significantly reduces the computational burden and storage memory of existing learning schemes such as support vector machines, Nyström and random feature methods, without sacrificing their generalization abilities much.

Submitted 11 November, 2022; v1 submitted 24 November, 2019; originally announced November 2019.

Comments: arXiv admin note: text overlap with arXiv:1402.4735 by other authors

arXiv:1905.09449 [pdf, other]

Exploring Structural Sparsity of Deep Networks via Inverse Scale Spaces

Authors: Yanwei Fu, Chen Liu, Donghao Li, Zuyuan Zhong, Xinwei Sun, Jinshan Zeng, Yuan Yao

Abstract: The great success of deep neural networks is built upon their over-parameterization, which smooths the optimization landscape without degrading the generalization ability. Despite the benefits of over-parameterization, a huge amount of parameters makes deep networks cumbersome in daily life applications. Though techniques such as pruning and distillation are developed, they are expensive in fully training a dense network as backward selection methods, and there is still a void on systematically exploring forward selection methods for learning structural sparsity in deep networks. To fill in this gap, this paper proposes a new approach based on differential inclusions of inverse scale spaces, which generate a family of models from simple to complex ones along the dynamics via coupling a pair of parameters, such that over-parameterized deep models and their structural sparsity can be explored simultaneously. This kind of differential inclusion scheme has a simple discretization, dubbed Deep structure splitting Linearized Bregman Iteration (DessiLBI), whose global convergence in learning deep networks could be established under the Kurdyka-Lojasiewicz framework. Experimental evidence shows that our method achieves comparable and even better performance than the competitive optimizers in exploring the sparse structure of several widely used backbones on the benchmark datasets.
Remarkably, with early stopping, our method unveils `winning tickets' in early epochs: the effective sparse network structures with comparable test accuracy to fully trained over-parameterized models, that are further transferable to similar alternative tasks. Furthermore, our method is able to grow networks efficiently with adaptive filter configurations, demonstrating a good performance with much less computational cost. Codes and models can be downloaded at {https://github.com/DessiLBI2020/DessiLBI}.

Submitted 21 April, 2022; v1 submitted 22 May, 2019; originally announced May 2019.

Comments: This is the journal extension version of the ICML conference paper, "DessiLBI: Exploring Structural Sparsity of Deep Networks via Differential Inclusion Paths"

Journal ref: International Conference on Machine Learning. PMLR, 2020, pp. 3315--3326

arXiv:1902.02060 [pdf, other]

On ADMM in Deep Learning: Convergence and Saturation-Avoidance

Authors: Jinshan Zeng, Shao-Bo Lin, Yuan Yao, Ding-Xuan Zhou

Abstract: In this paper, we develop an alternating direction method of multipliers (ADMM) for deep neural networks training with sigmoid-type activation functions (called \textit{sigmoid-ADMM pair}), mainly motivated by the gradient-free nature of ADMM in avoiding the saturation of sigmoid-type activations and the advantages of deep neural networks with sigmoid-type activations (called deep sigmoid nets) over their rectified linear unit (ReLU) counterparts (called deep ReLU nets) in terms of approximation. In particular, we prove that the approximation capability of deep sigmoid nets is not worse than that of deep ReLU nets by showing that ReLU activation function can be well approximated by deep sigmoid nets with two hidden layers and finitely many free parameters but not vice versa. We also establish the global convergence of the proposed ADMM for the nonlinearly constrained formulation of the deep sigmoid nets training from arbitrary initial points to a Karush-Kuhn-Tucker (KKT) point at a rate of order ${\cal O}(1/k)$. Besides sigmoid activation, such a convergence theorem holds for a general class of smooth activations. Compared with the widely used stochastic gradient descent (SGD) algorithm for the deep ReLU nets training (called ReLU-SGD pair), the proposed sigmoid-ADMM pair is practically stable with respect to the algorithmic hyperparameters including the learning rate, initial schemes and the preprocessing of the input data.
Moreover, we find that to approximate and learn simple but important functions the proposed sigmoid-ADMM pair numerically outperforms the ReLU-SGD pair.

Submitted 15 September, 2021; v1 submitted 6 February, 2019; originally announced February 2019.

Comments: This is a revised version of our previous one entitled "A Convergence Analysis of Nonlinearly Constrained ADMM in Deep Learning, arXiv:1902.02060" with some significant changes

Journal ref: Journal of Machine Learning Research 22 (2021) 1-67

arXiv:1811.12535 [pdf, other]

The Relevance of Bayesian Layer Positioning to Model Uncertainty in Deep Bayesian Active Learning

Authors: Jiaming Zeng, Adam Lesnikowski, Jose M. Alvarez

Abstract: One of the main challenges of deep learning tools is their inability to capture model uncertainty. While Bayesian deep learning can be used to tackle the problem, Bayesian neural networks often require more time and computational power to train than deterministic networks. Our work explores whether fully Bayesian networks are needed to successfully capture model uncertainty. We vary the number and position of Bayesian layers in a network and compare their performance on active learning with the MNIST dataset. We found that we can fully capture the model uncertainty by using only a few Bayesian layers near the output of the network, combining the advantages of deterministic and Bayesian networks.

Submitted 29 November, 2018; originally announced November 2018.

Journal ref: Third workshop on Bayesian Deep Learning (NeurIPS 2018)

arXiv:1808.03425 [pdf, other]

doi 10.1103/PhysRevA.99.052306

Learning and Inference on Generative Adversarial Quantum Circuits

Authors: Jinfeng Zeng, Yufeng Wu, Jin-Guo Liu, Lei Wang, Jiangping Hu

Abstract: Quantum mechanics is inherently probabilistic in light of Born's rule. Using quantum circuits as probabilistic generative models for classical data exploits their superior expressibility and efficient direct sampling ability. However, training of quantum circuits can be more challenging compared to classical neural networks due to lack of efficient differentiable learning algorithm. We devise an adversarial quantum-classical hybrid training scheme via coupling a quantum circuit generator and a classical neural network discriminator together. After training, the quantum circuit generative model can infer missing data with quadratic speed up via amplitude amplification. We numerically simulate the learning and inference of generative adversarial quantum circuit using the prototypical Bars-and-Stripes dataset. Generative adversarial quantum circuits is a fresh approach to machine learning which may enjoy the practically useful quantum advantage on near-term quantum devices.

Submitted 10 August, 2018; originally announced August 2018.

Comments: 7 pages, 6 figures

Journal ref: Phys. Rev. A 99, 052306 (2019)

arXiv:1803.09082 [pdf, other]

A Proximal Block Coordinate Descent Algorithm for Deep Neural Network Training

Authors: Tim Tsz-Kit Lau, Jinshan Zeng, Baoyuan Wu, Yuan Yao

Abstract: Training deep neural networks (DNNs) efficiently is a challenge due to the associated highly nonconvex optimization. The backpropagation (backprop) algorithm has long been the most widely used algorithm for gradient computation of parameters of DNNs and is used along with gradient descent-type algorithms for this optimization task. Recent work has shown the efficiency of block coordinate descent (BCD) type methods empirically for training DNNs. In view of this, we propose a novel algorithm based on the BCD method for training DNNs and provide its global convergence results built upon the powerful framework of the Kurdyka-Lojasiewicz (KL) property. Numerical experiments on standard datasets demonstrate its competitive efficiency against standard optimizers with backprop.

Submitted 24 March, 2018; originally announced March 2018.

Comments: The 6th International Conference on Learning Representations (ICLR 2018), Workshop Track

arXiv:1803.00225 [pdf, other]

Global Convergence of Block Coordinate Descent in Deep Learning

Authors: Jinshan Zeng, Tim Tsz-Kit Lau, Shaobo Lin, Yuan Yao

Abstract: Deep learning has aroused extensive attention due to its great empirical success. The efficiency of the block coordinate descent (BCD) methods has been recently demonstrated in deep neural network (DNN) training. However, theoretical studies on their convergence properties are limited due to the highly nonconvex nature of DNN training. In this paper, we aim at providing a general methodology for provable convergence guarantees for this type of methods. In particular, for most of the commonly used DNN training models involving both two- and three-splitting schemes, we establish the global convergence to a critical point at a rate of ${\cal O}(1/k)$, where $k$ is the number of iterations. The results extend to general loss functions which have Lipschitz continuous gradients and deep residual networks (ResNets). Our key development adds several new elements to the Kurdyka-Łojasiewicz inequality framework that enables us to carry out the global convergence analysis of BCD in the general scenario of deep learning.

Submitted 12 May, 2019; v1 submitted 1 March, 2018; originally announced March 2018.

Comments: 27 pages, 2 figures

Journal ref: Proceedings of the 36th International Conference on Machine Learning (ICML), 2019

arXiv:1711.06446 [pdf, other]

Stochastic Non-convex Ordinal Embedding with Stabilized Barzilai-Borwein Step Size

Authors: Ke Ma, Jinshan Zeng, Jiechao Xiong, Qianqian Xu, Xiaochun Cao, Wei Liu, Yuan Yao

Abstract: Learning representation from relative similarity comparisons, often called ordinal embedding, gains rising attention in recent years. Most of the existing methods are batch methods designed mainly based on the convex optimization, say, the projected gradient descent method. However, they are generally time-consuming due to that the singular value decomposition (SVD) is commonly adopted during the update, especially when the data size is very large. To overcome this challenge, we propose a stochastic algorithm called SVRG-SBB, which has the following features: (a) SVD-free via dropping convexity, with good scalability by the use of stochastic algorithm, i.e., stochastic variance reduced gradient (SVRG), and (b) adaptive step size choice via introducing a new stabilized Barzilai-Borwein (SBB) method as the original version for convex problems might fail for the considered stochastic \textit{non-convex} optimization problem. Moreover, we show that the proposed algorithm converges to a stationary point at a rate $\mathcal{O}(\frac{1}{T})$ in our setting, where $T$ is the number of total iterations. Numerous simulations and real-world data experiments are conducted to show the effectiveness of the proposed algorithm via comparing with the state-of-the-art methods, particularly, much lower computational cost with good prediction performance.

Submitted 30 January, 2018; v1 submitted 17 November, 2017; originally announced November 2017.

Comments: 11 pages, 3 figures, 2 tables, accepted by AAAI2018

MSC Class: aaai.org

arXiv:1702.08701 [pdf, ps, other]

Learning rates for classification with Gaussian kernels

Authors: Shao-Bo Lin, Jinshan Zeng, Xiangyu Chang

Abstract: This paper aims at refined error analysis for binary classification using support vector machine (SVM) with Gaussian kernel and convex loss. Our first result shows that for some loss functions such as the truncated quadratic loss and quadratic loss, SVM with Gaussian kernel can reach the almost optimal learning rate, provided the regression function is smooth. Our second result shows that, for a large number of loss functions, under some Tsybakov noise assumption, if the regression function is infinitely smooth, then SVM with Gaussian kernel can achieve the learning rate of order $m^{-1}$, where $m$ is the number of samples.

Submitted 5 October, 2017; v1 submitted 28 February, 2017; originally announced February 2017.

Comments: This paper has been accepted by Neural Computation

arXiv:1503.07810 [pdf, other]

doi 10.1111/rssa.12227

Interpretable Classification Models for Recidivism Prediction

Authors: Jiaming Zeng, Berk Ustun, Cynthia Rudin

Abstract: We investigate a long-debated question, which is how to create predictive models of recidivism that are sufficiently accurate, transparent, and interpretable to use for decision-making. This question is complicated as these models are used to support different decisions, from sentencing, to determining release on probation, to allocating preventative social services. Each use case might have an objective other than classification accuracy, such as a desired true positive rate (TPR) or false positive rate (FPR). Each (TPR, FPR) pair is a point on the receiver operator characteristic (ROC) curve. We use popular machine learning methods to create models along the full ROC curve on a wide range of recidivism prediction problems. We show that many methods (SVM, Ridge Regression) produce equally accurate models along the full ROC curve. However, methods that are designed for interpretability (CART, C5.0) cannot be tuned to produce models that are accurate and/or interpretable. To handle this shortcoming, we use a new method known as SLIM (Supersparse Linear Integer Models) to produce accurate, transparent, and interpretable models along the full ROC curve. These models can be used for decision-making for many different use cases, since they are just as accurate as the most powerful black-box machine learning models, but completely transparent, and highly interpretable.

Submitted 7 July, 2016; v1 submitted 26 March, 2015; originally announced March 2015.

Comments: 45 pages, 17 figures

Journal ref: Journal of Royal Statistics - Series A (2017)

arXiv:1312.5465 [pdf, ps, other]

Learning rates of $l^q$ coefficient regularization learning with Gaussian kernel

Authors: Shaobo Lin, Jinshan Zeng, Jian Fang, Zongben Xu

Abstract: Regularization is a well recognized powerful strategy to improve the performance of a learning machine and $l^q$ regularization schemes with $0<q<\infty$ are central in use. It is known that different $q$ leads to different properties of the deduced estimators, say, $l^2$ regularization leads to smooth estimators while $l^1$ regularization leads to sparse estimators. Then, how does the generalization capabilities of $l^q$ regularization learning vary with $q$? In this paper, we study this problem in the framework of statistical learning theory and show that implementing $l^q$ coefficient regularization schemes in the sample dependent hypothesis space associated with Gaussian kernel can attain the same almost optimal learning rates for all $0<q<\infty$. That is, the upper and lower bounds of learning rates for $l^q$ regularization learning are asymptotically identical for all $0<q<\infty$. Our finding tentatively reveals that, in some modeling contexts, the choice of $q$ might not have a strong impact with respect to the generalization capability. From this perspective, $q$ can be arbitrarily specified, or specified merely by other no generalization criteria like smoothness, computational complexity, sparsity, etc.

Submitted 24 September, 2014; v1 submitted 19 December, 2013; originally announced December 2013.

Comments: 26 pages, 3 figures

MSC Class: 68T05 ACM Class: F.2.1

arXiv:1311.4150 [pdf, ps, other]

Towards Big Topic Modeling

Authors: Jian-Feng Yan, Jia Zeng, Zhi-Qiang Liu, Yang Gao

Abstract: To solve the big topic modeling problem, we need to reduce both time and space complexities of batch latent Dirichlet allocation (LDA) algorithms. Although parallel LDA algorithms on the multi-processor architecture have low time and space complexities, their communication costs among processors often scale linearly with the vocabulary size and the number of topics, leading to a serious scalability problem. To reduce the communication complexity among processors for a better scalability, we propose a novel communication-efficient parallel topic modeling architecture based on power law, which consumes orders of magnitude less communication time when the number of topics is large. We combine the proposed communication-efficient parallel architecture with the online belief propagation (OBP) algorithm referred to as POBP for big topic modeling tasks. Extensive empirical results confirm that POBP has the following advantages to solve the big topic modeling problem: 1) high accuracy, 2) communication-efficient, 3) fast speed, and 4) constant memory usage when compared with recent state-of-the-art parallel LDA algorithms on the multi-processor architecture.

Submitted 17 November, 2013; originally announced November 2013.

Comments: 14 pages

arXiv:1307.6616

Does generalization performance of $l^q$ regularization learning depend on $q$? A negative example

Authors: Shaobo Lin, Chen Xu, Jingshan Zeng, Jian Fang

Abstract: $l^q$-regularization has been demonstrated to be an attractive technique in machine learning and statistical modeling. It attempts to improve the generalization (prediction) capability of a machine (model) through appropriately shrinking its coefficients. The shape of a $l^q$ estimator differs in varying choices of the regularization order $q$. In particular, $l^1$ leads to the LASSO estimate, while $l^{2}$ corresponds to the smooth ridge regression. This makes the order $q$ a potential tuning parameter in applications. To facilitate the use of $l^{q}$-regularization, we intend to seek for a modeling strategy where an elaborative selection on $q$ is avoidable. In this spirit, we place our investigation within a general framework of $l^{q}$-regularized kernel learning under a sample dependent hypothesis space (SDHS). For a designated class of kernel functions, we show that all $l^{q}$ estimators for $0< q < \infty$ attain similar generalization error bounds. These estimated bounds are almost optimal in the sense that up to a logarithmic factor, the upper and lower bounds are asymptotically identical. This finding tentatively reveals that, in some modeling contexts, the choice of $q$ might not have a strong impact in terms of the generalization capability. From this perspective, $q$ can be arbitrarily specified, or specified merely by other no generalization criteria like smoothness, computational complexity, sparsity, etc.

Submitted 13 June, 2023; v1 submitted 24 July, 2013; originally announced July 2013.

Comments: There is a critical error in the proof

Showing 1–33 of 33 results for author: Zeng, J