Search | arXiv e-print repository

arXiv:2406.12017 [pdf, other]

Sparsity-Constraint Optimization via Splicing Iteration

Authors: Zezhi Wang, ** Zhu, Junxian Zhu, Borui Tang, Hongmei Lin, Xueqin Wang

Abstract: Sparsity-constraint optimization has wide applicability in signal processing, statistics, and machine learning. Existing fast algorithms must burdensomely tune parameters, such as the step size or the implementation of precise stop criteria, which may be challenging to determine in practice. To address this issue, we develop an algorithm named Sparsity-Constraint Optimization via sPlicing itEration (SCOPE) to optimize nonlinear differential objective functions with strong convexity and smoothness in low dimensional subspaces. Algorithmically, the SCOPE algorithm converges effectively without tuning parameters. Theoretically, SCOPE has a linear convergence rate and converges to a solution that recovers the true support set when it correctly specifies the sparsity. We also develop parallel theoretical results without restricted-isometry-property-type conditions. We apply SCOPE's versatility and power to solve sparse quadratic optimization, learn sparse classifiers, and recover sparse Markov networks for binary variables. The numerical results on these specific tasks reveal that SCOPE perfectly identifies the true support set with a 10--1000 speedup over the standard exact solver, confirming SCOPE's algorithmic and theoretical merits. Our open-source Python package skscope based on C++ implementation is publicly available on GitHub, reaching a ten-fold speedup on the competing convex relaxation methods implemented by the cvxpy library.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 34 pages

arXiv:2406.01252 [pdf, other]

Towards Scalable Automated Alignment of LLMs: A Survey

Authors: Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han, Le Sun, Hongyu Lin, Bowen Yu

Abstract: Alignment is the most critical step in building large language models (LLMs) that meet human needs. With the rapid development of LLMs gradually surpassing human capabilities, traditional alignment methods based on human-annotation are increasingly unable to meet the scalability demands. Therefore, there is an urgent need to explore new sources of automated alignment signals and technical approaches. In this paper, we systematically review the recently emerging methods of automated alignment, attempting to explore how to achieve effective, scalable, automated alignment once the capabilities of LLMs exceed those of humans. Specifically, we categorize existing automated alignment methods into 4 major categories based on the sources of alignment signals and discuss the current status and potential development of each category. Additionally, we explore the underlying mechanisms that enable automated alignment and discuss the essential factors that make automated alignment technologies feasible and effective from the fundamental role of alignment.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.11284 [pdf, other]

The Logic of Counterfactuals and the Epistemology of Causal Inference

Authors: Hanti Lin

Abstract: The 2021 Nobel Prize in Economics recognized a theory of causal inference, which deserves more attention from philosophers. To that end, I develop a dialectic that extends the Lewis-Stalnaker debate on a logical principle called Conditional Excluded Middle (CEM). I first play the good cop for CEM, and give a new argument for it: a Quine-Putnam indispensability argument based on the Nobel-Prize winning theory. But then I switch sides and play the bad cop: I undermine that argument with a new theory of causal inference that preserves the success of the original theory but dispenses with CEM.

Submitted 18 May, 2024; originally announced May 2024.

arXiv:2405.03723 [pdf, other]

Generative adversarial learning with optimal input dimension and its adaptive generator architecture

Authors: Zhiyao Tan, Ling Zhou, Huazhen Lin

Abstract: We investigate the impact of the input dimension on the generalization error in generative adversarial networks (GANs). In particular, we first provide both theoretical and practical evidence to validate the existence of an optimal input dimension (OID) that minimizes the generalization error. Then, to identify the OID, we introduce a novel framework called generalized GANs (G-GANs), which includes existing GANs as a special case. By incorporating the group penalty and the architecture penalty developed in the paper, G-GANs have several intriguing features. First, our framework offers adaptive dimensionality reduction from the initial dimension to a dimension necessary for generating the target distribution. Second, this reduction in dimensionality also shrinks the required size of the generator network architecture, which is automatically identified by the proposed architecture penalty. Both reductions in dimensionality and the generator network significantly improve the stability and the accuracy of the estimation and prediction. Theoretical support for the consistent selection of the input dimension and the generator network is provided. Third, the proposed algorithm involves an end-to-end training process, and the algorithm allows for dynamic adjustments between the input dimension and the generator network during training, further enhancing the overall performance of G-GANs. Extensive experiments conducted with simulated and benchmark data demonstrate the superior performance of G-GANs. In particular, compared to that of off-the-shelf methods, G-GANs achieve an average improvement of 45.68% in the CT slice dataset, 43.22% in the MNIST dataset and 46.94% in the FashionMNIST dataset in terms of the maximum mean discrepancy or Frechet inception distance. Moreover, the features generated based on the input dimensions identified by G-GANs align with visually significant features.

Submitted 5 May, 2024; originally announced May 2024.

arXiv:2404.16954 [pdf, other]

Taming False Positives in Out-of-Distribution Detection with Human Feedback

Authors: Harit Vishwakarma, Heguang Lin, Ramya Korlakai Vinayak

Abstract: Robustness to out-of-distribution (OOD) samples is crucial for safely deploying machine learning models in the open world. Recent works have focused on designing scoring functions to quantify OOD uncertainty. Setting appropriate thresholds for these scoring functions for OOD detection is challenging as OOD samples are often unavailable up front. Typically, thresholds are set to achieve a desired true positive rate (TPR), e.g., 95% TPR. However, this can lead to very high false positive rates (FPR), ranging from 60 to 96%, as observed in the Open-OOD benchmark. In safety-critical real-life applications, e.g., medical diagnosis, controlling the FPR is essential when dealing with various OOD samples dynamically. To address these challenges, we propose a mathematically grounded OOD detection framework that leverages expert feedback to safely update the threshold on the fly. We provide theoretical results showing that it is guaranteed to meet the FPR constraint at all times while minimizing the use of human feedback. Another key feature of our framework is that it can work with any scoring function for OOD uncertainty quantification. Empirical evaluation of our system on synthetic and benchmark OOD datasets shows that our method can maintain FPR at most 5% while maximizing TPR.

Submitted 25 April, 2024; originally announced April 2024.

Comments: Appeared in the 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024)

Journal ref: PMLR 238:1486-1494, 2024

arXiv:2404.13309 [pdf, ps, other]

Latent Schrödinger Bridge Diffusion Model for Generative Learning

Authors: Yuling Jiao, Lican Kang, Huazhen Lin, ** Liu, Heng Zuo

Abstract: This paper aims to conduct a comprehensive theoretical analysis of current diffusion models. We introduce a novel generative learning methodology utilizing the Schrödinger bridge diffusion model in latent space as the framework for theoretical exploration in this domain. Our approach commences with the pre-training of an encoder-decoder architecture using data originating from a distribution that may diverge from the target distribution, thus facilitating the accommodation of a large sample size through the utilization of pre-existing large-scale models. Subsequently, we develop a diffusion model within the latent space utilizing the Schrödinger bridge framework. Our theoretical analysis encompasses the establishment of end-to-end error analysis for learning distributions via the latent Schrödinger bridge diffusion model. Specifically, we control the second-order Wasserstein distance between the generated distribution and the target distribution. Furthermore, our obtained convergence rates effectively mitigate the curse of dimensionality, offering robust theoretical support for prevailing diffusion models.

Submitted 20 April, 2024; originally announced April 2024.

arXiv:2402.14966 [pdf, other]

Smoothness Adaptive Hypothesis Transfer Learning

Authors: Haotian Lin, Matthew Reimherr

Abstract: Many existing two-phase kernel-based hypothesis transfer learning algorithms employ the same kernel regularization across phases and rely on the known smoothness of functions to obtain optimality. Therefore, they fail to adapt to the varying and unknown smoothness between the target/source and their offset in practice. In this paper, we address these problems by proposing Smoothness Adaptive Transfer Learning (SATL), a two-phase kernel ridge regression (KRR)-based algorithm. We first prove that employing the misspecified fixed bandwidth Gaussian kernel in target-only KRR learning can achieve minimax optimality and derive an adaptive procedure to the unknown Sobolev smoothness. Leveraging these results, SATL employs Gaussian kernels in both phases so that the estimators can adapt to the unknown smoothness of the target/source and their offset function. We derive the minimax lower bound of the learning problem in excess risk and show that SATL enjoys a matching upper bound up to a logarithmic factor. The minimax convergence rate sheds light on the factors influencing transfer dynamics and demonstrates the superiority of SATL compared to non-transfer learning settings. While our main objective is a theoretical analysis, we also conduct several experiments to confirm our results.

Submitted 22 February, 2024; originally announced February 2024.

arXiv:2311.13768 [pdf, other]

Valid confidence intervals for regression with best subset selection

Authors: Huiming Lin, Meng Li

Abstract: Classical confidence intervals after best subset selection are widely implemented in statistical software and are routinely used to guide practitioners in scientific fields to conclude significance. However, there are increasing concerns in the recent literature about the validity of these confidence intervals in that the intended frequentist coverage is not attained. In the context of the Akaike information criterion (AIC), recent studies observe an under-coverage phenomenon in terms of overfitting, where the estimate of error variance under the selected submodel is smaller than that for the true model. Under-coverage is particularly troubling in selective inference as it points to inflated Type I errors that would invalidate significant findings. In this article, we delineate a complementary, yet provably more deciding factor behind the incorrect coverage of classical confidence intervals under AIC, in terms of altered conditional sampling distributions of pivotal quantities. Resting on selective techniques developed in other settings, our finite-sample characterization of the selection event under AIC uncovers its geometry as a union of finitely many intervals on the real line, based on which we derive new confidence intervals with guaranteed coverage for any sample size. This geometry derived for AIC selection enables exact (and typically less than exact) conditioning, circumventing the need for the excessive conditioning common in other post-selection methods. The proposed methods are easy to implement and can be broadly applied to other commonly used best subset selection criteria. In an application to a classical US consumption dataset, the proposed confidence intervals arrive at different conclusions compared to the conventional ones, even when the selected model is the full model, leading to interpretable findings that better align with empirical observations.

Submitted 22 November, 2023; originally announced November 2023.

arXiv:2310.14608 [pdf, other]

CAD-DA: Controllable Anomaly Detection after Domain Adaptation by Statistical Inference

Authors: Vo Nguyen Le Duy, Hsuan-Tien Lin, Ichiro Takeuchi

Abstract: We propose a novel statistical method for testing the results of anomaly detection (AD) under domain adaptation (DA), which we call CAD-DA -- controllable AD under DA. The distinct advantage of the CAD-DA lies in its ability to control the probability of misidentifying anomalies under a pre-specified level α (e.g., 0.05). The challenge within this DA setting is the necessity to account for the influence of DA to ensure the validity of the inference results. Our solution to this challenge leverages the concept of conditional Selective Inference to handle the impact of DA. To our knowledge, this is the first work capable of conducting a valid statistical inference within the context of DA. We evaluate the performance of the CAD-DA method on both synthetic and real-world datasets.

Submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.10048 [pdf, other]

Evaluation of transplant benefits with the U.S. Scientific Registry of Transplant Recipients by semiparametric regression of mean residual life

Authors: Ge Zhao, Yanyuan Ma, Huazhen Lin, Yi Li

Abstract: Kidney transplantation is the most effective renal replacement therapy for end stage renal disease patients. With the severe shortage of kidney supplies and for the clinical effectiveness of transplantation, patient's life expectancy post transplantation is used to prioritize patients for transplantation; however, severe comorbidity conditions and old age are the most dominant factors that negatively impact post-transplantation life expectancy, effectively precluding sick or old patients from receiving transplants. It would be crucial to design objective measures to quantify the transplantation benefit by comparing the mean residual life with and without a transplant, after adjusting for comorbidity and demographic conditions. To address this urgent need, we propose a new class of semiparametric covariate-dependent mean residual life models. Our method estimates covariate effects semiparametrically efficiently and the mean residual life function nonparametrically, enabling us to predict the residual life increment potential for any given patient. Our method potentially leads to a more fair system that prioritizes patients who would have the largest residual life gains. Our analysis of the kidney transplant data from the U.S. Scientific Registry of Transplant Recipients also suggests that a single index of covariates summarizes well the impacts of multiple covariates, which may facilitate interpretations of each covariate's effect. Our subgroup analysis further disclosed inequalities in survival gains across groups defined by race, gender and insurance type (reflecting socioeconomic status).

Submitted 17 October, 2023; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: 68 pages, 13 figures. arXiv admin note: text overlap with arXiv:2011.04067

arXiv:2310.07999 [pdf, other]

LEMON: Lossless model expansion

Authors: Yite Wang, Jiahao Su, Hanlin Lu, Cong Xie, Tianyi Liu, Jianbo Yuan, Haibin Lin, Ruoyu Sun, Hongxia Yang

Abstract: Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance and has further led to the emergence of sophisticated reasoning capabilities in foundation models. Such scaling generally requires training large models from scratch with random initialization, failing to leverage the knowledge acquired by their smaller counterparts, which are already resource-intensive to obtain. To tackle this inefficiency, we present LosslEss MOdel ExpansioN (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts. This is followed by model training with an optimized learning rate scheduler tailored explicitly for the scaled models, substantially reducing the training time compared to training from scratch. Notably, LEMON is versatile, ensuring compatibility with various network structures, including models like Vision Transformers and BERT. Our empirical results demonstrate that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.

Submitted 11 October, 2023; originally announced October 2023.

Comments: Preprint

arXiv:2309.12872 [pdf, other]

Deep regression learning with optimal loss function

Authors: Xuancheng Wang, Ling Zhou, Huazhen Lin

Abstract: In this paper, we develop a novel efficient and robust nonparametric regression estimator under a framework of feedforward neural network. There are several interesting characteristics for the proposed estimator. First, the loss function is built upon an estimated maximum likelihood function, which integrates the information from observed data, as well as the information from data structure. Consequently, the resulting estimator has desirable optimal properties, such as efficiency. Second, different from the traditional maximum likelihood estimation (MLE), the proposed method avoids the specification of the distribution, hence is flexible to any kind of distribution, such as heavy tails, multimodal or heterogeneous distribution. Third, the proposed loss function relies on probabilities rather than direct observations as in least squares, contributing to the robustness of the proposed estimator. Finally, the proposed loss function involves nonparametric regression function only. This enables a direct application of existing packages, simplifying the computation and programming. We establish the large sample property of the proposed estimator in terms of its excess risk and minimax near-optimal rate. The theoretical results demonstrate that the proposed estimator is equivalent to the true MLE in which the density function is known. Our simulation studies show that the proposed estimator outperforms the existing methods in terms of prediction accuracy, efficiency and robustness. Particularly, it is comparable to the true MLE, and even gets better as the sample size increases. This implies that the adaptive and data-driven loss function from the estimated density may offer an additional avenue for capturing valuable information. We further apply the proposed method to four real data examples, resulting in significantly reduced out-of-sample prediction errors compared to existing methods.

Submitted 22 September, 2023; originally announced September 2023.

arXiv:2309.00125 [pdf, other]

Pure Differential Privacy for Functional Summaries via a Laplace-like Process

Authors: Haotian Lin, Matthew Reimherr

Abstract: Many existing mechanisms to achieve differential privacy (DP) on infinite-dimensional functional summaries often involve embedding these summaries into finite-dimensional subspaces and applying traditional DP techniques. Such mechanisms generally treat each dimension uniformly and struggle with complex, structured summaries. This work introduces a novel mechanism for DP functional summary release: the Independent Component Laplace Process (ICLP) mechanism. This mechanism treats the summaries of interest as truly infinite-dimensional objects, thereby addressing several limitations of existing mechanisms. We establish the feasibility of the proposed mechanism in multiple function spaces. Several statistical estimation problems are considered, and we demonstrate one can enhance the utility of sanitized summaries by oversmoothing their non-private counterpart. Numerical experiments on synthetic and real datasets demonstrate the efficacy of the proposed mechanism.

Submitted 3 March, 2024; v1 submitted 31 August, 2023; originally announced September 2023.

arXiv:2308.00251 [pdf, other]

Best-Subset Selection in Generalized Linear Models: A Fast and Consistent Algorithm via Splicing Technique

Authors: Junxian Zhu, ** Zhu, Borui Tang, Xuanyu Chen, Hongmei Lin, Xueqin Wang

Abstract: In high-dimensional generalized linear models, it is crucial to identify a sparse model that adequately accounts for response variation. Although best subset selection has been widely regarded as the Holy Grail of problems of this type, achieving either computational efficiency or statistical guarantees is challenging. In this article, we intend to surmount this obstacle by utilizing a fast algorithm to select the best subset with high certainty. We propose and illustrate an algorithm for best subset recovery under regularity conditions. Under mild conditions, the computational complexity of our algorithm scales polynomially with sample size and dimension. In addition to demonstrating the statistical properties of our method, extensive numerical experiments reveal that it outperforms existing methods for variable selection and coefficient estimation. The runtime analysis shows that our implementation achieves approximately a fourfold speedup compared to popular variable selection toolkits like glmnet and ncvreg.

Submitted 31 July, 2023; originally announced August 2023.

arXiv:2211.12620 [pdf, other]

Promises and Pitfalls of Threshold-based Auto-labeling

Authors: Harit Vishwakarma, Heguang Lin, Frederic Sala, Ramya Korlakai Vinayak

Abstract: Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which the data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely-used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.

Submitted 21 February, 2024; v1 submitted 22 November, 2022; originally announced November 2022.

Comments: NeurIPS 2023 (Spotlight)

Journal ref: Thirty Seventh Conference on Neural Information Processing Systems (NeurIPS 2023)

arXiv:2211.12012 [pdf, other]

Factor-guided functional PCA for high-dimensional functional data

Authors: Shoudao Wen, Huazhen Lin

Abstract: The literature on high-dimensional functional data focuses on either the dependence over time or the correlation among functional variables. In this paper, we propose a factor-guided functional principal component analysis (FaFPCA) method to consider both temporal dependence and correlation of variables so that the extracted features are as sufficient as possible. In particular, we use a factor process to consider the correlation among high-dimensional functional variables and then apply functional principal component analysis (FPCA) to the factor processes to address the dependence over time. Furthermore, to solve the computational problem arising from triple-infinite dimensions, we creatively build some moment equations to estimate loading, scores and eigenfunctions in closed form without rotation. Theoretically, we establish the asymptotical properties of the proposed estimator. Extensive simulation studies demonstrate that our proposed method outperforms other competitors in terms of accuracy and computational cost. The proposed method is applied to analyze the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, resulting in higher prediction accuracy and 41 important ROIs that are associated with Alzheimer's disease, 23 of which have been confirmed by the literature.

Submitted 22 November, 2022; originally announced November 2022.

Comments: 34 pages, 5 figures, 3 tables

arXiv:2207.09081 [pdf, other]

Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning

Authors: Wenhao Ding, Haohong Lin, Bo Li, Ding Zhao

Abstract: As a pivotal component to attaining generalizable solutions in human intelligence, reasoning provides great potential for reinforcement learning (RL) agents' generalization towards varied goals by summarizing part-to-whole arguments and discovering cause-and-effect relations. However, how to discover and represent causalities remains a huge gap that hinders the development of causal RL. In this paper, we augment Goal-Conditioned RL (GCRL) with Causal Graph (CG), a structure built upon the relation between objects and events. We novelly formulate the GCRL problem into variational likelihood maximization with CG as latent variables. To optimize the derived objective, we propose a framework with theoretical performance guarantees that alternates between two steps: using interventional data to estimate the posterior of CG; using CG to learn generalizable models and interpretable policies. Due to the lack of public benchmarks that verify generalization capability under reasoning, we design nine tasks and then empirically show the effectiveness of the proposed method against five baselines on these tasks. Further theoretical analysis shows that our performance improvement is attributed to the virtuous cycle of causal discovery, transition modeling, and policy training, which aligns with the experimental evidence in extensive ablation studies.

Submitted 17 May, 2023; v1 submitted 19 July, 2022; originally announced July 2022.

Comments: Accepted to NeurIPS 2022

arXiv:2206.04277 [pdf, other]

On Hypothesis Transfer Learning of Functional Linear Models

Authors: Haotian Lin, Matthew Reimherr

Abstract: We study the transfer learning (TL) for the functional linear regression (FLR) under the Reproducing Kernel Hilbert Space (RKHS) framework, observing the TL techniques in existing high-dimensional linear regression is not compatible with the truncation-based FLR methods as functional data are intrinsically infinite-dimensional and generated by smooth underlying processes. We measure the similarity across tasks using RKHS distance, allowing the type of information being transferred tied to the properties of the imposed RKHS. Building on the hypothesis offset transfer learning paradigm, two algorithms are proposed: one conducts the transfer when positive sources are known, while the other leverages aggregation techniques to achieve robust transfer without prior information about the sources. We establish lower bounds for this learning problem and show the proposed algorithms enjoy a matching asymptotic upper bound. These analyses provide statistical insights into factors that contribute to the dynamics of the transfer. We also extend the results to functional generalized linear models. The effectiveness of the proposed algorithms is demonstrated on extensive synthetic data as well as a financial data application.

Submitted 22 February, 2024; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: The results are extended to functional GLM

arXiv:2202.08180 [pdf, other]

Geometry of the Minimum Volume Confidence Sets

Authors: Heguang Lin, Mengze Li, Daniel Pimentel-Alarcón, Matthew Malloy

Abstract: Computation of confidence sets is central to data science and machine learning, serving as the workhorse of A/B testing and underpinning the operation and analysis of reinforcement learning algorithms. This paper studies the geometry of the minimum-volume confidence sets for the multinomial parameter. When used in place of more standard confidence sets and intervals based on bounds and asymptotic approximation, learning algorithms can exhibit improved sample complexity. Prior work showed the minimum-volume confidence sets are the level-sets of a discontinuous function defined by an exact p-value. While the confidence sets are optimal in that they have minimum average volume, computation of membership of a single point in the set is challenging for problems of modest size. Since the confidence sets are level-sets of discontinuous functions, little is apparent about their geometry. This paper studies the geometry of the minimum volume confidence sets by enumerating and covering the continuous regions of the exact p-value function. This addresses a fundamental question in A/B testing: given two multinomial outcomes, how can one determine if their corresponding minimum volume confidence sets are disjoint? We answer this question in a restricted setting.

Submitted 16 February, 2022; originally announced February 2022.

arXiv:2110.09823 [pdf, other]

An Empirical Study: Extensive Deep Temporal Point Process

Authors: Haitao Lin, Cheng Tan, Lirong Wu, Zhangyang Gao, Stan. Z. Li

Abstract: Temporal point process as the stochastic process on continuous domain of time is commonly used to model the asynchronous event sequence featuring with occurrence timestamps. Thanks to the strong expressivity of deep neural networks, they are emerging as a promising choice for capturing the patterns in asynchronous sequences, in the context of temporal point process. In this paper, we first review recent research emphasis and difficulties in modeling asynchronous event sequences with deep temporal point process, which can be concluded into four fields: encoding of history sequence, formulation of conditional intensity function, relational discovery of events and learning approaches for optimization. We introduce most of recently proposed models by dismantling them into the four parts, and conduct experiments by remodularizing the first three parts with the same learning strategy for a fair empirical evaluation. Besides, we extend the history encoders and conditional intensity function family, and propose a Granger causality discovery framework for exploiting the relations among multi-types of events. Because the Granger causality can be represented by the Granger causality graph, discrete graph structure learning in the framework of Variational Inference is employed to reveal latent structures of the graph. Further experiments show that the proposed framework with latent graph discovery can both capture the relations and achieve an improved fitting and predicting performance.

Submitted 21 December, 2021; v1 submitted 19 October, 2021; originally announced October 2021.

Comments: 22 pages, 8 figures

arXiv:2110.04367 [pdf, other]

Hybrid Random Features

Authors: Krzysztof Choromanski, Haoxian Chen, Han Lin, Yuanzhe Ma, Arijit Sehanobish, Deepali Jain, Michael S Ryoo, Jake Varley, Andy Zeng, Valerii Likhosherstov, Dmitry Kalashnikov, Vikas Sindhwani, Adrian Weller

Abstract: We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) that automatically adapt the quality of kernel estimation to provide most accurate approximation in the defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric (Rahimi and Recht, 2007) or (recently introduced in the context of linear-attention Transformers) positive random features (Choromanski et al., 2021). By generalizing Bochner's Theorem for softmax/Gaussian kernels and leveraging random features for compositional kernels, the HRF-mechanism provides strong theoretical guarantees - unbiased approximation and strictly smaller worst-case relative errors than its counterparts. We conduct exhaustive empirical evaluation of HRF ranging from pointwise kernel estimation experiments, through tests on data admitting clustering structure to benchmarking implicit-attention Transformers (also for downstream Robotics applications), demonstrating its quality in a wide spectrum of machine learning problems.

Submitted 30 January, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

Comments: Published as a conference paper at ICLR 2022

arXiv:2105.07829 [pdf, other]

Compressed Communication for Distributed Training: Adaptive Methods and System

Authors: Yuchen Zhong, Cong Xie, Shuai Zheng, Haibin Lin

Abstract: Communication overhead severely hinders the scalability of distributed machine learning systems. Recently, there has been a growing interest in using gradient compression to reduce the communication overhead of the distributed training. However, there is little understanding of applying gradient compression to adaptive gradient methods. Moreover, its performance benefits are often limited by the non-negligible compression overhead. In this paper, we first introduce a novel adaptive gradient method with gradient compression. We show that the proposed method has a convergence rate of $\mathcal{O}(1/\sqrt{T})$ for non-convex problems. In addition, we develop a scalable system called BytePS-Compress for two-way compression, where the gradients are compressed in both directions between workers and parameter servers. BytePS-Compress pipelines the compression and decompression on CPUs and achieves a high degree of parallelism. Empirical evaluations show that we improve the training time of ResNet50, VGG16, and BERT-base by 5.0%, 58.1%, 23.3%, respectively, without any accuracy loss with 25 Gb/s networking. Furthermore, for training the BERT models, we achieve a compression rate of 333x compared to the mixed-precision training.

Submitted 17 May, 2021; originally announced May 2021.

arXiv:2105.05555 [pdf, ps, other]

Robust Learning of Fixed-Structure Bayesian Networks in Nearly-Linear Time

Authors: Yu Cheng, Honghao Lin

Abstract: We study the problem of learning Bayesian networks where an $ε$-fraction of the samples are adversarially corrupted. We focus on the fully-observable case where the underlying graph structure is known. In this work, we present the first nearly-linear time algorithm for this problem with a dimension-independent error guarantee. Previous robust algorithms with comparable error guarantees are slower by at least a factor of $(d/ε)$, where $d$ is the number of variables in the Bayesian network and $ε$ is the fraction of corrupted samples. Our algorithm and analysis are considerably simpler than those in previous work. We achieve this by establishing a direct connection between robust learning of Bayesian networks and robust mean estimation. As a subroutine in our algorithm, we develop a robust mean estimation algorithm whose runtime is nearly-linear in the number of nonzeros in the input samples, which may be of independent interest.

Submitted 12 May, 2021; originally announced May 2021.

arXiv:2012.11100 [pdf, other]

Two-directional simultaneous inference for high-dimensional models

Authors: Wei Liu, Huazhen Lin, ** Liu, Shurong Zheng

Abstract: This paper proposes a general two directional simultaneous inference (TOSI) framework for high-dimensional models with a manifest variable or latent variable structure, for example, high-dimensional mean models, high-dimensional sparse regression models, and high-dimensional latent factors models. TOSI performs simultaneous inference on a set of parameters from two directions, one to test whether the assumed zero parameters indeed are zeros and one to test whether exist zeros in the parameter set of nonzeros. As a result, we can exactly identify whether the parameters are zeros, thereby keeping the data structure fully and parsimoniously expressed. We theoretically prove that the proposed TOSI method asymptotically controls the Type I error at the prespecified significance level and that the testing power converges to one. Simulations are conducted to examine the performance of the proposed method in finite sample situations and two real datasets are analyzed. The results show that the TOSI method is more predictive and has more interpretable estimators than existing methods.

Submitted 6 February, 2023; v1 submitted 20 December, 2020; originally announced December 2020.

arXiv:2011.04067 [pdf, ps, other]

Semiparametric regression of mean residual life with censoring and covariate dimension reduction

Authors: Ge Zhao, Yanyuan Ma, Huazhen Lin, Yi Li

Abstract: We propose a new class of semiparametric regression models of mean residual life for censored outcome data. The models, which enable us to estimate the expected remaining survival time and generalize commonly used mean residual life models, also conduct covariate dimension reduction. Using the geometric approaches in semiparametrics literature and the martingale properties with survival data, we propose a flexible inference procedure that relaxes the parametric assumptions on the dependence of mean residual life on covariates and how long a patient has lived. We show that the estimators for the covariate effects are root-$n$ consistent, asymptotically normal, and semiparametrically efficient. With the unspecified mean residual life function, we provide a nonparametric estimator for predicting the residual life of a given subject, and establish the root-$n$ consistency and asymptotic normality for this estimator. Numerical experiments are conducted to illustrate the feasibility of the proposed estimators. We apply the method to analyze a national kidney transplantation dataset to further demonstrate the utility of the work.

Submitted 8 November, 2020; originally announced November 2020.

Comments: 73 pages, 9 figures

arXiv:2009.11612 [pdf, other]

Clustering Based on Graph of Density Topology

Authors: Zhangyang Gao, Haitao Lin, Stan. Z Li

Abstract: Data clustering with uneven distribution in high level noise is challenging. Currently, HDBSCAN is considered as the SOTA algorithm for this problem. In this paper, we propose a novel clustering algorithm based on what we call graph of density topology (GDT). GDT jointly considers the local and global structures of data samples: firstly forming local clusters based on a density growing process with a strategy for properly noise handling as well as cluster boundary detection; and then estimating a GDT from relationship between local clusters in terms of a connectivity measure, giving global topological graph. The connectivity, measuring similarity between neighboring local clusters, is based on local clusters rather than individual points, ensuring its robustness to even very large noise. Evaluation results on both toy and real-world datasets show that GDT achieves the SOTA performance by far on almost all the popular datasets, and has a low time complexity of O(nlogn). The code is available at https://github.com/gaozhangyang/DGC.git.

Submitted 24 September, 2020; originally announced September 2020.

arXiv:2009.06795 [pdf, other]

DynamicVAE: Decoupling Reconstruction Error and Disentangled Representation Learning

Authors: Huajie Shao, Haohong Lin, Qinmin Yang, Shuochao Yao, Han Zhao, Tarek Abdelzaher

Abstract: This paper challenges the common assumption that the weight $β$, in $β$-VAE, should be larger than $1$ in order to effectively disentangle latent factors. We demonstrate that $β$-VAE, with $β< 1$, can not only attain good disentanglement but also significantly improve reconstruction accuracy via dynamic control. The paper removes the inherent trade-off between reconstruction accuracy and disentanglement for $β$-VAE. Existing methods, such as $β$-VAE and FactorVAE, assign a large weight to the KL-divergence term in the objective function, leading to high reconstruction errors for the sake of better disentanglement. To mitigate this problem, a ControlVAE has recently been developed that dynamically tunes the KL-divergence weight in an attempt to control the trade-off to a more favorable point. However, ControlVAE fails to eliminate the conflict between the need for a large $β$ (for disentanglement) and the need for a small $β$. Instead, we propose DynamicVAE that maintains a different $β$ at different stages of training, thereby decoupling disentanglement and reconstruction accuracy. In order to evolve the weight, $β$, along a trajectory that enables such decoupling, DynamicVAE leverages a modified incremental PI (proportional-integral) controller, and employs a moving average as well as a hybrid annealing method to evolve the value of KL-divergence smoothly in a tightly controlled fashion. We theoretically prove the stability of the proposed approach. Evaluation results on three benchmark datasets demonstrate that DynamicVAE significantly improves the reconstruction accuracy while achieving disentanglement comparable to the best of existing methods. The results verify that our method can separate disentangled representation learning and reconstruction, removing the inherent tension between the two.

Submitted 30 September, 2020; v1 submitted 14 September, 2020; originally announced September 2020.

arXiv:2007.13221 [pdf, other]

CSER: Communication-efficient SGD with Error Reset

Authors: Cong Xie, Shuai Zheng, Oluwasanmi Koyejo, Indranil Gupta, Mu Li, Haibin Lin

Abstract: The scalability of Distributed Stochastic Gradient Descent (SGD) is today limited by communication bottlenecks. We propose a novel SGD variant: Communication-efficient SGD with Error Reset, or CSER. The key idea in CSER is first a new technique called "error reset" that adapts arbitrary compressors for SGD, producing bifurcated local models with periodic reset of resulting local residual errors. Second we introduce partial synchronization for both the gradients and the models, leveraging advantages from them. We prove the convergence of CSER for smooth non-convex problems. Empirical results show that when combined with highly aggressive compressors, the CSER algorithms accelerate the distributed training by nearly 10x for CIFAR-100, and by 4.5x for ImageNet.

Submitted 4 December, 2020; v1 submitted 26 July, 2020; originally announced July 2020.

arXiv:2007.04387 [pdf, other]

Double spike Dirichlet priors for structured weighting

Authors: Huiming Lin, Meng Li

Abstract: Assigning weights to a large pool of objects is a fundamental task in a wide variety of applications. In this article, we introduce the concept of structured high-dimensional probability simplexes, in which most components are zero or near zero and the remaining ones are close to each other. Such structure is well motivated by (i) high-dimensional weights that are common in modern applications, and (ii) ubiquitous examples in which equal weights -- despite their simplicity -- often achieve favorable or even state-of-the-art predictive performance. This particular structure, however, presents unique challenges partly because, unlike high-dimensional linear regression, the parameter space is a simplex and pattern switching between partial constancy and sparsity is unknown. To address these challenges, we propose a new class of double spike Dirichlet priors to shrink a probability simplex to one with the desired structure. When applied to ensemble learning, such priors lead to a Bayesian method for structured high-dimensional ensembles that is useful for forecast combination and improving random forests, while enabling uncertainty quantification. We design efficient Markov chain Monte Carlo algorithms for implementation. Posterior contraction rates are established to study large sample behaviors of the posterior distribution. We demonstrate the wide applicability and competitive performance of the proposed methods through simulations and two real data applications using the European Central Bank Survey of Professional Forecasters data set and a data set from the UC Irvine Machine Learning Repository (UCI).

Submitted 16 September, 2022; v1 submitted 8 July, 2020; originally announced July 2020.

arXiv:2007.02235 [pdf, other]

Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels

Authors: Yu-Ting Chou, Gang Niu, Hsuan-Tien Lin, Masashi Sugiyama

Abstract: In weakly supervised learning, unbiased risk estimator (URE) is a powerful tool for training classifiers when training and test data are drawn from different distributions. Nevertheless, UREs lead to overfitting in many problem settings when the models are complex like deep networks. In this paper, we investigate reasons for such overfitting by studying a weakly supervised problem called learning with complementary labels. We argue the quality of gradient estimation matters more in risk minimization. Theoretically, we show that a URE gives an unbiased gradient estimator (UGE). Practically, however, UGEs may suffer from huge variance, which causes empirical gradients to be usually far away from true gradients during minimization. To this end, we propose a novel surrogate complementary loss (SCL) framework that trades zero bias with reduced variance and makes empirical gradients more aligned with true gradients in the direction. Thanks to this characteristic, SCL successfully mitigates the overfitting issue and improves URE-based methods.

Submitted 21 August, 2020; v1 submitted 5 July, 2020; originally announced July 2020.

Comments: Accepted at ICML 2020

arXiv:2006.13484 [pdf, other]

Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes

Authors: Shuai Zheng, Haibin Lin, Sheng Zha, Mu Li

Abstract: BERT has recently attracted a lot of attention in natural language understanding (NLU) and achieved state-of-the-art results in various NLU tasks. However, its success requires large deep neural networks and huge amount of data, which result in long training time and impede development progress. Using stochastic gradient methods with large mini-batch has been advocated as an efficient tool to reduce the training time. Along this line of research, LAMB is a prominent example that reduces the training time of BERT from 3 days to 76 minutes on a TPUv3 Pod. In this paper, we propose an accelerated gradient method called LANS to improve the efficiency of using large mini-batches for training. As the learning rate is theoretically upper bounded by the inverse of the Lipschitz constant of the function, one cannot always reduce the number of optimization iterations by selecting a larger learning rate. In order to use larger mini-batch size without accuracy loss, we develop a new learning rate scheduler that overcomes the difficulty of using large learning rate. Using the proposed LANS method and the learning rate scheme, we scaled up the mini-batch sizes to 96K and 33K in phases 1 and 2 of BERT pretraining, respectively. It takes 54 minutes on 192 AWS EC2 P3dn.24xlarge instances to achieve a target F1 score of 90.5 or higher on SQuAD v1.1, achieving the fastest BERT training time in the cloud.

Submitted 18 September, 2020; v1 submitted 24 June, 2020; originally announced June 2020.

Comments: Technical Report (not under review at any venue)

arXiv:2005.13590 [pdf, other]

Demystifying Orthogonal Monte Carlo and Beyond

Authors: Han Lin, Haoxian Chen, Tianyi Zhang, Clement Laroche, Krzysztof Choromanski

Abstract: Orthogonal Monte Carlo (OMC) is a very effective sampling algorithm imposing structural geometric conditions (orthogonality) on samples for variance reduction. Due to its simplicity and superior performance as compared to its Quasi Monte Carlo counterparts, OMC is used in a wide spectrum of challenging machine learning applications ranging from scalable kernel methods to predictive recurrent neural networks, generative models and reinforcement learning. However theoretical understanding of the method remains very limited. In this paper we shed new light on the theoretical principles behind OMC, applying theory of negatively dependent random variables to obtain several new concentration results. We also propose a novel extensions of the method leveraging number theory techniques and particle algorithms, called Near-Orthogonal Monte Carlo (NOMC). We show that NOMC is the first algorithm consistently outperforming OMC in applications ranging from kernel methods to approximating distances in probabilistic metric spaces.

Submitted 27 May, 2020; originally announced May 2020.

Comments: 22 pages, 4 figures

arXiv:2005.09159 [pdf, other]

Sketch-BERT: Learning Sketch Bidirectional Encoder Representation from Transformers by Self-supervised Learning of Sketch Gestalt

Authors: Hangyu Lin, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue

Abstract: Previous researches of sketches often considered sketches in pixel format and leveraged CNN based models in the sketch understanding. Fundamentally, a sketch is stored as a sequence of data points, a vector format representation, rather than the photo-realistic image of pixels. SketchRNN studied a generative neural representation for sketches of vector format by Long Short Term Memory networks (LSTM). Unfortunately, the representation learned by SketchRNN is primarily for the generation tasks, rather than the other tasks of recognition and retrieval of sketches. To this end and inspired by the recent BERT model, we present a model of learning Sketch Bidirectional Encoder Representation from Transformer (Sketch-BERT). We generalize BERT to sketch domain, with the novel proposed components and pre-training algorithms, including the newly designed sketch embedding networks, and the self-supervised learning of sketch gestalt. Particularly, towards the pre-training task, we present a novel Sketch Gestalt Model (SGM) to help train the Sketch-BERT. Experimentally, we show that the learned representation of Sketch-BERT can help and improve the performance of the downstream tasks of sketch recognition, sketch retrieval, and sketch gestalt.

Submitted 18 May, 2020; originally announced May 2020.

Comments: Accepted to CVPR 2020

arXiv:2002.03273 [pdf, ps, other]

On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions

Authors: Yossi Arjevani, Amit Daniely, Stefanie Jegelka, Hongzhou Lin

Abstract: Recent advances in randomized incremental methods for minimizing $L$-smooth $μ$-strongly convex finite sums have culminated in tight complexity of $\tilde{O}((n+\sqrt{n L/μ})\log(1/ε))$ and $O(n+\sqrt{nL/ε})$, where $μ>0$ and $μ=0$, respectively, and $n$ denotes the number of individual functions. Unlike incremental methods, stochastic methods for finite sums do not rely on an explicit knowledge of which individual function is being addressed at each iteration, and as such, must perform at least $Ω(n^2)$ iterations to obtain $O(1/n^2)$-optimal solutions. In this work, we exploit the finite noise structure of finite sums to derive a matching $O(n^2)$-upper bound under the global oracle model, showing that this lower bound is indeed tight. Following a similar approach, we propose a novel adaptation of SVRG which is both \emph{compatible with stochastic oracles}, and achieves complexity bounds of $\tilde{O}((n^2+n\sqrt{L/μ})\log(1/ε))$ and $O(n\sqrt{L/ε})$, for $μ>0$ and $μ=0$, respectively. Our bounds hold w.h.p. and match in part existing lower bounds of $\tilde{Ω}(n^2+\sqrt{nL/μ}\log(1/ε))$ and $\tilde{Ω}(n^2+\sqrt{nL/ε})$, for $μ>0$ and $μ=0$, respectively.

Submitted 8 February, 2020; originally announced February 2020.

arXiv:2001.09832 [pdf, other]

Polygames: Improved Zero Learning

Authors: Tristan Cazenave, Yen-Chi Chen, Guan-Wei Chen, Shi-Yu Chen, Xian-Dong Chiu, Julien Dehos, Maria Elsa, Qucheng Gong, Hengyuan Hu, Vasil Khalidov, Cheng-Ling Li, Hsin-I Lin, Yu-** Lin, Xavier Martinet, Vegard Mella, Jeremy Rapin, Baptiste Roziere, Gabriel Synnaeve, Fabien Teytaud, Olivier Teytaud, Shi-Cheng Ye, Yi-Jun Ye, Shi-Jim Yen, Sergey Zagoruyko

Abstract: Since DeepMind's AlphaZero, Zero learning quickly became the state-of-the-art method for many board games. It can be improved using a fully convolutional structure (no fully connected layer). Using such an architecture plus global pooling, we can create bots independent of the board size. The training can be made more robust by keeping track of the best checkpoints during the training and by training against them. Using these features, we release Polygames, our framework for Zero learning, with its library of games and its checkpoints. We won against strong humans at the game of Hex in 19x19, which was often said to be intractable for zero learning; and in Havannah. We also won several first places at the TAAI competitions.

Submitted 27 January, 2020; originally announced January 2020.

arXiv:2001.04345 [pdf, ps, other]

Shareable Representations for Search Query Understanding

Authors: Mukul Kumar, Youna Hu, Will Headden, Rahul Goutam, Heran Lin, Bing Yin

Abstract: Understanding search queries is critical for shopping search engines to deliver a satisfying customer experience. Popular shopping search engines receive billions of unique queries yearly, each of which can depict any of hundreds of user preferences or intents. In order to get the right results to customers it must be known that queries like "inexpensive prom dresses" are intended to not only surface results of a certain product type but also products with a low price. Referred to as query intents, examples also include preferences for author, brand, age group, or simply a need for customer service. Recent works such as BERT have demonstrated the success of a large transformer encoder architecture with language model pre-training on a variety of NLP tasks. We adapt such an architecture to learn intents for search queries and describe methods to account for the noisiness and sparseness of search query data. We also describe cost effective ways of hosting transformer encoder models in context with low latency requirements. With the right domain-specific training we can build a shareable deep learning model whose internal representation can be reused for a variety of query understanding tasks including query intent identification. Model sharing allows for fewer large models needed to be served at inference time and provides a platform to quickly build and roll out new search query classifiers.

Submitted 20 December, 2019; originally announced January 2020.

arXiv:1912.07663 [pdf, other]

Spatial-Temporal Self-Attention Network for Flow Prediction

Authors: Haoxing Lin, Weijia Jia, Yi** Sun, Yongjian You

Abstract: Flow prediction (e.g., crowd flow, traffic flow) with spatial-temporal features is increasingly investigated in the AI research field. It is very challenging due to the complicated spatial dependencies between different locations and dynamic temporal dependencies among different time intervals. Although measurements of both dependencies are employed, existing methods suffer from the following two problems. First, the temporal dependencies are measured either uniformly or with a bias against long-term dependencies, which overlooks the distinctive impacts of short-term and long-term temporal dependencies. Second, the existing methods capture spatial and temporal dependencies independently, which wrongly assumes that the correlations between these dependencies are weak and ignores the complicated mutual influences between them. To address these issues, we propose a Spatial-Temporal Self-Attention Network (ST-SAN). As the path-length of attending long-term dependency is shorter in the self-attention mechanism, the vanishing of long-term temporal dependencies is prevented. In addition, since our model relies solely on attention mechanisms, the spatial and temporal dependencies can be simultaneously measured. Experimental results on real-world data demonstrate that, in comparison with state-of-the-art methods, our model reduces the root mean square errors by 9% in inflow prediction and 4% in outflow prediction on Taxi-NYC data, which is very significant compared to the previous improvement.

Submitted 22 December, 2019; v1 submitted 13 December, 2019; originally announced December 2019.

Comments: 8 pages

arXiv:1911.09030 [pdf, other]

Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates

Authors: Cong Xie, Oluwasanmi Koyejo, Indranil Gupta, Haibin Lin

Abstract: When scaling distributed training, the communication overhead is often the bottleneck. In this paper, we propose a novel SGD variant with reduced communication and adaptive learning rates. We prove the convergence of the proposed algorithm for smooth but non-convex problems. Empirical results show that the proposed algorithm significantly reduces the communication overhead, which, in turn, reduces the training time by up to 30% for the 1B word dataset.

Submitted 4 December, 2020; v1 submitted 20 November, 2019; originally announced November 2019.

arXiv:1910.13188 [pdf, other]

Learning from Label Proportions with Consistency Regularization

Authors: Kuen-Han Tsai, Hsuan-Tien Lin

Abstract: The problem of learning from label proportions (LLP) involves training classifiers with weak labels on bags of instances, rather than strong labels on individual instances. The weak labels only contain the label proportion of each bag. The LLP problem is important for many practical applications that only allow label proportions to be collected because of data privacy or annotation cost, and has recently received lots of research attention. Most existing works focus on extending supervised learning models to solve the LLP problem, but the weak learning nature makes it hard to further improve LLP performance with a supervised angle. In this paper, we take a different angle from semi-supervised learning. In particular, we propose a novel model inspired by consistency regularization, a popular concept in semi-supervised learning that encourages the model to produce a decision boundary that better describes the data manifold. With the introduction of consistency regularization, we further extend our study to non-uniform bag-generation and validation-based parameter-selection procedures that better match practical needs. Experiments not only justify that LLP with consistency regularization achieves superior performance, but also demonstrate the practical usability of the proposed procedures.

Submitted 29 October, 2019; originally announced October 2019.

arXiv:1910.08664 [pdf, ps, other]

Latent Variable Model for Multivariate Data with Measure-specific Sample Weights and Its Application in Hospital Compare

Authors: Chengan Du, Shu-Xia Li, Zhenqiu Lin, Haiqun Lin

Abstract: We developed a single factor model with measure-specific sample weights for multivariate data with multiple observed indicators clustered within a higher level subject. The factor is therefore a latent variable shared by multiple indicators within a same subject and the sample weights are different across different indicators and different subjects. Even after integrating out the latent variable, the likelihood of the data cannot be written as the sum of weighted likelihood of each subject because a subject has different sample weights respectively for its multiple indicators. In addition, the number of available indicators varies across subjects. We derive a pseudo likelihood for the latent variable model with measure-specific weights. We investigate various statistical properties of the latent variable model with measure-specific sample weights and its connection to the traditional factor analysis. We found that the latent variable model provides consistent estimates for its variances when the measure-specific sample weights are properly re-scaled. Two estimation procedures are developed - EM algorithm for the pseudo likelihood and marginalization of the pseudo likelihood by directly integrating out the latent variable to obtain the parameter estimates. This approach is illustrated by the analysis of publicly reported hospitals with indicators and sample weights. Numerical studies are conducted to investigate the influence of weights and their sample distribution.

Submitted 18 October, 2019; originally announced October 2019.

arXiv:1909.11616 [pdf, other]

Benchmarking Tropical Cyclone Rapid Intensification with Satellite Images and Attention-based Deep Models

Authors: Ching-Yuan Bai, Buo-Fu Chen, Hsuan-Tien Lin

Abstract: Rapid intensification (RI) of tropical cyclones often causes major destruction to human civilization due to short response time. It is an important yet challenging task to accurately predict this kind of extreme weather event in advance. Traditionally, meteorologists tackle the task with human-driven feature extraction and predictor correction procedures. Nevertheless, these procedures do not leverage the power of modern machine learning models and abundant sensor data, such as satellite images. In addition, the human-driven nature of such an approach makes it difficult to reproduce and benchmark prediction models. In this study, we build a benchmark for RI prediction using only satellite images, which are underutilized in traditional techniques. The benchmark follows conventional data science practices, making it easier for data scientists to contribute to RI prediction. We demonstrate the usefulness of the benchmark by designing a domain-inspired spatiotemporal deep learning model. The results showcase the promising performance of deep learning in solving complex meteorological problems such as RI prediction.

Submitted 24 September, 2020; v1 submitted 25 September, 2019; originally announced September 2019.

Comments: In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), September 2020

arXiv:1909.10582 [pdf, other]

Kalman Filtering with Gaussian Processes Measurement Noise

Authors: Vince Kurtz, Hai Lin

Abstract: Real-world measurement noise in applications like robotics is often correlated in time, but we typically assume i.i.d. Gaussian noise for filtering. We propose general Gaussian Processes as a non-parametric model for correlated measurement noise that is flexible enough to accurately reflect correlation in time, yet simple enough to enable efficient computation. We show that this model accurately reflects the measurement noise resulting from vision-based Simultaneous Localization and Mapping (SLAM), and argue that it provides a flexible means of modeling measurement noise for a wide variety of sensor systems and perception algorithms. We then extend existing results for Kalman filtering with autoregressive processes to more general Gaussian Processes, and demonstrate the improved performance of our approach.

Submitted 23 September, 2019; originally announced September 2019.

arXiv:1909.08417 [pdf, other]

Persistence B-Spline Grids: Stable Vector Representation of Persistence Diagrams Based on Data Fitting

Authors: Zhetong Dong, Hongwei Lin, Chi Zhou

Abstract: Many attempts have been made in recent decades to integrate machine learning (ML) and topological data analysis. A prominent problem in applying persistent homology to ML tasks is finding a vector representation of a persistence diagram (PD), which is a summary diagram for representing topological features. From the perspective of data fitting, a stable vector representation, namely, persistence B-spline grid (PBSG), is proposed based on the efficient technique of progressive-iterative approximation for least-squares B-spline function fitting. We theoretically prove that the PBSG method is stable with respect to the metric of 1-Wasserstein distance defined on the PD space. The proposed method was tested on a synthetic data set, data sets of randomly generated PDs, data of a dynamical system, and 3D CAD models, showing its effectiveness and efficiency.

Submitted 22 April, 2022; v1 submitted 17 September, 2019; originally announced September 2019.

arXiv:1909.04323 [pdf]

Investigating the completeness and omission roads of OpenStreetMap data in Hubei, China by comparing with Street Map and Street View

Authors: Qi Zhou, Hao Lin

Abstract: OpenStreetMap (OSM) is a free map of the world which can be edited by global volunteers. Existing studies have shown that completeness of OSM road data in some developing countries (e.g. China) is much lower, resulting in concern in utilizing the data in various applications. But very few have focused on investigating what types of road are still poorly mapped. This study aims not only to investigate the completeness of OSM road datasets in China but also to investigate what types of road (called omission roads) have not been mapped, which is achieved by referring to both Street Map and Street View. 16 prefecture-level divisions in the urban areas of Hubei (China) were used as study areas. Results showed that: (1) the completeness for most prefecture-level divisions was at a low-to-medium level; most roads (in the Street Map), however, with traffic conditions had already been mapped well. (2) Most of the omission OSM roads were either private roads, or public roads not having yet been named and with only one single lane, indicating their lack of importance in the urban road network. We argue that although the OSM road datasets in China are incomplete, they may still be used for several applications.

Submitted 10 September, 2019; originally announced September 2019.

arXiv:1907.04433 [pdf, other]

GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing

Authors: Jian Guo, He He, Tong He, Leonard Lausen, Mu Li, Haibin Lin, Xingjian Shi, Chenguang Wang, Junyuan Xie, Sheng Zha, Aston Zhang, Hang Zhang, Zhi Zhang, Zhongyue Zhang, Shuai Zheng, Yi Zhu

Abstract: We present GluonCV and GluonNLP, the deep learning toolkits for computer vision and natural language processing based on Apache MXNet (incubating). These toolkits provide state-of-the-art pre-trained models, training scripts, and training logs, to facilitate rapid prototyping and promote reproducible research. We also provide modular APIs with flexible building blocks to enable efficient customization. Leveraging the MXNet ecosystem, the deep learning models in GluonCV and GluonNLP can be deployed onto a variety of platforms with different programming languages. The Apache 2.0 license has been adopted by GluonCV and GluonNLP to allow for software distribution, modification, and usage.

Submitted 12 February, 2020; v1 submitted 9 July, 2019; originally announced July 2019.

Journal ref: Journal of Machine Learning Research 21 (2020) 1-7

arXiv:1905.10834 [pdf, other]

ABCD Neurocognitive Prediction Challenge 2019: Predicting individual residual fluid intelligence scores from cortical grey matter morphology

Authors: Neil P. Oxtoby, Fabio S. Ferreira, Agoston Mihalik, Tong Wu, Mikael Brudfors, Hongxiang Lin, Anita Rau, Stefano B. Blumberg, Maria Robu, Cemre Zor, Maira Tariq, Maria Del Mar Estarellas Garcia, Baris Kanber, Daniil I. Nikitichev, Janaina Mourao-Miranda

Abstract: We predicted residual fluid intelligence scores from T1-weighted MRI data available as part of the ABCD NP Challenge 2019, using morphological similarity of grey-matter regions across the cortex. Individual structural covariance networks (SCN) were abstracted into graph-theory metrics averaged over nodes across the brain and in data-driven communities/modules. Metrics included degree, path length, clustering coefficient, centrality, rich club coefficient, and small-worldness. These features derived from the training set were used to build various regression models for predicting residual fluid intelligence scores, with performance evaluated both using cross-validation within the training set and using the held-out validation set. Our predictions on the test set were generated with a support vector regression model trained on the training set. We found minimal improvement over predicting a zero residual fluid intelligence score across the sample population, implying that structural covariance networks calculated from T1-weighted MR imaging data provide little information about residual fluid intelligence.

Submitted 26 May, 2019; originally announced May 2019.

Comments: 8 pages plus references, 3 figures, 2 tables. Submission to the ABCD Neurocognitive Prediction Challenge at MICCAI 2019

arXiv:1905.10831 [pdf, other]

ABCD Neurocognitive Prediction Challenge 2019: Predicting individual fluid intelligence scores from structural MRI using probabilistic segmentation and kernel ridge regression

Authors: Agoston Mihalik, Mikael Brudfors, Maria Robu, Fabio S. Ferreira, Hongxiang Lin, Anita Rau, Tong Wu, Stefano B. Blumberg, Baris Kanber, Maira Tariq, Maria Del Mar Estarellas Garcia, Cemre Zor, Daniil I. Nikitichev, Janaina Mourao-Miranda, Neil P. Oxtoby

Abstract: We applied several regression and deep learning methods to predict fluid intelligence scores from T1-weighted MRI scans as part of the ABCD Neurocognitive Prediction Challenge (ABCD-NP-Challenge) 2019. We used voxel intensities and probabilistic tissue-type labels derived from these as features to train the models. The best predictive performance (lowest mean-squared error) came from Kernel Ridge Regression (KRR; $λ=10$), which produced a mean-squared error of 69.7204 on the validation set and 92.1298 on the test set. This placed our group in the fifth position on the validation leader board and first place on the final (test) leader board.

Submitted 26 May, 2019; originally announced May 2019.

Comments: Winning entry in the ABCD Neurocognitive Prediction Challenge at MICCAI 2019. 7 pages plus references, 3 figures, 1 table

arXiv:1904.12043 [pdf, other]

Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources

Authors: Haibin Lin, Hang Zhang, Yifei Ma, Tong He, Zhi Zhang, Sheng Zha, Mu Li

Abstract: With an increasing demand for training powers for deep learning algorithms and the rapid growth of computation resources in data centers, it is desirable to dynamically schedule different distributed deep learning tasks to maximize resource utilization and reduce cost. In this process, different tasks may receive varying numbers of machines at different times, a setting we call elastic distributed training. Despite the recent successes in large mini-batch distributed training, these methods are rarely tested in elastic distributed training environments and suffer degraded performance in our experiments, when we adjust the learning rate linearly immediately with respect to the batch size. One difficulty we observe is that the noise in the stochastic momentum estimation is accumulated over time and will have delayed effects when the batch size changes. We therefore propose to smoothly adjust the learning rate over time to alleviate the influence of the noisy momentum estimation. Our experiments on image classification, object detection and semantic segmentation have demonstrated that our proposed Dynamic SGD method achieves stabilized performance when varying the number of GPUs from 8 to 128. We also provide theoretical understanding on the optimality of linear learning rate scheduling and the effects of stochastic momentum.

Submitted 2 May, 2019; v1 submitted 26 April, 2019; originally announced April 2019.
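
The contrast drawn in the abstract above, between rescaling the learning rate immediately when the batch size changes and adjusting it smoothly over time, can be illustrated with a minimal sketch. The linear ramp, the `ramp_steps` parameter, and both function names are assumptions made for illustration only; the abstract does not specify the schedule the authors actually use.

```python
def linear_scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: learning rate proportional to batch size."""
    return base_lr * batch / base_batch


def smooth_lr(old_lr, new_lr, step, ramp_steps):
    """Move from old_lr to new_lr over ramp_steps steps (hypothetical linear ramp).

    step counts the iterations since the batch size changed.
    """
    if ramp_steps <= 0 or step >= ramp_steps:
        return new_lr
    frac = step / ramp_steps
    return old_lr + frac * (new_lr - old_lr)


# Hypothetical numbers: the batch size doubles from 256 to 512.
old_lr = 0.1
target_lr = linear_scaled_lr(old_lr, 256, 512)
schedule = [smooth_lr(old_lr, target_lr, s, 10) for s in range(12)]
```

Here `schedule` starts at the old rate and reaches the linearly scaled target after ten steps, instead of jumping to it at step 0.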

arXiv:1904.00284 [pdf, other]

COCO-GAN: Generation by Parts via Conditional Coordinating

Authors: Chieh Hubert Lin, Chia-Che Chang, Yu-Sheng Chen, Da-Cheng Juan, Wei Wei, Hwann-Tzong Chen

Abstract: Humans can only interact with part of the surrounding environment due to biological restrictions. Therefore, we learn to reason the spatial relationships across a series of observations to piece together the surrounding environment. Inspired by such behavior and the fact that machines also have computational constraints, we propose \underline{CO}nditional \underline{CO}ordinate GAN (COCO-GAN) of which the generator generates images by parts based on their spatial coordinates as the condition. On the other hand, the discriminator learns to justify realism across multiple assembled patches by global coherence, local appearance, and edge-crossing continuity. Although the full images are never generated during training, we show that COCO-GAN can produce \textbf{state-of-the-art-quality} full images during inference. We further demonstrate a variety of novel applications enabled by teaching the network to be aware of coordinates. First, we perform extrapolation to the learned coordinate manifold and generate off-the-boundary patches. Combining with the originally generated full image, COCO-GAN can produce images that are larger than training samples, which we call "beyond-boundary generation". We then showcase panorama generation within a cylindrical coordinate system that inherently preserves horizontally cyclic topology. On the computation side, COCO-GAN has a built-in divide-and-conquer paradigm that reduces memory requisition during training and inference, provides high-parallelism, and can generate parts of images on-demand.

Submitted 5 January, 2020; v1 submitted 30 March, 2019; originally announced April 2019.

Comments: Accepted to ICCV'19 (oral). All images are compressed due to size limit, please access the full-resolution version via Google Drive: http://bit.ly/COCO-GAN-full

arXiv:1812.06600 [pdf, other]

Double Deep Q-Learning for Optimal Execution

Authors: Brian Ning, Franco Ho Ting Lin, Sebastian Jaimungal

Abstract: Optimal trade execution is an important problem faced by essentially all traders. Much research into optimal execution uses stringent model assumptions and applies continuous time stochastic control to solve them. Here, we instead take a model free approach and develop a variation of Deep Q-Learning to estimate the optimal actions of a trader. The model is a fully connected Neural Network trained using Experience Replay and Double DQN with input features given by the current state of the limit order book, other trading signals, and available execution actions, while the output is the Q-value function estimating the future rewards under an arbitrary action. We apply our model to nine different stocks and find that it outperforms the standard benchmark approach on most stocks using the measures of (i) mean and median out-performance, (ii) probability of out-performance, and (iii) gain-loss ratios.

Submitted 8 June, 2020; v1 submitted 16 December, 2018; originally announced December 2018.

Comments: 20 pages, 7 figures, 1 table. Updated minor typos

MSC Class: 91G99; 93E35
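
For readers unfamiliar with the Double DQN technique named in the abstract above, the standard target computation can be sketched as follows: the online network selects the next action and a separate target network evaluates it. This is the generic textbook form and not the authors' implementation; the function name, argument layout, and the plain-list Q-values are assumptions for illustration.

```python
def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Generic Double DQN bootstrap target for one transition.

    q_online_next: Q-values of the online network at the next state, one per action.
    q_target_next: Q-values of the target network at the same next state.
    """
    if done:
        # Terminal transition: no bootstrapping.
        return reward
    # Action selection uses the online network...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ...while action evaluation uses the target network.
    return reward + gamma * q_target_next[best_action]


# Hypothetical values: the online network prefers action 1,
# so the target network's estimate for action 1 is used.
y = double_dqn_target(1.0, 0.9, [0.2, 0.8, 0.5], [1.0, 2.0, 3.0], False)
```

Decoupling selection from evaluation in this way is what distinguishes Double DQN from the plain DQN target, which takes the maximum over the target network's own Q-values.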

Showing 1–50 of 73 results for author: Lin, H