-
Deterministic and Stochastic Frank-Wolfe Recursion on Probability Spaces
Authors:
Di Yu,
Shane G. Henderson,
Raghu Pasupathy
Abstract:
Motivated by applications in emergency response and experimental design, we consider smooth stochastic optimization problems over probability measures supported on compact subsets of the Euclidean space. With the influence function as the variational object, we construct a deterministic Frank-Wolfe (dFW) recursion for probability spaces, made especially possible by a lemma that identifies a ``closed-form'' solution to the infinite-dimensional Frank-Wolfe sub-problem. Each iterate in dFW is expressed as a convex combination of the incumbent iterate and a Dirac measure concentrating on the minimum of the influence function at the incumbent iterate. To address common application contexts that have access only to Monte Carlo observations of the objective and influence function, we construct a stochastic Frank-Wolfe (sFW) variation that generates a random sequence of probability measures constructed using minima of increasingly accurate estimates of the influence function. We demonstrate that sFW's optimality gap sequence exhibits $O(k^{-1})$ iteration complexity almost surely and in expectation for smooth convex objectives, and $O(k^{-1/2})$ (in Frank-Wolfe gap) for smooth non-convex objectives. Furthermore, we show that an easy-to-implement fixed-step, fixed-sample version of (sFW) exhibits exponential convergence to $\varepsilon$-optimality. We end with a central limit theorem on the observed objective values at the sequence of generated random measures. To further intuition, we include several illustrative examples with exact influence function calculations.
Submitted 29 June, 2024;
originally announced July 2024.
-
Efficient Algorithms for Empirical Group Distributional Robust Optimization and Beyond
Authors:
Dingzhi Yu,
Yunuo Cai,
Wei Jiang,
Lijun Zhang
Abstract:
We investigate the empirical counterpart of group distributionally robust optimization (GDRO), which aims to minimize the maximal empirical risk across $m$ distinct groups. We formulate empirical GDRO as a $\textit{two-level}$ finite-sum convex-concave minimax optimization problem and develop a stochastic variance reduced mirror prox algorithm. Unlike existing methods, we construct the stochastic gradient by per-group sampling technique and perform variance reduction for all groups, which fully exploits the $\textit{two-level}$ finite-sum structure of empirical GDRO. Furthermore, we compute the snapshot and mirror snapshot point by a one-index-shifted weighted average, which distinguishes us from the naive ergodic average. Our algorithm also supports non-constant learning rates, which is different from existing literature. We establish convergence guarantees both in expectation and with high probability, demonstrating a complexity of $\mathcal{O}\left(\frac{m\sqrt{\bar{n}\ln{m}}}{\varepsilon}\right)$, where $\bar n$ is the average number of samples among $m$ groups. Remarkably, our approach outperforms the state-of-the-art method by a factor of $\sqrt{m}$. Furthermore, we extend our methodology to deal with the empirical minimax excess risk optimization (MERO) problem and manage to give the expectation bound and the high probability bound, accordingly. The complexity of our empirical MERO algorithm matches that of empirical GDRO at $\mathcal{O}\left(\frac{m\sqrt{\bar{n}\ln{m}}}{\varepsilon}\right)$, significantly surpassing the bounds of existing methods.
Submitted 6 March, 2024;
originally announced March 2024.
-
Model Uncertainty and Selection of Risk Models for Left-Truncated and Right-Censored Loss Data
Authors:
Qian Zhao,
Sahadeb Upretee,
Daoping Yu
Abstract:
Insurance loss data are usually left-truncated and right-censored due to deductibles and policy limits, respectively. This paper investigates the model uncertainty and selection procedure when various parametric models are constructed to accommodate such left-truncated and right-censored data. The joint asymptotic properties of the estimators have been established using the Delta method along with Maximum Likelihood Estimation when the model is specified. We conduct simulation studies using Fisk, Lognormal, Lomax, Paralogistic, and Weibull distributions with various proportions of loss data below deductibles and above policy limits. A variety of graphic tools, hypothesis tests, and penalized likelihood criteria are employed to validate the models, and their performance in model selection is evaluated through the probability of each parent distribution being correctly selected. The effectiveness of each tool in model selection is also illustrated using well-studied data that represent Wisconsin property losses in the United States from 2007 to 2010.
Submitted 30 January, 2024;
originally announced January 2024.
-
A Longitudinal Analysis about the Effect of Air Pollution on Astigmatism for Children and Young Adults
Authors:
Lin An,
Qiuyue Hu,
Jieying Guan,
Yingting Zhu,
Chenyao Jiang,
Xiaoyun Zhong,
Shuyue Ma,
Dongmei Yu,
Canyang Zhang,
Yehong Zhuo,
Peiwu Qin
Abstract:
Purpose: This study aimed to investigate the correlation between air pollution and astigmatism, considering the detrimental effects of air pollution on respiratory, cardiovascular, and eye health. Methods: A longitudinal study was conducted with 127,709 individuals aged 4-27 years from 9 cities in Guangdong Province, China, spanning from 2019 to 2021. Astigmatism was measured using cylinder values. Multiple measurements were taken at intervals of at least 1 year. Various exposure windows were used to assess the lagged impacts of air pollution on astigmatism. A panel data model with random effects was constructed to analyze the relationship between pollutant exposure and astigmatism. Results: The study revealed significant associations between astigmatism and exposure to carbon monoxide (CO), nitrogen dioxide (NO2), and particulate matter (PM2.5) over time. A 10 μg/m3 increase in a 3-year exposure window of NO2 and PM2.5 was associated with a decrease in cylinder value of -0.045 diopters and -0.017 diopters, respectively. A 0.1 mg/m3 increase in CO concentration within a 2-year exposure window correlated with a decrease in cylinder value of -0.009 diopters. No significant relationships were found between PM10 exposure and astigmatism. Conclusion: This study concluded that greater exposure to NO2 and PM2.5 over longer periods aggravates astigmatism. The negative effect of CO on astigmatism peaks in the exposure window of 2 years prior to examination and diminishes afterward. No significant association was found between PM10 exposure and astigmatism, suggesting that gaseous and smaller particulate pollutants have easier access to human eyes, causing heterogeneous morphological changes to the eyeball.
Submitted 13 October, 2023;
originally announced October 2023.
-
Optimal Weighted Random Forests
Authors:
Xinyu Chen,
Dalei Yu,
Xinyu Zhang
Abstract:
The random forest (RF) algorithm has become a very popular prediction method for its great flexibility and promising accuracy. In RF, it is conventional to put equal weights on all the base learners (trees) to aggregate their predictions. However, the predictive performances of different trees within the forest can be very different due to the randomization of the embedded bootstrap sampling and feature selection. In this paper, we focus on RF for regression and propose two optimal weighting algorithms, namely the 1 Step Optimal Weighted RF (1step-WRF$_\mathrm{opt}$) and 2 Steps Optimal Weighted RF (2steps-WRF$_\mathrm{opt}$), that combine the base learners through the weights determined by weight choice criteria. Under some regularity conditions, we show that these algorithms are asymptotically optimal in the sense that the resulting squared loss and risk are asymptotically identical to those of the infeasible but best possible model averaging estimator. Numerical studies conducted on real-world data sets indicate that these algorithms outperform the equal-weight forest and two other weighted RFs proposed in existing literature in most cases.
Submitted 17 May, 2023;
originally announced May 2023.
-
Measuring Discrete Risks on Infinite Domains: Theoretical Foundations, Conditional Five Number Summaries, and Data Analyses
Authors:
Daoping Yu,
Vytaras Brazauskas,
Ricardas Zitikis
Abstract:
To accommodate numerous practical scenarios, in this paper we extend statistical inference for smoothed quantile estimators from finite domains to infinite domains. We accomplish the task with the help of a newly designed truncation methodology for discrete loss distributions with infinite domains. A simulation study illustrates the methodology in the case of several distributions, such as Poisson, negative binomial, and their zero-inflated versions, which are commonly used in the insurance industry to model claim frequencies. Additionally, we propose a very flexible bootstrap-based approach for use in practice. Using automobile accident data and their modifications, we compute what we have termed the conditional five number summary (C5NS) for the tail risk and construct confidence intervals for each of the five quantiles making up C5NS, and then calculate the tail probabilities. The results show that the smoothed quantile approach not only classifies the tail riskiness of portfolios more accurately but also produces lower coefficients of variation in the estimation of tail probabilities than those obtained using the linear interpolation approach.
Submitted 5 April, 2023;
originally announced April 2023.
-
Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping
Authors:
Jiyan He,
Xuechen Li,
Da Yu,
Huishuai Zhang,
Janardhan Kulkarni,
Yin Tat Lee,
Arturs Backurs,
Nenghai Yu,
Jiang Bian
Abstract:
Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of \emph{group-wise clipping}. To reduce the compute time overhead of private learning, we show that \emph{per-layer clipping}, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175 billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with \emph{per-device clipping} that clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at $ε=1$ better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.
Submitted 3 December, 2022;
originally announced December 2022.
-
eDNAPlus: A unifying modelling framework for DNA-based biodiversity monitoring
Authors:
Alex Diana,
Eleni Matechou,
Jim Griffin,
Douglas Yu,
Mingjie Luo,
Marie Tosa,
Alex Bush,
Richard Griffiths
Abstract:
DNA-based biodiversity surveys involve collecting physical samples from survey sites and assaying the contents in the laboratory to detect species via their diagnostic DNA sequences. DNA-based surveys are increasingly being adopted for biodiversity monitoring. The most commonly employed method is metabarcoding, which combines PCR with high-throughput DNA sequencing to amplify and then read `DNA barcode' sequences. This process generates count data indicating the number of times each DNA barcode was read. However, DNA-based data are noisy and error-prone, with several sources of variation. In this paper, we present a unifying modelling framework for DNA-based data allowing for all key sources of variation and error in the data-generating process. The model can estimate within-species biomass changes across sites and link those changes to environmental covariates, while accounting for species and sites correlation. Inference is performed using MCMC, where we employ Gibbs or Metropolis-Hastings updates with Laplace approximations. We also implement a re-parameterisation scheme, appropriate for crossed-effects models, leading to improved mixing, and an adaptive approach for updating latent variables, reducing computation time. We discuss study design and present theoretical and simulation results to guide decisions on replication at different stages and on the use of quality control methods. We demonstrate the new framework on a dataset of Malaise-trap samples. We quantify the effects of elevation and distance-to-road on each species, infer species correlations, and produce maps identifying areas of high biodiversity, which can be used to rank areas by conservation value. We estimate the level of noise between sites and within sample replicates, and the probabilities of error at the PCR stage, which are close to zero for most species considered, validating the employed laboratory processing.
Submitted 22 November, 2022;
originally announced November 2022.
-
New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound
Authors:
Arushi Gupta,
Nikunj Saunshi,
Dingli Yu,
Kaifeng Lyu,
Sanjeev Arora
Abstract:
Saliency methods compute heat maps that highlight portions of an input that were most {\em important} for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new {\em masked input} by retaining the $k$ highest-ranked pixels of the original input and replacing the rest with ``uninformative'' pixels, and checking if the net's output is mostly unchanged. This is usually seen as an {\em explanation} of the output, but the current paper highlights reasons why this inference of causality may be suspect. Inspired by logic concepts of {\em completeness \& soundness}, it observes that the above type of evaluation focuses on completeness of the explanation, but ignores soundness. New evaluation metrics are introduced to capture both notions, while staying in an {\em intrinsic} framework -- i.e., using the dataset and the net, but no separately trained nets, human evaluations, etc. A simple saliency method is described that matches or outperforms prior methods in the evaluations. Experiments also suggest new intrinsic justifications, based on soundness, for popular heuristic tricks such as TV regularization and upsampling.
Submitted 5 November, 2022;
originally announced November 2022.
-
Conformalized Fairness via Quantile Regression
Authors:
Meichen Liu,
Lei Ding,
Dengdeng Yu,
Wulong Liu,
Linglong Kong,
Bei Jiang
Abstract:
Algorithmic fairness has received increased attention in socially sensitive domains. While rich literature on mean fairness has been established, research on quantile fairness remains sparse but vital. To fulfill great needs and advocate the significance of quantile fairness, we propose a novel framework to learn a real-valued quantile function under the fairness requirement of Demographic Parity with respect to sensitive attributes, such as race or gender, and thereby derive a reliable fair prediction interval. Using optimal transport and functional synchronization techniques, we establish theoretical guarantees of distribution-free coverage and exact fairness for the induced prediction interval constructed by fair quantiles. A hands-on pipeline is provided to incorporate flexible quantile regressions with an efficient fairness adjustment post-processing algorithm. We demonstrate the superior empirical performance of this approach on several benchmark datasets. Our results show the model's ability to uncover the mechanism underlying the fairness-accuracy trade-off in a wide range of societal and medical applications.
Submitted 14 October, 2022; v1 submitted 5 October, 2022;
originally announced October 2022.
-
Testing Independence of Bivariate Censored Data using Random Walk on Restricted Permutation Graph
Authors:
Seonghun Cho,
Donghyeon Yu,
Johan Lim
Abstract:
In this paper, we propose a procedure to test the independence of bivariate censored data, which is generic and applicable to any censoring types in the literature. To test the hypothesis, we consider a rank-based statistic, Kendall's tau statistic. The censored data defines a restricted permutation space of all possible ranks of the observations. We propose the statistic, the average of Kendall's tau over the ranks in the restricted permutation space. To evaluate the statistic and its reference distribution, we develop a Markov chain Monte Carlo (MCMC) procedure to obtain uniform samples on the restricted permutation space and numerically approximate the null distribution of the averaged Kendall's tau. We apply the procedure to three real data examples with different censoring types, and compare the results with those by existing methods. We conclude the paper with some additional discussions not given in the main body of the paper.
Submitted 11 July, 2022;
originally announced July 2022.
-
Adversarial Noises Are Linearly Separable for (Nearly) Random Neural Networks
Authors:
Huishuai Zhang,
Da Yu,
Yiping Lu,
Di He
Abstract:
Adversarial examples, which are usually generated for specific inputs with a specific model, are ubiquitous for neural networks. In this paper we unveil a surprising property of adversarial noises when they are put together, i.e., adversarial noises crafted by one-step gradient methods are linearly separable if equipped with the corresponding labels. We theoretically prove this property for a two-layer network with randomly initialized entries and the neural tangent kernel setup where the parameters are not far from initialization. The proof idea is to show the label information can be efficiently backpropagated to the input while keeping the linear separability. Our theory and experimental evidence further show that the linear classifier trained with the adversarial noises of the training data can well classify the adversarial noises of the test data, indicating that adversarial noises actually inject a distributional perturbation to the original data distribution. Furthermore, we empirically demonstrate that the adversarial noises may become less linearly separable when the above conditions are compromised while they are still much easier to classify than original features.
Submitted 9 June, 2022;
originally announced June 2022.
-
Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent
Authors:
Da Yu,
Gautam Kamath,
Janardhan Kulkarni,
Tie-Yan Liu,
Jian Yin,
Huishuai Zhang
Abstract:
Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose output-specific $(\varepsilon,δ)$-DP to characterize privacy guarantees for individual examples when releasing models trained by DP-SGD. We also design an efficient algorithm to investigate individual privacy across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bound. We further discover that the training loss and the privacy parameter of an example are well-correlated. This implies groups that are underserved in terms of model utility simultaneously experience weaker privacy guarantees. For example, on CIFAR-10, the average $\varepsilon$ of the class with the lowest test accuracy is 44.2\% higher than that of the class with the highest accuracy.
Submitted 2 September, 2023; v1 submitted 6 June, 2022;
originally announced June 2022.
-
An efficient GPU-Parallel Coordinate Descent Algorithm for Sparse Precision Matrix Estimation via Scaled Lasso
Authors:
Seunghwan Lee,
Sang Cheol Kim,
Donghyeon Yu
Abstract:
The sparse precision matrix plays an essential role in the Gaussian graphical model since a zero off-diagonal element indicates conditional independence of the corresponding two variables given others. In the Gaussian graphical model, many methods have been proposed, and their theoretical properties are given as well. Among these, the sparse precision matrix estimation via scaled lasso (SPMESL) has an attractive feature in which the penalty level is automatically set to achieve the optimal convergence rate under the sparsity and invertibility conditions. Conversely, other methods need to be used in searching for the optimal tuning parameter. Despite such an advantage, the SPMESL has not been widely used due to its expensive computational cost. In this paper, we develop a GPU-parallel coordinate descent (CD) algorithm for the SPMESL and numerically show that the proposed algorithm is much faster than the least angle regression (LARS) tailored to the SPMESL. Several comprehensive numerical studies are conducted to investigate the scalability of the proposed algorithm and the estimation performance of the SPMESL. The results show that the SPMESL has the lowest false discovery rate for all cases and the best performance in the case where the level of the sparsity of the columns is high.
Submitted 28 March, 2022;
originally announced March 2022.
-
Differentially Private Fine-tuning of Language Models
Authors:
Da Yu,
Saurabh Naik,
Arturs Backurs,
Sivakanth Gopi,
Huseyin A. Inan,
Gautam Kamath,
Janardhan Kulkarni,
Yin Tat Lee,
Andre Manoel,
Lukas Wutschitz,
Sergey Yekhanin,
Huishuai Zhang
Abstract:
We give simpler, sparser, and faster algorithms for differentially private fine-tuning of large-scale pre-trained language models, which achieve the state-of-the-art privacy versus utility tradeoffs on many standard NLP tasks. We propose a meta-framework for this problem, inspired by the recent success of highly parameter-efficient methods for fine-tuning. Our experiments show that differentially private adaptations of these approaches outperform previous private algorithms in three important dimensions: utility, privacy, and the computational and memory cost of private training. On many commonly studied datasets, the utility of private models approaches that of non-private models. For example, on the MNLI dataset we achieve an accuracy of $87.8\%$ using RoBERTa-Large and $83.5\%$ using RoBERTa-Base with a privacy budget of $ε= 6.7$. In comparison, absent privacy constraints, RoBERTa-Large achieves an accuracy of $90.2\%$. Our findings are similar for natural language generation tasks. On the DART dataset, privately fine-tuned GPT-2-Small, GPT-2-Medium, GPT-2-Large, and GPT-2-XL achieve BLEU scores of 38.5, 42.0, 43.1, and 43.8, respectively (privacy budget of $ε= 6.8, δ= 10^{-5}$), whereas the non-private baseline is $48.1$. All our experiments suggest that larger models are better suited for private fine-tuning: while they are well known to achieve superior accuracy non-privately, we find that they also better maintain their accuracy when privacy is introduced.
Submitted 14 July, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
An efficient parallel block coordinate descent algorithm for large-scale precision matrix estimation using graphics processing units
Authors:
Young-Geun Choi,
Seunghwan Lee,
Donghyeon Yu
Abstract:
Large-scale sparse precision matrix estimation has attracted wide interest from the statistics community. The convex partial correlation selection method (CONCORD) developed by Khare et al. (2015) has recently been credited with some theoretical properties for estimating sparse precision matrices. The CONCORD obtains its solution by a coordinate descent algorithm (CONCORD-CD) based on the convexity of the objective function. However, since a coordinate-wise update in CONCORD-CD is inherently serial, a scale-up is nontrivial. In this paper, we propose a novel parallelization of CONCORD-CD, namely, CONCORD-PCD. CONCORD-PCD partitions the off-diagonal elements into several groups and updates each group simultaneously without harming the computational convergence of CONCORD-CD. We guarantee this by employing the notion of edge coloring in graph theory. Specifically, we establish a nontrivial correspondence between scheduling the updates of the off-diagonal elements in CONCORD-CD and coloring the edges of a complete graph. It turns out that CONCORD-PCD simultaneously updates off-diagonal elements in which the associated edges are colorable with the same color. As a result, the number of steps required for updating off-diagonal elements reduces from p(p-1)/2 to p-1 (for even p) or p (for odd p), where p denotes the number of variables. We prove that the number of such steps is irreducible. In addition, CONCORD-PCD is tailored to single-instruction multiple-data (SIMD) parallelism. A numerical study shows that the SIMD-parallelized PCD algorithm implemented on graphics processing units (GPUs) makes the CONCORD-CD algorithm several times faster.
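Illustrative sketch (not taken from the paper): the step counts quoted in the abstract above match the classical edge coloring of a complete graph on p vertices, which needs p-1 colors for even p and p colors for odd p. The Python snippet below builds such a coloring with the standard round-robin (circle) method. The function name `edge_coloring_rounds` is hypothetical, and the circle method is an assumption chosen for illustration; the abstract does not say which coloring construction the authors use.

```python
def edge_coloring_rounds(p):
    """Split the edges of the complete graph K_p into vertex-disjoint rounds.

    Each round is a matching: no two edges in it share a vertex, so the
    off-diagonal entries (i, j) in one round touch disjoint rows/columns.
    Uses the round-robin (circle) method; this is an illustrative choice,
    not necessarily the schedule used by CONCORD-PCD.
    """
    # Pad odd p with one dummy vertex so the circle method applies.
    n = p if p % 2 == 0 else p + 1
    dummy = n - 1 if p % 2 == 1 else None
    rounds = []
    for r in range(n - 1):
        # Vertex n-1 stays fixed; the other n-1 vertices rotate around it.
        pairs = [(n - 1, r)]
        for k in range(1, n // 2):
            pairs.append(((r + k) % (n - 1), (r - k) % (n - 1)))
        # Drop the edge that involves the dummy vertex (odd p only).
        rounds.append([tuple(sorted(e)) for e in pairs if dummy not in e])
    return rounds


if __name__ == "__main__":
    for p in (6, 7):
        rounds = edge_coloring_rounds(p)
        n_edges = sum(len(m) for m in rounds)
        print(f"p={p}: {n_edges} edges scheduled in {len(rounds)} rounds")
```

Under this sketch, p = 6 gives 15 edges in 5 rounds and p = 7 gives 21 edges in 7 rounds, in line with the p-1 and p step counts stated in the abstract.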
Submitted 17 June, 2021;
originally announced June 2021.
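The step counts quoted in this abstract (p-1 for even p, p for odd p) match the number of colors in a proper edge coloring of the complete graph on p vertices. The sketch below illustrates that counting argument with the classical round-robin ("circle") construction. It is not taken from the paper: the function name `edge_coloring_schedule` and the choice of construction are assumptions made here for illustration, and the paper's own scheduling may differ.

```python
def edge_coloring_schedule(p):
    """Group the off-diagonal index pairs (i, j), i < j, of a p x p matrix
    into rounds so that no index appears twice within a round.

    Illustrative round-robin construction (an assumption, not the paper's code).
    Returns p - 1 rounds for even p and p rounds for odd p.
    """
    # Pad odd p with one dummy vertex so the vertex count n is even.
    n = p if p % 2 == 0 else p + 1
    rounds = []
    for r in range(n - 1):
        # Vertex n - 1 is held fixed; the other n - 1 vertices rotate.
        pairs = [(r, n - 1)]
        for k in range(1, n // 2):
            pairs.append(((r + k) % (n - 1), (r - k) % (n - 1)))
        # Drop any pair touching the dummy vertex (index p, present only for odd p).
        rounds.append(sorted((min(i, j), max(i, j))
                             for i, j in pairs if max(i, j) < p))
    return rounds


if __name__ == "__main__":
    for rnd in edge_coloring_schedule(6):
        print(rnd)
```

Each round is a set of index pairs that share no row or column index, so the pairs in one round could in principle be updated at the same time.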
-
Mining GIS Data to Predict Urban Sprawl
Authors:
Anita Pampoore-Thampi,
Aparna S. Varde,
Danlin Yu
Abstract:
This paper addresses the interesting problem of processing and analyzing data in geographic information systems (GIS) to achieve a clear perspective on urban sprawl. The term urban sprawl refers to overgrowth and expansion of low-density areas with issues such as car dependency and segregation between residential versus commercial use. Sprawl has impacts on the environment and public health. In our work, spatiotemporal features related to real GIS data on urban sprawl such as population growth and demographics are mined to discover knowledge for decision support. We adapt data mining algorithms, Apriori for association rule mining and J4.8 for decision tree classification to geospatial analysis, deploying the ArcGIS tool for mapping. Knowledge discovered by mining this spatiotemporal data is used to implement a prototype spatial decision support system (SDSS). This SDSS predicts whether urban sprawl is likely to occur. Further, it estimates the values of pertinent variables to understand how the variables impact each other. The SDSS can help decision-makers identify problems and create solutions for avoiding future sprawl occurrence and conducting urban planning where sprawl already occurs, thus aiding sustainable development. This work falls in the broad realm of geospatial intelligence and sets the stage for designing a large scale SDSS to process big data in complex environments, which constitutes part of our future work.
Submitted 21 March, 2021;
originally announced March 2021.
-
Multivariate functional responses low rank regression with an application to brain imaging data
Authors:
Xiucai Ding,
Dengdeng Yu,
Zhengwu Zhang,
Dehan Kong
Abstract:
We propose a multivariate functional responses low rank regression model with possible high dimensional functional responses and scalar covariates. By expanding the slope functions on a set of sieve basis, we reconstruct the basis coefficients as a matrix. To estimate these coefficients, we propose an efficient procedure using nuclear norm regularization. We also derive error bounds for our estimates and evaluate our method using simulations. We further apply our method to the Human Connectome Project neuroimaging data to predict cortical surface motor task-evoked functional magnetic resonance imaging signals using various clinical covariates to illustrate the usefulness of our results.
Submitted 7 October, 2020;
originally announced October 2020.
-
Analysis and Optimization for Large-Scale LoRa Networks: Throughput Fairness and Scalability
Authors:
Jiangbin Lyu,
Dan Yu,
Liqun Fu
Abstract:
LoRa networks are pivotally enabling Long Range connectivity to low-cost and power-constrained user equipments (UEs) in a wide area, whereas a critical issue is to effectively allocate wireless resources to support potentially massive UEs while resolving the prominent near-far fairness issue, which is challenging due to the lack of tractable analytical model and the practical requirement for low-complexity and low-overhead design. Leveraging on stochastic geometry, especially the Poisson rain model, we derive (semi-) closed-form formulas for the aggregate interference distribution, packet success probability and hence system throughput in both single-cell and multi-cell setups with frequency reuse, by accounting for channel fading, random UE distribution, partial packet overlapping, and/or multi-gateway packet reception. The analytical formulas require only average channel statistics and spatial UE distribution, which enable tractable network performance evaluation and incubate our proposed Iterative Balancing (IB) method that quickly yields high-level policies of joint spreading factor (SF) allocation, power control, and duty cycle adjustment for gauging the average max-min UE throughput or supported UE density with rate requirements. Numerical results validate the analytical formulas and the effectiveness of our proposed optimization scheme, which greatly alleviates the near-far fairness issue and reduces the spatial power consumption, while significantly improving the cell-edge throughput as well as the spatial (sum) throughput for the majority of UEs, by adapting to the UE/gateway densities.
Submitted 5 November, 2021; v1 submitted 17 August, 2020;
originally announced August 2020.
-
How Does Data Augmentation Affect Privacy in Machine Learning?
Authors:
Da Yu,
Huishuai Zhang,
Wei Chen,
Jian Yin,
Tie-Yan Liu
Abstract:
It is observed in the literature that data augmentation can significantly mitigate membership inference (MI) attack. However, in this work, we challenge this observation by proposing new MI attacks to utilize the information of augmented data. MI attack is widely used to measure the model's information leakage of the training set. We establish the optimal membership inference when the model is trained with augmented data, which inspires us to formulate the MI attack as a set classification problem, i.e., classifying a set of augmented instances instead of a single data point, and design input permutation invariant features. Empirically, we demonstrate that the proposed approach universally outperforms original methods when the model is trained with data augmentation. Even further, we show that the proposed approach can achieve higher MI attack success rates on models trained with some data augmentation than the existing methods on models trained without data augmentation. Notably, we achieve a 70.1% MI attack success rate on CIFAR10 against a wide residual network while the previous best approach only attains 61.9%. This suggests the privacy risk of models trained with data augmentation could be largely underestimated.
Submitted 26 February, 2021; v1 submitted 20 July, 2020;
originally announced July 2020.
-
Mapping the Genetic-Imaging-Clinical Pathway with Applications to Alzheimer's Disease
Authors:
Dengdeng Yu,
Linbo Wang,
Dehan Kong,
Hongtu Zhu
Abstract:
Alzheimer's disease is a progressive form of dementia that results in problems with memory, thinking, and behavior. It often starts with abnormal aggregation and deposition of beta amyloid and tau, followed by neuronal damage such as atrophy of the hippocampi, leading to Alzheimer's Disease (AD). The aim of this paper is to map the genetic-imaging-clinical pathway for AD in order to delineate the genetically regulated brain changes that drive disease progression based on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. We develop a novel two-step approach to delineate the association between high-dimensional 2D hippocampal surface exposures and the Alzheimer's Disease Assessment Scale (ADAS) cognitive score, while taking into account the ultra-high dimensional clinical and genetic covariates at baseline. Analysis results suggest that the radial distance of each pixel of both hippocampi is negatively associated with the severity of behavioral deficits conditional on observed clinical and genetic covariates. These associations are stronger in Cornu Ammonis region 1 (CA1) and subiculum subregions compared to Cornu Ammonis region 2 (CA2) and Cornu Ammonis region 3 (CA3) subregions.
Submitted 2 June, 2022; v1 submitted 9 July, 2020;
originally announced July 2020.
-
Knowledge Embedding Based Graph Convolutional Network
Authors:
Donghan Yu,
Yiming Yang,
Ruohong Zhang,
Yuexin Wu
Abstract:
Recently, a considerable literature has grown up around the theme of Graph Convolutional Network (GCN). How to effectively leverage the rich structural information in complex graphs, such as knowledge graphs with heterogeneous types of entities and relations, is a primary open challenge in the field. Most GCN methods are either restricted to graphs with a homogeneous type of edges (e.g., citation links only), or focusing on representation learning for nodes only instead of jointly propagating and updating the embeddings of both nodes and edges for target-driven objectives. This paper addresses these limitations by proposing a novel framework, namely the Knowledge Embedding based Graph Convolutional Network (KE-GCN), which combines the power of GCNs in graph-based belief propagation and the strengths of advanced knowledge embedding (a.k.a. knowledge graph embedding) methods, and goes beyond. Our theoretical analysis shows that KE-GCN offers an elegant unification of several well-known GCN methods as specific cases, with a new perspective of graph convolution. Experimental results on benchmark datasets show the advantageous performance of KE-GCN over strong baseline methods in the tasks of knowledge graph alignment and entity classification.
Submitted 23 April, 2021; v1 submitted 12 June, 2020;
originally announced June 2020.
-
Correlation-aware Unsupervised Change-point Detection via Graph Neural Networks
Authors:
Ruohong Zhang,
Yu Hao,
Donghan Yu,
Wei-Cheng Chang,
Guokun Lai,
Yiming Yang
Abstract:
Change-point detection (CPD) aims to detect abrupt changes over time series data. Intuitively, effective CPD over multivariate time series should require explicit modeling of the dependencies across input variables. However, existing CPD methods either ignore the dependency structures entirely or rely on the (unrealistic) assumption that the correlation structures are static over time. In this paper, we propose a Correlation-aware Dynamics Model for CPD, which explicitly models the correlation structure and dynamics of variables by incorporating graph neural networks into an encoder-decoder framework. Extensive experiments on synthetic and real-world datasets demonstrate the advantageous performance of the proposed model on CPD tasks over strong baselines, as well as its ability to classify the change-points as correlation changes or independent changes. Keywords: Multivariate Time Series, Change-point Detection, Graph Neural Networks
Submitted 13 September, 2020; v1 submitted 24 April, 2020;
originally announced April 2020.
-
A Unified Framework for Speech Separation
Authors:
Fahimeh Bahmaninezhad,
Shi-Xiong Zhang,
Yong Xu,
Meng Yu,
John H. L. Hansen,
Dong Yu
Abstract:
Speech separation refers to extracting each individual speech source in a given mixed signal. Recent advancements in speech separation and ongoing research in this area, have made these approaches as promising techniques for pre-processing of naturalistic audio streams. After incorporating deep learning techniques into speech separation, performance on these systems is improving faster. The initial solutions introduced for deep learning based speech separation analyzed the speech signals into time-frequency domain with STFT; and then encoded mixed signals were fed into a deep neural network based separator. Most recently, new methods are introduced to separate waveform of the mixed signal directly without analyzing them using STFT. Here, we introduce a unified framework to include both spectrogram and waveform separations into a single structure, while being only different in the kernel function used to encode and decode the data; where, both can achieve competitive performance. This new framework provides flexibility; in addition, depending on the characteristics of the data, or limitations of the memory and latency can set the hyper-parameters to flow in a pipeline of the framework which fits the task properly. We extend single-channel speech separation into multi-channel framework with end-to-end training of the network while optimizing the speech separation criterion (i.e., Si-SNR) directly. We emphasize on how tied kernel functions for calculating spatial features, encoder, and decoder in multi-channel framework can be effective. We simulate spatialized reverberate data for both WSJ0 and LibriSpeech corpora here, and while these two sets of data are different in the matter of size and duration, the effect of capturing shorter and longer dependencies of previous/+future samples are studied in detail. We report SDR, Si-SNR and PESQ to evaluate the performance of developed solutions.
Submitted 16 December, 2019;
originally announced December 2019.
-
Gradient Perturbation is Underrated for Differentially Private Convex Optimization
Authors:
Da Yu,
Huishuai Zhang,
Wei Chen,
Tie-Yan Liu,
Jian Yin
Abstract:
Gradient perturbation, widely used for differentially private optimization, injects noise at every iterative update to guarantee differential privacy. Previous work first determines the noise level that can satisfy the privacy requirement and then analyzes the utility of noisy gradient updates as in the non-private case. In contrast, we explore how privacy noise affects optimization property. We show that for differentially private convex optimization, the utility guarantee of differentially private (stochastic) gradient descent is determined by an \emph{expected curvature} rather than the minimum curvature. The \emph{expected curvature}, which represents the average curvature over the optimization path, is usually much larger than the minimum curvature. By using the \emph{expected curvature}, we show that gradient perturbation can achieve a significantly improved utility guarantee that can theoretically justify the advantage of gradient perturbation over other perturbation methods. Finally, our extensive experiments suggest that gradient perturbation with the advanced composition method indeed outperforms other perturbation approaches by a large margin, matching our theoretical findings.
Submitted 26 October, 2020; v1 submitted 26 November, 2019;
originally announced November 2019.
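To make "injects noise at every iterative update" concrete, here is a minimal sketch of gradient descent with Gaussian noise added to each gradient. It is an illustration only, not the authors' algorithm: the function name `noisy_gradient_descent`, the gradient clipping step, and all parameter names are assumptions, and `sigma` is not calibrated to any particular privacy budget here.

```python
import random


def noisy_gradient_descent(grad_fn, w0, steps, lr, clip, sigma, seed=0):
    """Gradient descent where every update uses a clipped gradient plus
    independent Gaussian noise (hypothetical sketch of gradient perturbation).

    grad_fn: maps a list of floats to its gradient (list of floats).
    clip:    bound on the gradient's L2 norm (clipping is an assumption made
             here to bound sensitivity; the abstract does not mention it).
    sigma:   noise standard deviation; NOT calibrated to an (epsilon, delta) target.
    """
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(steps):
        g = grad_fn(w)
        norm = sum(x * x for x in g) ** 0.5
        scale = min(1.0, clip / norm) if norm > 0 else 1.0
        # Perturb each coordinate of the clipped gradient before stepping.
        w = [wi - lr * (scale * gi + rng.gauss(0.0, sigma))
             for wi, gi in zip(w, g)]
    return w


if __name__ == "__main__":
    # Toy convex objective f(w) = 0.5 * ||w||^2, whose gradient is w itself.
    print(noisy_gradient_descent(lambda w: list(w), [1.0, -2.0],
                                 steps=100, lr=0.1, clip=5.0, sigma=0.05))
```

With `sigma=0.0` this reduces to ordinary clipped gradient descent, which is a convenient sanity check on the toy objective.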
-
Graph-Revised Convolutional Network
Authors:
Donghan Yu,
Ruohong Zhang,
Zhengbao Jiang,
Yuexin Wu,
Yiming Yang
Abstract:
Graph Convolutional Networks (GCNs) have received increasing attention in the machine learning community for effectively leveraging both the content features of nodes and the linkage patterns across graphs in various applications. As real-world graphs are often incomplete and noisy, treating them as ground-truth information, which is a common practice in most GCNs, unavoidably leads to sub-optimal solutions. Existing efforts for addressing this problem either involve an over-parameterized model which is difficult to scale, or simply re-weight observed edges without dealing with the missing-edge issue. This paper proposes a novel framework called Graph-Revised Convolutional Network (GRCN), which avoids both extremes. Specifically, a GCN-based graph revision module is introduced for predicting missing edges and revising edge weights w.r.t. downstream tasks via joint optimization. A theoretical analysis reveals the connection between GRCN and previous work on multigraph belief propagation. Experiments on six benchmark datasets show that GRCN consistently outperforms strong baseline methods by a large margin, especially when the original graphs are severely incomplete or the labeled instances for model training are highly sparse.
Submitted 30 December, 2020; v1 submitted 16 November, 2019;
originally announced November 2019.
-
Enhanced Convolutional Neural Tangent Kernels
Authors:
Zhiyuan Li,
Ruosong Wang,
Dingli Yu,
Simon S. Du,
Wei Hu,
Ruslan Salakhutdinov,
Sanjeev Arora
Abstract:
Recent research shows that for training with $\ell_2$ loss, convolutional neural networks (CNNs) whose width (number of channels in convolutional layers) goes to infinity correspond to regression with respect to the CNN Gaussian Process kernel (CNN-GP) if only the last layer is trained, and correspond to regression with respect to the Convolutional Neural Tangent Kernel (CNTK) if all layers are trained. An exact algorithm to compute CNTK (Arora et al., 2019) yielded the finding that classification accuracy of CNTK on CIFAR-10 is within 6-7% of that of the corresponding CNN architecture (best figure being around 78%) which is interesting performance for a fixed kernel. Here we show how to significantly enhance the performance of these kernels using two ideas. (1) Modifying the kernel using a new operation called Local Average Pooling (LAP) which preserves efficient computability of the kernel and inherits the spirit of standard data augmentation using pixel shifts. Earlier papers were unable to incorporate naive data augmentation because of the quadratic training cost of kernel regression. This idea is inspired by Global Average Pooling (GAP), which we show for CNN-GP and CNTK is equivalent to full translation data augmentation. (2) Representing the input image using a pre-processing technique proposed by Coates et al. (2011), which uses a single convolutional layer composed of random image patches. On CIFAR-10, the resulting kernel, CNN-GP with LAP and horizontal flip data augmentation, achieves 89% accuracy, matching the performance of AlexNet (Krizhevsky et al., 2012). Note that this is the best such result we know of for a classifier that is not a trained neural network. Similar improvements are obtained for Fashion-MNIST.
Submitted 2 November, 2019;
originally announced November 2019.
-
Nonparametric principal subspace regression
Authors:
Mark Koudstaal,
Dengdeng Yu,
Dehan Kong,
Fang Yao
Abstract:
In scientific applications, multivariate observations often come in tandem with temporal or spatial covariates, with which the underlying signals vary smoothly. The standard approaches such as principal component analysis and factor analysis neglect the smoothness of the data, while multivariate linear or nonparametric regression fail to leverage the correlation information among multivariate response variables. We propose a novel approach named nonparametric principal subspace regression to overcome these issues. By decoupling the model discrepancy, a simple and general two-step framework is introduced, which leaves much flexibility in choice of model fitting. We establish theoretical property of the general framework, and offer implementation procedures that fulfill requirements and enjoy the theoretical guarantee. We demonstrate the favorable finite-sample performance of the proposed method through simulations and a real data application from an electroencephalogram study.
Submitted 12 October, 2019; v1 submitted 7 October, 2019;
originally announced October 2019.
-
Automating Data Monitoring: Detecting Structural Breaks in Time Series Data Using Bayesian Minimum Description Length
Authors:
Yingbo Li,
Robert Cezeaux,
Di Yu
Abstract:
In modern business modeling and analytics, data monitoring plays a critical role. Nowadays, sophisticated models often rely on hundreds or even thousands of input variables. Over time, structural changes such as abrupt level shifts or trend slope changes may occur among some of these variables, likely due to changes in economy or government policies. As a part of data monitoring, it is important to identify these changepoints, in terms of which variables exhibit such changes, and what time locations do the changepoints occur. Being alerted about the changepoints can help modelers decide if models need modification or rebuilds, while ignoring them may increase risks of model degrading. Simple process control rules often flag too many false alarms because regular seasonal fluctuations or steady upward or downward trends usually trigger alerts. To reduce potential false alarms, we create a novel statistical method based on the Bayesian Minimum Description Length (BMDL) framework to perform multiple change-point detection. Our method is capable of detecting all structural breaks occurred in the past, and automatically handling data with or without seasonality and/or autocorrelation. It is implemented with computation algorithms such as Markov chain Monte Carlo (MCMC), and can be applied to all variables in parallel. As an explainable anomaly detection tool, our changepoint detection method not only triggers alerts, but provides useful information about the structural breaks, such as the times of changepoints, and estimation of mean levels and linear slopes before and after the changepoints. This makes future business analysis and evaluation on the structural breaks easier.
Submitted 4 October, 2019;
originally announced October 2019.
-
Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks
Authors:
Sanjeev Arora,
Simon S. Du,
Zhiyuan Li,
Ruslan Salakhutdinov,
Ruosong Wang,
Dingli Yu
Abstract:
Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained under l2 loss by gradient descent with infinitesimally small learning rate (b) kernel regression with respect to so-called Neural Tangent Kernels (NTKs) (Jacot et al., 2018). An efficient algorithm to compute the NTK, as well as its convolutional counterparts, appears in Arora et al. (2019a), which allowed studying performance of infinitely wide nets on datasets like CIFAR-10. However, super-quadratic running time of kernel methods makes them best suited for small-data tasks. We report results suggesting neural tangent kernels perform strongly on low-data tasks.
1. On a standard testbed of classification/regression tasks from the UCI database, NTK SVM beats the previous gold standard, Random Forests (RF), and also the corresponding finite nets.
2. On CIFAR-10 with 10 - 640 training samples, Convolutional NTK consistently beats ResNet-34 by 1% - 3%.
3. On VOC07 testbed for few-shot image classification tasks on ImageNet with transfer learning (Goyal et al., 2019), replacing the linear SVM currently used with a Convolutional NTK SVM consistently improves performance.
4. Comparing the performance of NTK with the finite-width net it was derived from, NTK behavior starts at lower net widths than suggested by theoretical analysis (Arora et al., 2019a). NTK's efficacy may trace to lower variance of output.
Submitted 27 October, 2019; v1 submitted 3 October, 2019;
originally announced October 2019.
-
Simple and Effective Regularization Methods for Training on Noisily Labeled Data with Generalization Guarantee
Authors:
Wei Hu,
Zhiyuan Li,
Dingli Yu
Abstract:
Over-parameterized deep neural networks trained by simple first-order methods are known to be able to fit any labeling of data. Such over-fitting ability hinders generalization when mislabeled training examples are present. On the other hand, simple regularization methods like early-stopping can often achieve highly nontrivial performance on clean test data in these scenarios, a phenomenon not theoretically understood. This paper proposes and analyzes two simple and intuitive regularization methods: (i) regularization by the distance between the network parameters to initialization, and (ii) adding a trainable auxiliary variable to the network output for each training example. Theoretically, we prove that gradient descent training with either of these two methods leads to a generalization guarantee on the clean data distribution despite being trained using noisy labels. Our generalization analysis relies on the connection between wide neural network and neural tangent kernel (NTK). The generalization bound is independent of the network size, and is comparable to the bound one can get when there is no label noise. Experimental results verify the effectiveness of these methods on noisily labeled datasets.
Submitted 2 October, 2020; v1 submitted 27 May, 2019;
originally announced May 2019.
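Method (i) in this abstract penalizes how far the parameters drift from their initial values. A minimal sketch of one gradient step under such a penalty follows, assuming a squared Euclidean distance with weight `lam`. The exact form of the penalty, the function name `regularized_gd_step`, and the parameter names are assumptions for illustration, not details confirmed by the abstract.

```python
def regularized_gd_step(theta, theta0, grad_loss, lr, lam):
    """One gradient descent step on
        L(theta) + lam * ||theta - theta0||^2
    where theta0 holds the parameters at initialization.

    The squared-distance penalty is an assumed form; its gradient is
    2 * lam * (theta - theta0), which pulls theta back toward theta0.
    """
    g = grad_loss(theta)
    return [t - lr * (gi + 2.0 * lam * (t - t0))
            for t, t0, gi in zip(theta, theta0, g)]


if __name__ == "__main__":
    # Toy loss L(theta) = 0.5 * (theta[0] - 3)^2 with initialization at 0.
    theta0 = [0.0]
    theta = list(theta0)
    for _ in range(200):
        theta = regularized_gd_step(theta, theta0,
                                    lambda th: [th[0] - 3.0], lr=0.1, lam=0.5)
    print(theta)  # settles between the initialization (0) and the loss minimizer (3)
```

In the toy run, a larger `lam` keeps the iterate closer to the initialization and a smaller `lam` lets it move closer to the unregularized minimizer.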
-
Enhancing Domain Word Embedding via Latent Semantic Imputation
Authors:
Shibo Yao,
Dantong Yu,
Keli Xiao
Abstract:
We present a novel method named Latent Semantic Imputation (LSI) to transfer external knowledge into semantic space for enhancing word embedding. The method integrates graph theory to extract the latent manifold structure of the entities in the affinity space and leverages non-negative least squares with standard simplex constraints and power iteration method to derive spectral embeddings. It provides an effective and efficient approach to combining entity representations defined in different Euclidean spaces. Specifically, our approach generates and imputes reliable embedding vectors for low-frequency words in the semantic space and benefits downstream language tasks that depend on word embedding. We conduct comprehensive experiments on a carefully designed classification problem and language modeling and demonstrate the superiority of the enhanced embedding via LSI over several well-known benchmark embeddings. We also confirm the consistency of the results under different parameter settings of our method.
Submitted 21 May, 2019;
originally announced May 2019.
-
Encrypted Speech Recognition using Deep Polynomial Networks
Authors:
Shi-Xiong Zhang,
Yifan Gong,
Dong Yu
Abstract:
The cloud-based speech recognition/API provides developers or enterprises an easy way to create speech-enabled features in their applications. However, sending audio containing personal or company-internal information to the cloud raises privacy and security concerns. The recognition results generated in the cloud may also reveal some sensitive information. This paper proposes a deep polynomial network (DPN) that can be applied to the encrypted speech as an acoustic model. It allows clients to send their data in an encrypted form to the cloud to ensure that their data remains confidential, while the DPN can still make frame-level predictions over the encrypted speech and return them in encrypted form. One good property of the DPN is that it can be trained on unencrypted speech features in the traditional way. To keep the cloud away from the raw audio and recognition results, a cloud-local joint decoding framework is also proposed. We demonstrate the effectiveness of the model and framework on the Switchboard and Cortana voice assistant tasks, with small performance degradation and latency increase compared with the traditional cloud-based DNNs.
Submitted 10 May, 2019;
originally announced May 2019.
-
Decoupled Data Based Approach for Learning to Control Nonlinear Dynamical Systems
Authors:
Ran Wang,
Karthikeya Parunandi,
Dan Yu,
Dileep Kalathil,
Suman Chakravorty
Abstract:
This paper addresses the problem of learning the optimal control policy for a nonlinear stochastic dynamical system with continuous state space, continuous action space and unknown dynamics. This class of problems is typically addressed in the stochastic adaptive control and reinforcement learning literature using model-based and model-free approaches respectively. Both methods rely on solving a dynamic programming problem, either directly or indirectly, for finding the optimal closed loop control policy. The inherent `curse of dimensionality' associated with the dynamic programming method also makes these approaches computationally difficult.
This paper proposes a novel decoupled data-based control (D2C) algorithm that addresses this problem using a decoupled, `open loop - closed loop', approach. First, an open-loop deterministic trajectory optimization problem is solved using a black-box simulation model of the dynamical system. Then, a closed loop control is developed around this open loop trajectory by linearization of the dynamics about this nominal trajectory. By virtue of linearization, a linear quadratic regulator based algorithm can be used for this closed loop control. We show that the performance of the D2C algorithm is approximately optimal. Moreover, simulation performance suggests a significant reduction in training time compared to other state-of-the-art algorithms.
Submitted 17 April, 2019;
originally announced April 2019.
-
Stabilize Deep ResNet with A Sharp Scaling Factor $τ$
Authors:
Huishuai Zhang,
Da Yu,
Mingyang Yi,
Wei Chen,
Tie-Yan Liu
Abstract:
We study the stability and convergence of training deep ResNets with gradient descent. Specifically, we show that the parametric branch in the residual block should be scaled down by a factor $τ=O(1/\sqrt{L})$ to guarantee stable forward/backward process, where $L$ is the number of residual blocks. Moreover, we establish a converse result that the forward process is unbounded when $τ>L^{-\frac{1}{2}+c}$, for any positive constant $c$. The above two results together establish a sharp value of the scaling factor in determining the stability of deep ResNet. Based on the stability result, we further show that gradient descent finds the global minima if the ResNet is properly over-parameterized, which significantly improves over the previous work with a much larger range of $τ$ that admits global convergence. Moreover, we show that the convergence rate is independent of the depth, theoretically justifying the advantage of ResNet over vanilla feedforward network. Empirically, with such a factor $τ$, one can train deep ResNet without normalization layer. Moreover, for ResNets with normalization layer, adding such a factor $τ$ also stabilizes the training and obtains significant performance gain for deep ResNet.
Submitted 30 January, 2023; v1 submitted 17 March, 2019;
originally announced March 2019.
-
Winning Is Not Everything: A contextual analysis of hockey face-offs
Authors:
Nick Czuzoj-Shulman,
David Yu,
Christopher Boucher,
Luke Bornn,
Mehrsan Javan
Abstract:
This paper takes a different approach to evaluating face-offs in ice hockey. Instead of looking at win percentages, the de facto measure of successful face-off takers for decades, it focuses on the game events following the face-off and how directionality, clean wins, and player handedness play a significant role in creating value. This demonstrates that not all face-off wins are made equal: some players consistently create post-face-off value through clean wins and by directing the puck to high-value areas of the ice. As a result, we propose an expected events face-off model as well as a wins above expected model that take into account the value added on a face-off by targeting the puck to specific areas on the ice in various contexts, as well as the impact this has on subsequent game events.
Submitted 6 February, 2019;
originally announced February 2019.
-
Playing Fast Not Loose: Evaluating team-level pace of play in ice hockey using spatio-temporal possession data
Authors:
David Yu,
Christopher Boucher,
Luke Bornn,
Mehrsan Javan
Abstract:
Pace of play is an important characteristic in hockey as well as other team sports. We provide the first comprehensive study of pace within the sport of hockey, focusing on how teams and players impact pace in different regions of the ice, and the resultant effect on other aspects of the game.
First, we examined how pace of play varies across the surface of the rink, across different periods, at different manpower situations, between different professional leagues, and through time between seasons. Our analysis of pace by zone helps to explain some of the counter-intuitive results reported in prior studies. For instance, we show that the negative correlation between attacking speed and shots/goals is likely due to a large decline in attacking speed in the offensive zone (OZ).
We also studied how pace impacts the outcomes of various events. We found that pace is positively correlated with both high-danger zone entries (e.g. odd-man rushes) and higher shot quality. However, we found that passes with failed receptions occur at higher speeds than successful receptions. These findings suggest that increased pace is beneficial, but perhaps only up to a certain extent. Higher pace can create breakdowns in defensive structure and lead to better scoring chances but can also lead to more turnovers.
Finally, we analyzed team and player-level pace in the NHL, highlighting the considerable variability in how teams and players attack and defend against pace. Taken together, our results demonstrate that measures of team-level pace derived from spatio-temporal data are informative metrics in hockey and should prove useful in other team sports.
Submitted 5 February, 2019;
originally announced February 2019.
-
Unsupervised Speech Recognition via Segmental Empirical Output Distribution Matching
Authors:
Chih-Kuan Yeh,
Jianshu Chen,
Chengzhu Yu,
Dong Yu
Abstract:
We consider the problem of training speech recognition systems without using any labeled data, under the assumption that the learner can access only the input utterances and a phoneme language model estimated from a non-overlapping corpus. We propose a fully unsupervised learning algorithm that alternates between solving two sub-problems: (i) learning a phoneme classifier for a given set of phoneme segmentation boundaries, and (ii) refining the phoneme boundaries based on a given classifier. To solve the first sub-problem, we introduce a novel unsupervised cost function named Segmental Empirical Output Distribution Matching, which generalizes the work in (Liu et al., 2017) to segmental structures. For the second sub-problem, we develop an approximate MAP approach to refining the boundaries obtained from Wang et al. (2017). Experimental results on the TIMIT dataset demonstrate the success of this fully unsupervised phoneme recognition system, which achieves a phone error rate (PER) of 41.6%. Although it is still far away from the state-of-the-art supervised systems, we show that with oracle boundaries and a matching language model, the PER could be improved to 32.5%. This performance approaches the supervised system of the same model architecture, demonstrating the great potential of the proposed method.
Submitted 22 December, 2018;
originally announced December 2018.
-
A Comparison of Lattice-free Discriminative Training Criteria for Purely Sequence-Trained Neural Network Acoustic Models
Authors:
Chao Weng,
Dong Yu
Abstract:
In this work, three lattice-free (LF) discriminative training criteria for purely sequence-trained neural network acoustic models are compared on LVCSR tasks, namely maximum mutual information (MMI), boosted maximum mutual information (bMMI) and state-level minimum Bayes risk (sMBR). We demonstrate that, analogous to LF-MMI, a neural network acoustic model can also be trained from scratch using LF-bMMI or LF-sMBR criteria respectively without the need of cross-entropy pre-training. Furthermore, experimental results on Switchboard-300hrs and Switchboard+Fisher-2100hrs datasets show that models trained with LF-bMMI consistently outperform those trained with plain LF-MMI and achieve a relative word error rate (WER) reduction of 5% over competitive temporal convolution projected LSTM (TDNN-LSTMP) LF-MMI baselines.
Submitted 17 November, 2018; v1 submitted 8 November, 2018;
originally announced November 2018.
-
Hull Form Optimization with Principal Component Analysis and Deep Neural Network
Authors:
Dongchi Yu,
Lu Wang
Abstract:
Designing and modifying complex hull forms for optimal vessel performance has been a major challenge for naval architects. In the present study, Principal Component Analysis (PCA) is introduced to compress the geometric representation of a group of existing vessels, and the resulting principal scores are manipulated to generate a large number of derived hull forms, which are evaluated computationally for their calm-water performance. The results are subsequently used to train a Deep Neural Network (DNN) to accurately establish the relation between different hull forms and their associated performance. Then, based on the fast, parallel DNN-based hull-form evaluation, a large-scale search for optimal hull forms is performed.
Submitted 27 October, 2018;
originally announced October 2018.
-
An Alternative Approach to Functional Linear Partial Quantile Regression
Authors:
Dengdeng Yu,
Matthew Pietrosanu,
Ivan Mizera,
Bei Jiang,
Linglong Kong,
Wei Tu
Abstract:
Functional data such as curves and surfaces have become more and more common with modern technological advancements. The use of functional predictors remains challenging due to their inherent infinite-dimensionality. The common practice is to project functional data into a finite dimensional space. The popular partial least squares (PLS) method has been well studied for the functional linear model [1]. As an alternative, quantile regression provides a robust and more comprehensive picture of the conditional distribution of a response when it is non-normal, heavy-tailed, or contaminated by outliers. While partial quantile regression (PQR) was proposed in [2], no theoretical guarantees were provided due to the iterative nature of the algorithm and the non-smoothness of the quantile loss function. To address these issues, we propose an alternative PQR (APQR) formulation with guaranteed convergence. This novel formulation motivates new theories and allows us to establish asymptotic properties. Numerical studies on a benchmark dataset show the superiority of our new approach. We also apply our novel method to a functional magnetic resonance imaging (fMRI) dataset to predict attention deficit hyperactivity disorder (ADHD) and a diffusion tensor imaging (DTI) dataset to predict Alzheimer's disease (AD).
Submitted 30 January, 2023; v1 submitted 7 September, 2017;
originally announced September 2017.
-
Convergence Analysis of Optimization Algorithms
Authors:
HyoungSeok Kim,
JiHoon Kang,
WooMyoung Park,
SukHyun Ko,
YoonHo Cho,
DaeSung Yu,
YoungSook Song,
JungWon Choi
Abstract:
The regret bound of an optimization algorithm is one of the basic criteria for evaluating the performance of the given algorithm. By inspecting the differences between the regret bounds of traditional algorithms and adaptive ones, we provide a guide for choosing an optimizer with respect to the given data set and the loss function. For analysis, we assume that the loss function is convex and its gradient is Lipschitz continuous.
Submitted 6 July, 2017;
originally announced July 2017.
-
Sparse Wavelet Estimation in Quantile Regression with Multiple Functional Predictors
Authors:
Dengdeng Yu,
Li Zhang,
Ivan Mizera,
Bei Jiang,
Linglong Kong
Abstract:
In this manuscript, we study quantile regression in the partial functional linear model, where the response is scalar and the predictors include both scalars and multiple functions. Wavelet bases are adopted to better approximate functional slopes while effectively detecting local features. The sparse group lasso penalty is imposed to select important functional predictors while capturing shared information among them. The estimation problem can be reformulated into a standard second-order cone program and then solved by an interior point method. We also give a novel algorithm using the alternating direction method of multipliers (ADMM), which was recently employed by many researchers in solving penalized quantile regression problems. Asymptotic properties such as the convergence rate and prediction error bound have been established. Simulations and a real dataset from the ADHD-200 fMRI data are investigated to show the superiority of our proposed method.
Submitted 2 December, 2017; v1 submitted 7 June, 2017;
originally announced June 2017.
-
Easily parallelizable and distributable class of algorithms for structured sparsity, with optimal acceleration
Authors:
Seyoon Ko,
Donghyeon Yu,
Joong-Ho Won
Abstract:
Many statistical learning problems can be posed as minimization of a sum of two convex functions, one typically a composition of non-smooth and linear functions. Examples include regression under structured sparsity assumptions. Popular algorithms for solving such problems, e.g., ADMM, often involve non-trivial optimization subproblems or smoothing approximation. We consider two classes of primal-dual algorithms that do not incur these difficulties, and unify them from a perspective of monotone operator theory. From this unification we propose a continuum of preconditioned forward-backward operator splitting algorithms amenable to parallel and distributed computing. For the entire region of convergence of the whole continuum of algorithms, we establish its rates of convergence. For some known instances of this continuum, our analysis closes the gap in theory. We further exploit the unification to propose a continuum of accelerated algorithms. We show that the whole continuum attains the theoretically optimal rate of convergence. The scalability of the proposed algorithms, as well as their convergence behavior, is demonstrated up to 1.2 million variables with a distributed implementation.
Submitted 19 June, 2018; v1 submitted 20 February, 2017;
originally announced February 2017.
-
Partial Functional Linear Quantile Regression for Neuroimaging Data Analysis
Authors:
Dengdeng Yu,
Linglong Kong,
Ivan Mizera
Abstract:
We propose a prediction procedure for the functional linear quantile regression model by using partial quantile covariance techniques and develop a simple partial quantile regression (SIMPQR) algorithm to efficiently extract the partial quantile regression (PQR) basis for estimating functional coefficients. We further extend our partial quantile covariance techniques to functional composite quantile regression (CQR) by defining partial composite quantile covariance. There are three major contributions. (1) We define partial quantile covariance between two scalar variables through linear quantile regression. We compute the PQR basis by sequentially maximizing the partial quantile covariance between the response and projections of functional covariates. (2) In order to efficiently extract the PQR basis, we develop a SIMPQR algorithm analogous to simple partial least squares (SIMPLS). (3) Under the homoscedasticity assumption, we extend our techniques to partial composite quantile covariance and use it to find the partial composite quantile regression (PCQR) basis. The SIMPQR algorithm is then modified to obtain the SIMPCQR algorithm. Two simulation studies show the superiority of our proposed methods. Two real datasets, from the ADHD-200 sample and ADNI, are analyzed using our proposed methods.
Submitted 2 November, 2015;
originally announced November 2015.
-
High-dimensional Fused Lasso Regression using Majorization-Minimization and Parallel Processing
Authors:
Donghyeon Yu,
Joong-Ho Won,
Taehoon Lee,
Johan Lim,
Sungroh Yoon
Abstract:
In this paper, we propose a majorization-minimization (MM) algorithm for high-dimensional fused lasso regression (FLR) suitable for parallelization using graphics processing units (GPUs). The MM algorithm is stable and flexible as it can solve the FLR problems with various types of design matrices and penalty structures within a few tens of iterations. We also show that the convergence of the proposed algorithm is guaranteed. We conduct numerical studies to compare our algorithm with other existing algorithms, demonstrating that the proposed MM algorithm is competitive in many settings including the two-dimensional FLR with arbitrary design matrices. The merit of GPU parallelization is also exhibited.
Submitted 14 December, 2013; v1 submitted 8 June, 2013;
originally announced June 2013.
-
Monotone false discovery rate
Authors:
Joong-Ho Won,
Johan Lim,
Donghyeon Yu,
Byung Soo Kim,
Kyunga Kim
Abstract:
This paper proposes a procedure to obtain monotone estimates of both the local and the tail false discovery rates that arise in large-scale multiple testing. The proposed monotonization is asymptotically optimal for controlling the false discovery rate and also has many attractive finite-sample properties.
Submitted 13 December, 2013; v1 submitted 27 May, 2013;
originally announced May 2013.
-
Regression shrinkage and grouping of highly correlated predictors with HORSES
Authors:
Woncheol Jang,
Johan Lim,
Nicole A. Lazar,
Ji Meng Loh,
Donghyeon Yu
Abstract:
Identifying homogeneous subgroups of variables can be challenging in high dimensional data analysis with highly correlated predictors. We propose a new method called Hexagonal Operator for Regression with Shrinkage and Equality Selection, HORSES for short, that simultaneously selects positively correlated variables and identifies them as predictive clusters. This is achieved via a constrained least-squares problem with regularization that consists of a linear combination of an $L_1$ penalty for the coefficients and another $L_1$ penalty for pairwise differences of the coefficients. This specification of the penalty function encourages grouping of positively correlated predictors combined with a sparsity solution. We construct an efficient algorithm to implement the HORSES procedure. We show via simulation that the proposed method outperforms other variable selection methods in terms of prediction error and parsimony. The technique is demonstrated on two data sets, a small data set from analysis of soil in Appalachia, and a high dimensional data set from a near infrared (NIR) spectroscopy study, showing the flexibility of the methodology.
Submitted 1 February, 2013;
originally announced February 2013.